What is XML?

Let's dive right in with an example of some real XML:

<backslash>
   <story>
      <title>Israel extends offensive in Gaza</title>
      <url>http://www.indybay.org/news/2004/10/1699220.php</url>
      <time date="2004-10-13" time="06:34:34" />
      <author>BBC (reposted)</author>
      <department>text/plain</department>
      <topic>Palestine International</topic>
      <comments>0</comments>
      <section>Global</section>
      <body>
          <seg>The Israeli commander in Gaza said his troops met
          significant resistance from homemade bombs and anti-tank
          rockets.</seg>
          <seg>A member of Hamas militant movement was killed
          and two others were wounded by an Israeli missile fired at
          the town.</seg>
      </body>
      <image/>
   </story>
   <story>
       <title>Iraqi civilians and police die in Ramadi</title>
       <url>http://www.indybay.org/news/2004/10/1699219.php</url>
       <time date="2004-10-13" time="06:31:44" />
       <author>ALJ</author>
       <department>text/plain</department>
       <topic>Iraq International</topic>
       <comments>0</comments>
       <section>Global</section>
       <image/>
       <body>
           <seg>Two Iraqi civilians have died after a mortar
           bomb landed on a house near a regional government building
           in the city of Ramadi. The deaths came as US troops moved
           in to the town early on Wednesday, sealing off streets and
           searching buildings.</seg>
       </body>
   </story>
</backslash>

This looks a lot like HTML with funny node names, but there are some important things to note:

XML was designed to be easier to parse than its precursor, SGML. Because of this, there is no such thing as XML that doesn't obey the above rules - any data that doesn't obey them is called non-well-formed and is simply not XML. There are a variety of other constraints that you can read about in the XML specification, but the above ones are the most important.

The point is so important that it bears repeating: All XML is well-formed - anything not well-formed is not XML. If you give an XML parser stuff that isn't XML, it won't attempt to figure out what you meant - it will just throw a fatal error.
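To make the fatal error concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The helper name check_well_formed and the two sample strings are my own, made up for illustration; the exact wording of the error message depends on the underlying parser, so don't rely on it.

```python
import xml.etree.ElementTree as ET


def check_well_formed(text):
    """Return None if text parses as XML, otherwise the parser's error message.

    Hypothetical helper for illustration only.
    """
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # The parser does not guess what was meant; it refuses the input.
        return str(err)
    return None


# Illustrative inputs, loosely modelled on the <story> example above.
good = "<story><title>Example</title><image/></story>"
bad = "<story><title>Example</story>"  # <title> is never closed

print(check_well_formed(good))  # no error for well-formed input
print(check_well_formed(bad))   # an error message for the broken input
```

Note that the parser gives you nothing at all for the second string: no partial tree, no best guess, just the error.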

HTML on the web rarely obeys any of the well-formedness constraints and so is not, in general, XML. All current web browsers try their best to parse even very poorly written HTML. XML parsers don't. Once a well-formedness error is encountered, an XML parser is REQUIRED to stop normal processing; it may continue (at its option) only to look for additional errors. Any program that does not do this is by definition not an XML parser. XML is designed to be simple and unambiguous. If XML parsers had the option of parsing non-well-formed input, different parsers would do different things, nobody would bother writing well-formed XML, and we'd have the same mess we do with HTML.
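The difference is easy to see side by side. The sketch below feeds the same sloppy markup to Python's lenient html.parser module and to the XML parser in xml.etree.ElementTree. Treat html.parser as a rough stand-in for a forgiving HTML consumer, not as a model of what any particular browser does; the TagCollector class and the sample string are assumptions of mine.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TagCollector(HTMLParser):
    """Hypothetical helper: records every start tag the HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


# Typical sloppy HTML: unclosed <p> and <b> elements.
sloppy = "<p>First paragraph<p>Second <b>bold text</p>"

# The lenient HTML parser reports what it sees and carries on.
collector = TagCollector()
collector.feed(sloppy)
collector.close()
html_tags = collector.start_tags

# The XML parser rejects the same input outright.
try:
    ET.fromstring(sloppy)
    xml_ok = True
except ET.ParseError:
    xml_ok = False

print("HTML parser saw start tags:", html_tags)
print("XML parser accepted input:", xml_ok)
```

The HTML side happily hands back tags from a document that has no sensible tree structure, which is exactly the kind of guessing that XML rules out.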

This can be really annoying, but if you work with the XML parser instead of trying to work around it, you can catch problems with your data and bugs in your program early on. It's even more frustrating if your initial data is something you have no control over, but later on I suggest some ways to clean things up automatically.