Let's dive right in with an example of some real XML:
<backslash>
<story>
<title>Israel extends offensive in Gaza</title>
<url>http://www.indybay.org/news/2004/10/1699220.php</url>
<time date="2004-10-13" time="06:34:34" />
<author>BBC (reposted)</author>
<department>text/plain</department>
<topic>Palestine International</topic>
<comments>0</comments>
<section>Global</section>
<body>
<seg>The Israeli commander in Gaza said his troops met
significant resistance from homemade bombs and anti-tank
rockets.</seg>
<seg>A member of Hamas militant movement was killed
and two others were wounded by an Israeli missile fired at
the town.</seg>
</body>
<image/>
</story>
<story>
<title>Iraqi civilians and police die in Ramadi</title>
<url>http://www.indybay.org/news/2004/10/1699219.php</url>
<time date="2004-10-13" time="06:31:44" />
<author>ALJ</author>
<department>text/plain</department>
<topic>Iraq International</topic>
<comments>0</comments>
<section>Global</section>
<image/>
<body>
<seg>Two Iraqi civilians have died after a mortar
bomb landed on a house near a regional government building
in the city of Ramadi. The deaths came as US troops moved
in to the town early on Wednesday, sealing off streets and
searching buildings.</seg>
</body>
</story>
</backslash>
This looks a lot like HTML with funny node names, but there are some important things to note:
An element with no content can be written as a single self-closing tag.
For example, <image/>
is the same as <image></image>
Attribute values must always be quoted. For example,
<time date="2004-10-13" time="06:31:44" />
is acceptable. <time date=2004-10-13 time=06:31:44 /> is not.
Once we open an element, it must be closed before any of its outer elements are closed. So for example <b><i>bold italic text</b></i> is not well-formed XML - it must be written as <b><i>bold italic text</i></b>
A document must have exactly one outermost (root) element that contains everything else.
In the example above it's
<backslash>...</backslash>. We couldn't have
a single XML document containing only <story>...</story><story>...</story>.
Note that the above two constraints mean that all XML documents have an explicit tree structure. In our examples, we won't really be using this, but many XML parsers work by building the complete document tree and returning this to the programmer as an object.
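To make the tree structure concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The document is a trimmed-down copy of the example above, and the helper name outline is just an illustration, not part of any library.

```python
import xml.etree.ElementTree as ET

# A trimmed-down version of the example document (fewer fields per story).
DOC = """<backslash>
<story>
<title>Israel extends offensive in Gaza</title>
<time date="2004-10-13" time="06:34:34" />
<image/>
</story>
<story>
<title>Iraqi civilians and police die in Ramadi</title>
<time date="2004-10-13" time="06:31:44" />
<image/>
</story>
</backslash>"""


def outline(element, depth=0):
    """Return one indented line per element, walking the tree depth-first."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(DOC)   # builds the complete tree in memory
tree_lines = outline(root)
print("\n".join(tree_lines))

# Attributes are exposed per element; here we collect each story's time.
times = [story.find("time").get("time") for story in root.findall("story")]
print(times)
```

The single root element is what makes ET.fromstring able to hand back one object, and proper nesting is what makes every other element have exactly one parent.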
Literal <, > or & characters
must be written as &lt; &gt;
and &amp; respectively.
Note that the semicolon is required. These are the only characters that are required to be escaped. (This is actually not quite true - you can have literal < > and & inside a CDATA section but we won't worry about that for now. Also, strictly speaking, > isn't required to be escaped, but it's good practice to do so.)
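As a small illustration of the escaping rule, the sketch below uses escape from Python's standard xml.sax.saxutils module to turn literal <, > and & into &lt;, &gt; and &amp;, then checks that a parser turns them back. The sample title text is made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical text containing all three special characters.
raw_title = "Profits < losses & R&D > marketing"

# escape() replaces &, < and > with their entity references.
escaped = escape(raw_title)
print(escaped)

# The escaped text is safe to place inside an element...
parsed = ET.fromstring("<title>" + escaped + "</title>")
print(parsed.text == raw_title)

# ...whereas the raw text is rejected by the parser.
try:
    ET.fromstring("<title>" + raw_title + "</title>")
    unescaped_ok = True
except ET.ParseError:
    unescaped_ok = False
print(unescaped_ok)
```

Doing the escaping with a library function rather than by hand makes it harder to forget the semicolon or to escape & in the wrong order.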
If you're unfamiliar with Unicode, this would be a good time to skim the material on Character Sets and Encodings.
XML is defined in terms of a sequence of Unicode characters. The XML parser needs to know how to interpret the bytes of your document that way.
At the beginning of an XML document there can be an XML declaration that tells the parser which character encoding the document uses. If there is no XML declaration, the document is assumed to be encoded as UTF-8 (or UTF-16, if it begins with a byte order mark).
For example, the following says that the document is ISO-8859-1, commonly used to write Western European languages such as English, French, German, Spanish and many others.
<?xml version="1.0" encoding='ISO-8859-1'?>
One common pitfall is to assume that ISO-8859-1 is compatible with UTF-8. Each character in Unicode is assigned a code point from U+0000 to U+10FFFF, and it is true that the code points U+0000 to U+00FF are for the same characters as the bytes 0x00 to 0xFF in ISO-8859-1. However, U+0080 to U+00FF are not encoded as the single bytes 0x80 to 0xFF when you use UTF-8; each of them takes two bytes. (The same goes for any other Unicode encoding format - there's more than one way to encode Unicode! Confusing, yes?) You can read the Wikipedia article on UTF-8 to learn more about how UTF-8 encodes Unicode.
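The difference is easy to see by encoding the same text both ways. This is a minimal sketch. The sample word is arbitrary, chosen because é is U+00E9 and so falls in the U+0080 to U+00FF range.

```python
text = "café"   # é is U+00E9

latin1_bytes = text.encode("iso-8859-1")   # b'caf\xe9'     - one byte for é
utf8_bytes = text.encode("utf-8")          # b'caf\xc3\xa9' - two bytes for é
print(latin1_bytes, utf8_bytes)

# Reading UTF-8 bytes as if they were ISO-8859-1 silently gives the wrong text.
misread = utf8_bytes.decode("iso-8859-1")
print(misread)

# Reading ISO-8859-1 bytes as if they were UTF-8 fails outright,
# because a lone 0xE9 is not a valid UTF-8 sequence.
try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)
```

Plain ASCII text would come out byte-for-byte identical in both encodings, which is why the problem often goes unnoticed until the first accented character turns up.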
There are a few things, such as most ASCII control characters, that are technically valid Unicode but are either discouraged or prohibited in XML. Section 2.2 of the XML spec lists them. I've written a Perl script that will list and remove any of these characters from a file.
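For readers who would rather see the idea than the script, here is a rough Python sketch of the same kind of check. It is not the Perl script mentioned above. It only covers the characters that the Char production in section 2.2 of XML 1.0 prohibits outright (not the merely discouraged ones), and the function names are made up for illustration.

```python
import re

# Matches any character NOT allowed by the XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_PROHIBITED = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_prohibited(text):
    """List (offset, code point) for every prohibited character in text."""
    return [(m.start(), "U+%04X" % ord(m.group()))
            for m in _PROHIBITED.finditer(text)]


def strip_prohibited(text):
    """Return text with every prohibited character removed."""
    return _PROHIBITED.sub("", text)


# Hypothetical input: BEL and form feed are prohibited, tab is allowed.
sample = "bell\x07 and form feed\x0c but tab\tis fine"
found = find_prohibited(sample)
cleaned = strip_prohibited(sample)
print(found)
print(repr(cleaned))
```

Note that this works on decoded text, so the file has to be read with the right encoding first.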
What this means is that if you aren't using UTF-8 and there are characters outside of plain old 7-bit ASCII (which is represented the same in UTF-8 as in all the ISO-8859 encodings and nearly every other encoding), you are required to specify the encoding.
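Here is a short sketch of what that requirement looks like in practice, using Python's standard xml.etree.ElementTree (which is built on the expat parser). The author name is invented for the example. The same ISO-8859-1 bytes are parsed once with a declaration and once without.

```python
import xml.etree.ElementTree as ET

# Hypothetical content, stored as ISO-8859-1 bytes (é becomes the byte 0xE9).
body = "<author>José</author>".encode("iso-8859-1")

# With the declaration, the parser knows how to interpret byte 0xE9.
declared = b"<?xml version=\"1.0\" encoding='ISO-8859-1'?>" + body
author = ET.fromstring(declared).text
print(author)

# Without it, the parser assumes UTF-8, where a lone 0xE9 is invalid.
try:
    ET.fromstring(body)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False
print(undeclared_ok)

# Pure 7-bit ASCII needs no declaration at all.
ascii_author = ET.fromstring(b"<author>ALJ</author>").text
print(ascii_author)
```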
IANA provides a list of character sets and encodings; also see JBrowse's list of character sets. Not all XML parsers understand all encodings, but all of them understand at least UTF-8. Also, different XML parsers may know the same encoding by two different names. Your best bet is to use a tool like uconv from ICU to convert to UTF-8 rather than trying to work with XML encoded as something other than UTF-8.
XML was designed to be easy to parse compared to its precursor SGML. Because of this, there is no such thing as XML that doesn't obey the above rules - any data that doesn't obey them is called non-well-formed and is simply not XML. There are a variety of other constraints that you can read about in the XML specification, but the above ones are the most important.
The point is so important that it bears repeating: All XML is well-formed - anything not well-formed is not XML. If you give an XML parser stuff that isn't XML, it won't attempt to figure out what you meant - it will just throw a fatal error.
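To see this behaviour first-hand, the sketch below feeds a few snippets that break the rules above to Python's standard xml.etree.ElementTree parser. The helper name check is an illustration only. The exact error messages depend on the underlying parser, so none are shown here.

```python
import xml.etree.ElementTree as ET

samples = {
    "well-formed": "<b><i>bold italic text</i></b>",
    "bad nesting": "<b><i>bold italic text</b></i>",
    "unquoted attribute": "<time date=2004-10-13 />",
    "two root elements": "<story/><story/>",
}


def check(text):
    """Return None if text parses, otherwise the parser's error message."""
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as err:
        return str(err)


results = {name: check(text) for name, text in samples.items()}
for name, error in results.items():
    print(name, "->", error or "OK")
```

Only the first snippet parses. For the other three the parser raises an error instead of guessing what was meant.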
HTML on the web rarely obeys any of the well-formedness constraints and so is not in general XML. All current web browsers try their best to parse even very poorly written HTML. XML parsers don't. Once an error is encountered, an XML parser is REQUIRED to stop parsing except (at its option) to find additional errors. Any program that does not do this is by definition not an XML parser. XML is designed to be simple and unambiguous. If XML parsers had the option of parsing non-well-formed input, different parsers would do different things, nobody would bother writing well-formed XML, and we'd have the same mess we do with HTML.
This can be really annoying, but if you work with the XML parser instead of trying to work around it then you can catch problems with your data and bugs in your program early on. It's even more frustrating if your initial data is something you have no control over, but later on I suggest some ways to clean things up automatically.