Parsing XML

There are two common ways to parse XML. One is to load the entire document, parse it into a tree structure, and return the tree as an object you can manipulate. The other is to treat an XML document as a sequence of events: open tag events, character events, close tag events and others. The event-based model is the one we will explain here.

You should NEVER attempt to parse XML using regular expressions. It may work for a particular input file in a particular format, but it's a very brittle solution at best. One of the most frequent reasons people resort to regular expressions is that XML parsers cannot deal with malformed input. Even if your programs read and write only well-formed XML, you may need to deal with outside data that's actually SGML or pseudo-SGML, or at worst free text with tags in angle brackets thrown in ad hoc. I suggest some methods for dealing with input like this below.

Event-Driven XML Parsing

Parsing an XML document as a sequence of events means we don't have to keep the entire document in memory. This is usually the only feasible way to parse very large XML documents.

In Java, I like to use the org.xml.sax package, which has been included with the JDK since 1.4. It allows us to pass a handler object that has callback methods for the different kinds of events. Events include the start and end of the document, the start and end of an element, character data, comments and a few less common things like CDATA sections, processing instructions and entity references. org.xml.sax is an implementation of SAX, the Simple API for XML. For more information about org.xml.sax, see the javadoc.
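
As a rough illustration of the handler-with-callbacks idea, here is a minimal sketch that records each event as a string. The class name SaxEventLogger, its parse helper and the event labels are made up for this example; only DefaultHandler, SAXParserFactory, SAXParser and InputSource come from the JDK. It assumes the default, non-namespace-aware parser configuration, so element names arrive in the qName argument.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: collects a readable label for each SAX event.
public class SaxEventLogger extends DefaultHandler {
    private final List<String> events = new ArrayList<>();

    @Override
    public void startDocument() {
        events.add("startDocument");
    }

    @Override
    public void endDocument() {
        events.add("endDocument");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("start:" + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end:" + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver one run of text in several chunks;
        // this sketch simply records each non-blank chunk separately.
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            events.add("chars:" + text);
        }
    }

    // Hypothetical helper: parses an XML string and returns the recorded events.
    public static List<String> parse(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            SaxEventLogger handler = new SaxEventLogger();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.events;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        for (String event : parse("<doc><title>Hello</title></doc>")) {
            System.out.println(event);
        }
    }
}
```

Extending DefaultHandler means we override only the callbacks we care about; the rest keep their empty default behaviour. Note that nothing here builds a tree: the handler sees each event once, in document order, and decides what (if anything) to keep.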

First Steps - Seeing SAX Events

Content Extraction - Presenting Newswire Text

Content Modification - Translating Chinese Dates

Dealing with Malformed Input