There are two common ways to parse XML. One is to load the entire document, parse it into a tree structure, and return the tree as a manipulable object. The other is to view an XML document as a sequence of events: open-tag events, character events, close-tag events and others. The event-based model is the one we will explain here.
You should NEVER attempt to parse XML using regular expressions. It may work for a particular input file in a particular format, but it's a very brittle solution at best. One of the most frequent reasons for using regular expressions is the inability of XML parsers to deal with malformed input. Even if your programs read and write only well-formed XML, you may need to deal with outside data that's actually SGML or pseudo-SGML, or at worst free text with tags in angle brackets thrown in ad hoc. I suggest some methods for dealing with input like this below.
Parsing an XML document as a sequence of events means we don't have to keep the entire document in memory. This is usually the only feasible way to parse very large XML documents.
In Java, I like to use the org.xml.sax package, which has been included with the JDK since 1.4. It allows us to pass a handler object that has callback methods to handle different kinds of events. Events include the start and end of the document, the start and end of an element, character data, and a few less common things such as processing instructions and skipped entities. Comments, CDATA section boundaries and entity references are reported too, but only to a handler that implements the optional LexicalHandler interface in org.xml.sax.ext. org.xml.sax is the Java API for SAX, the Simple API for XML. For more information about org.xml.sax, see the javadoc.
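To make the callback model concrete, here is a minimal sketch of a handler that simply records the events it receives. The names SaxEventDemo, RecordingHandler and collectEvents are made up for this illustration; the sketch assumes the parser bundled with the JDK, obtained through javax.xml.parsers.SAXParserFactory, and it trims whitespace-only text to keep the output readable.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: records each SAX event as a short string.
class SaxEventDemo {

    // DefaultHandler provides empty implementations of every callback,
    // so we override only the events we are interested in.
    static class RecordingHandler extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startDocument() {
            events.add("startDocument");
        }

        @Override
        public void endDocument() {
            events.add("endDocument");
        }

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attrs) {
            events.add("start:" + qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("end:" + qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // A parser is allowed to deliver one run of text in several
            // chunks, so real code should buffer until endElement.
            String text = new String(ch, start, length).trim();
            if (!text.isEmpty()) {
                events.add("text:" + text);
            }
        }
    }

    // Parses the given XML string and returns the recorded events in order.
    static List<String> collectEvents(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            RecordingHandler handler = new RecordingHandler();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.events;
        } catch (Exception e) {
            // Wrap checked parser and I/O exceptions for brevity in this sketch.
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        for (String event : collectEvents("<a><b>hi</b></a>")) {
            System.out.println(event);
        }
    }
}
```

Nothing in the handler holds on to the document itself. It sees one event at a time and keeps only what it chooses to keep, which is why this style scales to very large inputs.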