XML parsing

Data Splitter can parse XML (Extensible Markup Language).   Consider this "book data" sample :

   <book_list>
       <book ord="1">
           <book_id>1234456</book_id>
           <title>Isn't Life Amazing?</title>
           <additional_info>
               <price>19.99</price>
               <evaluation>So readable!</evaluation>
           </additional_info>
       </book>
   </book_list>
		

The parser for "book_list" XML looks like this :

screen shot: XML parser, book data, top level

Patterns recognized :

  *                 Pattern      zero or more of any value
  WS                Pattern      zero or more whitespace characters (blanks / tabs / carriage returns / line feeds)
  MainData          String set   maps book data (<book_id>, <title>) to destinations (e.g. database fields)
  XMLTag            Node group   parses XML tags that are to be ignored (generic)
  ClosingTag        Node group   simple XML closing tag parser (generic)
  additional_info   Node group   parses <additional_info> (<price>, <evaluation>)

XML parsers make extensive use of node groups:  "XMLTag", "ClosingTag" and "additional_info" in this example.   Node groups are the natural way to handle "nested" XML tags using Data Splitter.

Tag content can be a sequence of XML tags.   From a parser's standpoint, this means :
  some whitespace
  followed by a tag
  followed by some more whitespace
  followed by another tag
  followed by ...

So, tag sequences all begin by looking for "WS" (whitespace).   In general, "WS" can be followed by three types of items, which must be attempted in this order :

  1. tags of interest,
  2. the closing tag (ClosingTag, or "</book>" in this example),
  3. tags to be ignored (XMLTag).

The order is important here because "XMLTag" is a "catch-all" node:  it matches both the "tags of interest" and the closing tag, so it would consume them if it were tried first.   The link to it must therefore be attempted last, i.e. have the highest link number.   The best way to handle this, after adding all nodes, is to set the number of the link to "XMLTag" to some large number, e.g. 999;  Data Splitter will then compute the correct (largest) link number for you.
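
The effect of the link order can be illustrated outside Data Splitter with a small Python sketch.   This is not how Data Splitter is implemented;  the regular expressions and the names "LINKS" and "classify" are assumptions made for the illustration.   The list order stands in for the link numbers, with the catch-all last :

```python
import re

# Hypothetical stand-ins for the three kinds of link leaving the WS node.
TAGS_OF_INTEREST = re.compile(r"<(book_id|title)>")  # plays the role of MainData
CLOSING_TAG = re.compile(r"</[^>]*>")                # plays the role of ClosingTag
ANY_TAG = re.compile(r"<[^>]*>")                     # plays the role of XMLTag (catch-all)

# List order stands in for link numbers: lower index = attempted earlier.
LINKS = [
    ("interest", TAGS_OF_INTEREST),
    ("closing", CLOSING_TAG),
    ("ignored", ANY_TAG),  # must stay last: it also matches the two above
]

def classify(text, pos=0):
    """Skip whitespace (the WS step), then name the first link that matches."""
    while pos < len(text) and text[pos].isspace():
        pos += 1
    for name, pattern in LINKS:
        if pattern.match(text, pos):
            return name
    return None

print(classify("  <title>Isn't Life Amazing?</title>"))
print(classify("\n</book>"))
print(classify("  <publisher>"))
```

Because "ANY_TAG" also matches "<title>" and "</book>", moving it to the front of "LINKS" would classify every tag as "ignored", which is the situation the link numbering is meant to prevent.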

In the above example, the contents of the "<book>" tag are WS followed by :

  1. MainData,
  2. additional_info,
  3. the closing tag, "</book>",
  4. tags to be ignored (XMLTag).

If you were to add another category, say "additional_info_2", you would create a new "additional_info_2" node group (discussed below), add a new "additional_info_2" node, link the WS node to it, and link the new node back to the WS node.   You would then set the "red" link to the "XMLTag" node to a large number, which Data Splitter would automatically recompute as 5.
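
The renumbering arithmetic can be sketched in Python.   The rule used here (sort the links by requested number, then number them 1..n) is an assumption that reproduces the result described above, not a statement of Data Splitter's actual algorithm;  "renumber_links" is a hypothetical helper, and the starting number 5 for the new link is also assumed :

```python
def renumber_links(links):
    """links maps node name -> requested link number; returns numbers 1..n in the same order."""
    ordered = sorted(links, key=lambda name: links[name])
    return {name: number for number, name in enumerate(ordered, start=1)}

# Links leaving the WS node in the <book> example.
links = {"MainData": 1, "additional_info": 2, "ClosingTag": 3, "XMLTag": 4}

links["additional_info_2"] = 5  # assumed: a new link is appended after the existing ones
links["XMLTag"] = 999           # push the catch-all behind everything else

links = renumber_links(links)
print(links)
```

After renumbering, "XMLTag" holds link number 5 and every other node is attempted before it.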

Node group "XMLTag" is necessary because all XML data must be parsed so that the parser can keep track of "where it's at".   Node group "XMLTag" recursively parses XML data that is to be ignored, i.e. the "everything else" category.   In most cases a WS node will be paired with an "XMLTag" node.

Here's the "additional_info" node group :

screen shot: XML parser - general-purpose tag parser

Note that "additional_info" also follows the rules regarding whitespace and XMLTag.   The XML tags to be parsed are defined in string set "AdditionalData" :

  Text            Other text    Target
  <price>                       books.price
  <evaluation>                  books.evaluation

The "AdditionalData" node corresponds to the tags of interest, defined in the string set (above).   The * node corresponds to the actual data content, so it's "where the action is", the action in this case being "Send to: AdditionalData".   See the string sets topic for more information.
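
As a rough analogy, the string set behaves like a lookup table from tag to target.   The Python sketch below assumes that reading;  the dictionary "ADDITIONAL_DATA" and the function "extract_additional_info" are hypothetical names, and the "(.*?)" group plays the part of the * node that captures the data content :

```python
import re

# Assumed equivalent of the "AdditionalData" string set: tag text -> target.
ADDITIONAL_DATA = {
    "<price>": "books.price",
    "<evaluation>": "books.evaluation",
}

def extract_additional_info(xml_fragment):
    """Return {target: content} for every tag listed in ADDITIONAL_DATA."""
    record = {}
    for tag, target in ADDITIONAL_DATA.items():
        closing = "</" + tag[1:]  # "<price>" -> "</price>"
        pattern = re.escape(tag) + r"(.*?)" + re.escape(closing)
        match = re.search(pattern, xml_fragment, re.DOTALL)
        if match:
            record[target] = match.group(1)  # the content matched by the * node
    return record

fragment = (
    "<additional_info>"
    "<price>19.99</price>"
    "<evaluation>So readable!</evaluation>"
    "</additional_info>"
)
record = extract_additional_info(fragment)
print(record)
```

Tags that are not in the table are simply not captured, which mirrors the way unlisted tags fall through to "XMLTag".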


Now let's examine the "XMLTag" node group.   It performs no actions other than to "Return" when the final ">" is encountered.   Note that it is "recursive", i.e. it refers to itself - observe the "NestedTag" node.

screen shot: XML parser - generic recursive tag parser
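
The recursion can be sketched in Python as follows.   This is an illustration under simplifying assumptions (no comments, no CDATA sections, no ">" inside attribute values), and "skip_element" is a hypothetical name, not part of Data Splitter :

```python
def skip_element(text, pos):
    """Skip one whole element whose '<' is at text[pos]; return the index just past it."""
    end_of_open = text.index(">", pos)
    if text[end_of_open - 1] == "/":        # self-closing tag such as <br/>
        return end_of_open + 1
    pos = end_of_open + 1
    while True:
        next_lt = text.index("<", pos)      # scan over character data
        if text.startswith("</", next_lt):  # closing tag: scan to '>' and "Return"
            return text.index(">", next_lt) + 1
        pos = skip_element(text, next_lt)   # nested tag: recurse, like "NestedTag"

sample = "<publisher><name>Acme</name><city>Oslo</city></publisher><title>"
after = skip_element(sample, 0)
print(sample[after:])
```

Like the node group, the sketch performs no action on what it skips;  its only job is to leave the position just past the ignored element.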

Finally, here's the closing tag parser.   It simply recognizes the opening "<" then scans until the closing ">" is encountered, at which point it executes a "Return" action :

screen shot: XML parser - closing tag parser

Note that "ClosingTag" simply scans over the closing tag without concern for the tag value, i.e. without concern for whether or not the closing tag matches its opening tag.
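
A minimal Python sketch of the same behaviour shows what this means in practice.   The function name "skip_closing_tag" is an assumption for the illustration :

```python
def skip_closing_tag(text, pos):
    """Recognize '<' at pos, scan to the next '>', and return the index just past it."""
    if not text.startswith("<", pos):
        raise ValueError("expected '<' at position %d" % pos)
    return text.index(">", pos) + 1

well_formed = "</book>rest"
mismatched = "</bok>rest"  # misspelled closing tag

print(well_formed[skip_closing_tag(well_formed, 0):])
print(mismatched[skip_closing_tag(mismatched, 0):])
```

Both calls succeed and leave "rest" as the remaining input, so a misspelled closing tag passes unnoticed.   If your data may be malformed, that check has to be made elsewhere.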

Other things to note about XML parsing :