This parser first searches the web page for one of the entries in the "TableLocator" string set. This allows skipping forward to the table to be scraped.
The parser then skips three table rows (the three TR nodes with no actions) and starts collecting data at the fourth table row (TR).
For each remaining row in the table, it parses the first five table data (TD) tags:
- the first column is skipped,
- the contents of the second column are sent to the output table's Name field,
- the third column is skipped,
- the contents of the fourth column are sent to the output table's Population field,
- the contents of the fifth column are sent to the output table's Area field.
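The steps above can be sketched in Python using only the standard library. This is an illustration of the described flow, not the node-based parser itself: the locator strings in TABLE_LOCATORS, the RowCollector class, and the scrape function are hypothetical names chosen for this example, and the sketch assumes the table is not nested and ends at the first closing table tag after the locator.

```python
from html.parser import HTMLParser

# Hypothetical locator strings; the real "TableLocator" set is not shown here.
TABLE_LOCATORS = {"Population by state", "Census table"}
HEADER_ROWS = 3  # table rows skipped before data collection starts
# Zero-based column index -> output field; columns 0 and 2 are skipped.
COLUMN_FIELDS = {1: "Name", 3: "Population", 4: "Area"}


class RowCollector(HTMLParser):
    """Collect the text of every TD cell, grouped by TR row."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None
        self._done = False  # set once the first </table> has been seen

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if self._done:
            return
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag == "table":
            self._done = True  # ignore anything after the scraped table


def scrape(page):
    """Return one dict per data row, or an empty list if no locator matches."""
    # Skip forward to the earliest locator string found in the page.
    hits = [i for i in (page.find(s) for s in TABLE_LOCATORS) if i >= 0]
    if not hits:
        return []
    collector = RowCollector()
    collector.feed(page[min(hits):])
    collector.close()
    records = []
    for cells in collector.rows[HEADER_ROWS:]:
        if len(cells) < 5:
            continue  # not a full five-column data row
        records.append({field: cells[i] for i, field in COLUMN_FIELDS.items()})
    return records
```

Calling scrape(page) on the text of a downloaded page returns a list of dictionaries with Name, Population, and Area keys. The values are still raw strings at this point, so numeric cleanup remains a separate step.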
The numeric fields are "cleaned up" with the Extract-Digits node group. Extract-Digits currently removes only commas, so that the fields can be loaded into numeric database fields. It can be extended to remove other non-numeric characters.
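As a rough Python equivalent of this cleanup step, the sketch below shows the comma-only behavior and one possible extension. The function names extract_digits and extract_digits_strict are hypothetical and are not part of the Extract-Digits node group.

```python
def extract_digits(text):
    """Remove commas so the value can be loaded into a numeric field."""
    return text.replace(",", "")


def extract_digits_strict(text):
    """A possible extension: keep only the ASCII digits 0-9."""
    return "".join(ch for ch in text if ch in "0123456789")
```

The strict variant also drops decimal points and minus signs, so it only suits fields that hold non-negative whole numbers.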
This parser is simplified by the use of several other node groups:
- TR: HTML table row parser
- TD: HTML table cell parser
- HTML-element: general-purpose HTML element parser
- HTML-entity: HTML entity parser
This parser is included as part of the DS Census table scraper sample. It can be adapted to parse HTML tables with any number of columns.