Example: web page text scraper

This HTML parser, Web-Extract-Title-Header.dss, extracts the title and one header from each of the specified web pages:

screen shot: web page title and header extraction

The parser starts by searching for the <title> start tag. Everything between that tag and the next end tag is associated with the Title node. When the end tag is recognized, the OutputTitle action group runs. This action group formats the recognized title and outputs it.
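The title step can be approximated outside Data Splitter. The following Python sketch is only an illustration of the same idea, not the contents of the .dss file: the function name extract_title and the whitespace collapsing are assumptions made for this example.

```python
import re
from typing import Optional

# Matches the <title> start tag, with or without attributes, in any letter case.
TITLE_START = re.compile(r"<title\b[^>]*>", re.IGNORECASE)


def extract_title(html: str) -> Optional[str]:
    """Return the text between <title> and the next end tag, or None."""
    start = TITLE_START.search(html)
    if start is None:
        return None
    # The next end tag after the start tag closes the title.
    end = html.find("</", start.end())
    if end == -1:
        return None
    # Assumption: collapse runs of whitespace so the title fits on one line.
    return " ".join(html[start.end():end].split())


if __name__ == "__main__":
    page = "<html><head><TITLE> World  News </TITLE></head></html>"
    print(extract_title(page))
```

Returning None when no title is found lets the caller decide how to report a page that has no <title> element.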

The parser then searches for one of the header tags contained in the HeaderTag string set.

The next task is to build the Header variable, which (if done correctly) will contain the visible text between the header start tag and its end tag ("</h"). The text between the start tag and the end tag may contain HTML tags which must be filtered out.

When the header start tag (HeaderTag) is recognized, the Header variable is cleared. The text between the header start and end tags is then appended to the Header variable, except:

if it's a <br> tag, or
if it's any other HTML element.

HTML <br> tags are replaced with blanks. All other HTML elements are simply ignored.
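The filtering rules above can be sketched in Python as follows. This is a hedged illustration rather than the sample's actual logic: the name extract_header, the assumption that the HeaderTag string set covers h1 through h6, and the final whitespace collapsing are all choices made for this example.

```python
import re
from typing import Optional

# Assumption: the HeaderTag string set is taken to be h1..h6 here.
HEADER_START = re.compile(r"<h[1-6]\b[^>]*>", re.IGNORECASE)
HEADER_END = re.compile(r"</h", re.IGNORECASE)  # end tag prefix, as in the text
BR_TAG = re.compile(r"<br\b[^>]*>", re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]*>")


def extract_header(html: str) -> Optional[str]:
    """Return the visible text of the first header element, or None."""
    start = HEADER_START.search(html)
    if start is None:
        return None
    end = HEADER_END.search(html, start.end())
    if end is None:
        return None
    inner = html[start.end():end.start()]
    inner = BR_TAG.sub(" ", inner)   # <br> tags are replaced with blanks
    inner = ANY_TAG.sub("", inner)   # all other HTML elements are ignored
    # Assumption: collapse repeated whitespace left behind by removed tags.
    return " ".join(inner.split())


if __name__ == "__main__":
    page = '<body><h2 class="a">Top <b>story</b><br>of the day</h2></body>'
    print(extract_header(page))
```

The <br> substitution runs before the general tag removal. In the opposite order, <br> would be deleted like any other tag and the words on either side of it would run together.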

When the header end tag is recognized, the OutputHeader action group runs. This action group formats the gathered header content and outputs it.

This sample parser is configured to grab the title text and the content of a single header element from a list of news organization URLs. It works for all three websites specified in the sample, but will usually require modification to correctly parse other websites. See Data Splitter Help for more information.