Example: web page number scraper
Here's a simple web scraping example: Web-Extract-Number.dss, which extracts numbers from the specified web pages.
Many generated web pages contain labeled numbers. In the page source, each one appears as:
- a label followed by
- (some other stuff) followed by
- the number we're interested in.
The "other stuff" is HTML markup that the browser does not display, so this scraper can simply skip over it.
We also want to extract a unique identifier from each page. This particular scraper uses the contents of the first HTML <h1> element.
The parser locates the first <h1> tag on the page and extracts the text between it and the closing </h1> tag. To use the page's <title> tag, or any other identifying HTML element, instead, change the start node's string:
- right-click on the start node (top left, labeled "Start"),
- select the String option from the drop-down menu,
- replace "<h1>" with the new tag and press OK.
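For readers who want to see the identifier step outside the visual parser, here is a minimal Python sketch of the same idea. It is not the .dss parser itself: the function name extract_identifier and the regular expression are assumptions made for illustration, and a regular expression is only a rough stand-in for a real HTML parser.

```python
import re

def extract_identifier(html, tag="h1"):
    """Return the text inside the first <tag>...</tag> pair, or None.

    Hypothetical helper: mirrors the "start node string" idea by letting
    the caller choose which tag identifies the page.
    """
    # Allow attributes on the opening tag; match lazily up to the first close.
    pattern = re.compile(
        r"<{0}\b[^>]*>(.*?)</{0}\s*>".format(re.escape(tag)),
        re.IGNORECASE | re.DOTALL,
    )
    match = pattern.search(html)
    if match is None:
        return None
    # Drop nested markup and collapse runs of whitespace.
    text = re.sub(r"<[^>]+>", "", match.group(1))
    return " ".join(text.split())
```

Calling extract_identifier(page, tag="title") corresponds to replacing "<h1>" with "<title>" in the start node.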
After extracting the identifier, the parser searches the page for the specified label, then looks for a DecimalNumber pattern: some digits, a decimal point, and two more digits. When it finds a match, it executes FoundNumber, an action group that sends the extracted data and other information to the target (a file or a database, depending on how the target is defined).
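The label-then-number search can be sketched in Python as follows. This is an illustration under assumptions, not the parser's actual definition: DECIMAL_NUMBER is a guess at what "some digits, a decimal point, and two more digits" means as a regular expression, and extract_labeled_number is a hypothetical name.

```python
import re

# Assumed equivalent of the DecimalNumber pattern described above:
# one or more digits, a decimal point, then exactly two digits.
DECIMAL_NUMBER = r"\d+\.\d{2}"

def extract_labeled_number(html, label):
    """Find `label`, skip whatever follows it, and return the first
    decimal number after it as a string (or None if nothing matches)."""
    pos = html.find(label)
    if pos == -1:
        return None
    # Search only the text after the label; the markup in between is ignored.
    match = re.search(DECIMAL_NUMBER, html[pos + len(label):])
    return match.group(0) if match else None
```

Note that this sketch takes the first match after the label, so a decimal value inside the intervening markup (for example in an inline style) would be picked up first.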
This simple example extracts two pieces of information from each web page:
- the identifying text
- a single decimal number
The output is a table that contains, for each web page, both extracted values plus:
- the date and time of extraction
- a link to the source URL
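As a rough picture of one output row, the sketch below builds an HTML table row from the four values. The HTML row layout, the timestamp format, and the name make_row are all assumptions for illustration; the actual layout is determined by the FoundNumber action group and the target definition.

```python
from datetime import datetime
from html import escape

def make_row(identifier, number, url, when=None):
    """Build one HTML table row: identifier, number, timestamp, source link.

    Hypothetical helper; `when` defaults to the current date and time.
    """
    when = when or datetime.now()
    stamp = when.strftime("%Y-%m-%d %H:%M:%S")
    # Escape the URL so characters such as & are safe inside the attribute.
    safe_url = escape(url, quote=True)
    link = '<a href="{0}">{0}</a>'.format(safe_url)
    cells = [escape(identifier), escape(number), stamp, link]
    return "<tr>" + "".join("<td>{}</td>".format(c) for c in cells) + "</tr>"
```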
This example also uses a file variable: a small file (Web-Extract-Number-Header.txt) that contains the HTML header inserted at the top of the output. You can edit this text file to change the output's appearance (font, color, etc.).
This web scraping / HTML parsing example can be modified to extract other numeric formats, or other string patterns, from the listed web pages.
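To illustrate what "other numeric formats" might look like, the Python sketch below swaps the pattern while keeping the same label-then-match logic. The PATTERNS table and the name find_after_label are hypothetical; in the parser itself the equivalent change would be made to the pattern that follows the label.

```python
import re

# Illustrative alternatives to the two-decimal pattern (all assumed).
PATTERNS = {
    "decimal": r"\d+\.\d{2}",                       # 1234.50
    "integer": r"\d+",                              # 17
    "thousands": r"\d{1,3}(?:,\d{3})+(?:\.\d+)?",   # 1,234,567 or 1,234.5
    "percent": r"\d+(?:\.\d+)?%",                   # 4.5%
}

def find_after_label(html, label, pattern):
    """Return the first match of `pattern` after `label`, or None."""
    pos = html.find(label)
    if pos == -1:
        return None
    match = re.search(pattern, html[pos + len(label):])
    return match.group(0) if match else None
```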