DS command line program

Overview
Requirements
Running DS
DSN configuration
HTML parsing
Options
Scanning web page sequences with NextURL
DS samples

Overview

DS.exe ("DS" hereafter) is a Windows command line program that runs Data Splitter.   There are two command formats :

ds <dss file name>
ds <data source name> <task table name>

In the first, 1-argument format the specified solution (DSS) file controls DS.   The solution must contain a list of input files, emails or URLs.   If multiple input lists are specified one will be chosen; see the Run topic for more information.   The transformation defined in the DSS file will be run on the chosen input list.

In the second, 2-argument format a task table in a database controls DS.   The first argument is the ODBC Data Source Name (DSN) of the database.   The second argument is the name of a task table in that database.   The remainder of this document describes how to configure and run DS under the control of a database task table.

DS database control

Under database control DS is driven by 3 interrelated database table types :

The table names are user-defined, but each table type has a specific format.   The fields indicated in the table descriptions below must be present, but the user can add fields to the tables as needed.

The 2 command line arguments identify a task table within a database :

ds <DSN> <task table>
  1. the ODBC Data Source Name (DSN),
  2. the name of a task table in that database.

A database can contain more than one task table, but a DS run is controlled by a single task table.

For example, this command runs the census table scraper sample :

"C:\Program Files\DataSplitter.com\DataSplitter\DS" DataSplitter-Census-01 Tasks

"DataSplitter-Census-01" is the DSN, "Tasks" is the task table name.   See the Running DS topic for more information.

The task table in turn refers to source and SQL tables, as described below.


The task table

A single task table controls the process, instructing DS to :

Task table
Field: Description:
N The action index (1..N)
Exec Execute this task (yes/no)?
Action The type of task to perform:  "Scan" or "ExecSQL"
Parameter Data to be used by the task (a source or SQL table)

DS scans the task table executing the user-defined actions, which can be :

Scan - scan the URLs in the Source table whose name is specified in the parameter field
ExecSQL - execute the statements in the SQL table whose name is specified in the parameter field

The Scan and ExecSQL tasks can be run in any order.   For example, you might want to execute SQL before and after a URL scan.

Example

The following task table :

  1. runs the SQL in table PreSQL,
  2. scans the URL list in table WebSources,
  3. runs the SQL in table PostSQL.
N Exec Action Parameter
1 X ExecSQL PreSQL
2 X Scan WebSources
3 X ExecSQL PostSQL

The Exec field allows the user to switch tasks on and off without having to add / remove them from the table.

This design is flexible enough to allow multiple passes, where, for example, a list of URLs generated by one step is scanned by the next step.   Data from multi-level websites can be extracted using this feature.


Source tables

The URLs and associated data are specified in "source" database tables containing the following information :

Source table
Field:Description:
Scan Scan this URL (yes/no)?
Source User's description of the URL; passed to Data Splitter variable "Source"
URL The Uniform Resource Locator
Parser The Data Splitter .DSS file to be used to parse the URL's data
Options Option string
UserName The user ID (if any) required to access the URL
Password The password (if any) required to access the URL
Notes User notes, free-format, ignored by DS

DS scans Source tables from beginning to end.   Each URL is parsed per the associated Data Splitter parser (.DSS file).

Example

The following Source table instructs DS to fetch and parse data from 3 URLs :

Scan Source URL Parser
X Con Ed http://finance.yahoo.com/q?s=ED Quotes01
X Borland http://finance.yahoo.com/q?s=BORL Quotes01
X GMOT http://finance.yahoo.com/q?s=GMA Quotes01

All three URLs are parsed with the Quotes01.DSS parser.   The Options, UserName and Password fields, which are not typically used, are not shown in this example.


SQL tables

The task table can trigger the execution of SQL (database Structured Query Language) statements in tables containing the following information :

SQL table
Field: Description:
N An index (1..N) controlling statement execution order
Exec Execute this SQL statement (yes/no)?
Command An SQL statement (must be valid!)

SQL tables can be used to prepare data in the Source tables, for example, or to clear an output table prior to a scan.

Examples

The following SQL table instructs DS to execute an SQL statement to clear the Output table prior to a scan :

N Exec Command
1 X delete from Output

The following SQL table instructs DS to execute 2 SQL statements to prepare the ContributionsSources table for a URL scan :

N Exec Command
1 X update ContributionsSources set Parser='Contributions'
2 X update ContributionsSources set Source='Contributions - '+zip

In this case the ContributionsSources table has an extra field, "zip".   Fields can be added to the DS tables if necessary to facilitate SQL manipulations.

Once again, an Exec field allows the user to switch tasks on and off without having to add / remove them from the table.

The Command field must contain a valid SQL statement.


Running DS

DS is run from the command line with 1 or 2 arguments :

ds <dss file name>
ds <DSN> <task table>

In the first, single-argument format, DS runs under the control of the specified solution (DSS) file.

In the second, 2-argument format, DS runs under database control using :

  1. the database identified by the ODBC Data Source Name (DSN),
  2. the specified task table within that database.

Enter these command lines at the Windows command prompt, or other places where command lines are used, such as :

For example, to run the census table scraper sample at the command prompt, enter :

"C:\Program Files\DataSplitter.com\DataSplitter\DS" DataSplitter-Census-01 Tasks

DSN configuration

DS uses ODBC, "Open Database Connectivity", to interact with user-defined databases.   Under ODBC databases are assigned labels, or Data Source Names (DSNs).   The DSNs for the sample databases are automatically configured during installation.

To use another database the user must configure a DSN for the database.   DSN configuration means, essentially, applying a system-wide label to an existing database.   Use the "Configure data sources" shortcut created during installation to start the ODBC Data Source Administrator, then :

Visit datasplitter.com/database.html for more information about databases, DSNs and ODBC.


HTML parsing

Web pages are written using HTML, Hypertext Markup Language.   DS parsers are, therefore, HTML parsers, so configuring DS requires some knowledge of HTML.   For example, parsing an HTML table requires an understanding of the TABLE, TR, TH, and TD tags, among others.

In simple cases HTML tags can be ignored.   For more sophisticated parsing the user should be familiar with the basics of HTML:  many good books are available on the subject.

Most web browsers allow the user to view the HTML source of a web page (View / Source in Microsoft Internet Explorer 8, View / Page Source in Mozilla Firefox 3).   The source is displayed with tags and content in different colors, making it easier to see the structure of the web page.   DS supports a "noparse" option that allows saving the URL content to a local file.

Data Splitter's design allows creation of reusable parsing components : node groups.   For example, the following HTML parsing node groups are installed with DS Census table scraper sample :

HTML-element Generic HTML tag parser
HTML-entity Generic HTML entity parser
TR HTML "TR" (table row) parser
TD HTML "TD" (table data) parser
Extract-Digits digit ('0' .. '9') extractor

Node groups greatly simplify top-level .DSS solutions.   See Data Splitter Help for more information.


DS options

DS supports several options that can be specified on a per-URL basis, in the Source table "options" field :

Option: Purpose:
noparse Writes the URL's content directly to a file without parsing it ¹
trace Increases the amount of diagnostic output produced
retries Sets the maximum number of retries if an HTTP redirect occurs ²
decode Parses the URL using the AfxParseURLEx ICU_DECODE option ³
noencode " AfxParseURLEx ( ICU_NO_ENCODE )
nometa " AfxParseURLEx ( ICU_NO_META )
encodespacesonly " AfxParseURLEx ( ICU_ENCODE_SPACES_ONLY )
browsermode " AfxParseURLEx ( ICU_BROWSER_MODE )

¹ The "noparse" option enables you to write the fetched URL's content directly to a local file, without modification.   The output file path is specified in the sources table parser field.   This option provides the unmodified results of the HTTP Get command.   This can be very useful for parser development.

² The maximum number of HTTP redirects defaults to 3.

³ DS uses a Microsoft Foundation Class (MFC) function called "AfxParseURLEx" to parse the URL prior to fetching the data.   Visit http://msdn.microsoft.com for more information on AfxParseURLEx and its options.

Example

Scan Source URL Parser Options
X Sample data http://datasplitter.com/sample.html Census-01 retries=0
X Sample data http://datasplitter.com/sample.html c:\urldump\sample.html noparse
X Sample data http://datasplitter.com/sample.html Census-01 trace noencode

The 1st example sets the number of HTTP redirects to zero.

The 2nd example writes the fetched URL's content, unmodified, to file c:\urldump\sample.html.

The 3rd example illustrates the combination of two options, separated by a blank.


Scanning web page sequences with NextURL

DS can scan a sequence of web pages if those web pages contain links to "next" pages.   The Data Splitter parser must define a string variable called "NextURL" and set it appropriately.   DS will continue fetching pages as long as "NextURL" contains a valid URL.   The NextURL feature allows an unlimited sequence of web pages to be fetched for a single URL entry in the Sources table.

The parser will typically clear NextURL at first, then set it only if a "next" link is recognized.   This is a powerful feature, but care must be taken to avoid infinite loops!


DS requirements


DS samples

DS sample databases are available at the Data Splitter website :

The sample installers automatically configure their ODBC DSNs.   The installers run the DSNConfig program, which use the *-DSN.txt files to configure the DSNs.   The sample uninstallers remove the DSNs created at installation.


DS sample : Census table scraper

This DS sample extracts data from an HTML table at http://factfinder.census.gov.

It uses 4 database tables :

  1. Tasks
  2. Sources
  3. Output
  4. SQL1

Keep in mind that the word "table" is used two ways here :

Two tasks are configured in the Tasks table :

N Exec Action Parameter
1 X ExecSQL SQL1
2 X Scan Sources

The SQL1 database table contains a single SQL statement :

delete from Output

This clears the Output table prior to the scan.

The Sources database table contains the URL / parser list :

Scan Source URL Parser
X U.S. Census Bureau http://factfinder.census.gov/ ... Census-01.dss

The Census-01.DSS parser starts by looking for a "TableLocator" (specified as an entry in a Data Splitter string set), in order to skip over page content preceding the HTML table to be scraped.   The entries in the TableLocator string set contain keywords that distinguish the desired table from other tables that may precede it in the web page being parsed :

screen shot: HTML table parser

Sample output

Name Population Area Grabbed
California 33871648 163695.57 10/25/2009 12:03:28 AM
Texas 20851820 268580.82 10/25/2009 12:03:28 AM
New York 18976457 54556 10/25/2009 12:03:28 AM
... ... ... ...


*Home *Download *Help *Site index *FAQs *Support