DS command line program
Overview |
Requirements |
Running DS |
DSN configuration |
HTML parsing |
Options |
Scanning web page sequences with NextURL |
DS samples |
Overview
DS.exe ("DS" hereafter) is a Windows command line program that runs Data Splitter. There are two command formats :
ds <dss file name>
ds <data source name> <task table name>
In the first, 1-argument format the specified solution (DSS) file controls DS. The solution must contain a list of input files, emails or URLs. If multiple input lists are specified one will be chosen; see the Run topic for more information. The transformation defined in the DSS file will be run on the chosen input list.
In the second, 2-argument format a task table in a database controls DS. The first argument is the ODBC Data Source Name (DSN) of the database. The second argument is the name of a task table in that database. The remainder of this document describes how to configure and run DS under the control of a database task table.
DS database control
Under database control DS is driven by 3 interrelated database table types :
- Task tables
- Source tables
- SQL tables
The table names are user-defined, but each table type has a specific format. The fields indicated in the table descriptions below must be present, but the user can add fields to the tables as needed.
The 2 command line arguments identify a task table within a database :
ds <DSN> <task table>
- the ODBC Data Source Name (DSN),
- the name of a task table in that database.
A database can contain more than one task table, but a DS run is controlled by a single task table.
For example, this command runs the census table scraper sample :
"C:\Program Files\DataSplitter.com\DataSplitter\DS" DataSplitter-Census-01 Tasks
"DataSplitter-Census-01" is the DSN, "Tasks" is the task table name. See the Running DS topic for more information.
The task table in turn refers to source and SQL tables, as described below.
The task table
A single task table controls the process, instructing DS to :
- Scan URLs
- Execute SQL
Task table | |
---|---|
Field: | Description: |
N | The action index (1..N) |
Exec | Execute this task (yes/no)? |
Action | The type of task to perform: "Scan" or "ExecSQL" |
Parameter | Data to be used by the task (a source or SQL table) |
DS scans the task table executing the user-defined actions, which can be :
Scan | - scan the URLs in the Source table whose name is specified in the parameter field |
---|---|
ExecSQL | - execute the statements in the SQL table whose name is specified in the parameter field |
The Scan and ExecSQL tasks can be run in any order. For example, you might want to execute SQL before and after a URL scan.
Example
The following task table :
- runs the SQL in table PreSQL,
- scans the URL list in table WebSources,
- runs the SQL in table PostSQL.
N | Exec | Action | Parameter |
---|---|---|---|
1 | X | ExecSQL | PreSQL |
2 | X | Scan | WebSources |
3 | X | ExecSQL | PostSQL |
The Exec field allows the user to switch tasks on and off without having to add / remove them from the table.
This design is flexible enough to allow multiple passes, where, for example, a list of URLs generated by one step is scanned by the next step. Data from multi-level websites can be extracted using this feature.
Source tables
The URLs and associated data are specified in "source" database tables containing the following information :
Source table | |
---|---|
Field: | Description: |
Scan | Scan this URL (yes/no)? |
Source | User's description of the URL; passed to Data Splitter variable "Source" |
URL | The Uniform Resource Locator |
Parser | The Data Splitter .DSS file to be used to parse the URL's data |
Options | Option string |
UserName | The user ID (if any) required to access the URL |
Password | The password (if any) required to access the URL |
Notes | User notes, free-format, ignored by DS |
DS scans Source tables from beginning to end. Each URL is parsed per the associated Data Splitter parser (.DSS file).
Example
The following Source table instructs DS to fetch and parse data from 3 URLs :
Scan | Source | URL | Parser |
---|---|---|---|
X | Con Ed | http://finance.yahoo.com/q?s=ED | Quotes01 |
X | Borland | http://finance.yahoo.com/q?s=BORL | Quotes01 |
X | GMOT | http://finance.yahoo.com/q?s=GMA | Quotes01 |
All three URLs are parsed with the Quotes01.DSS parser. The Options, UserName and Password fields, which are not typically used, are not shown in this example.
SQL tables
The task table can trigger the execution of SQL (database Structured Query Language) statements in tables containing the following information :
SQL table | |
---|---|
Field: | Description: |
N | An index (1..N) controlling statement execution order |
Exec | Execute this SQL statement (yes/no)? |
Command | An SQL statement (must be valid!) |
SQL tables can be used to prepare data in the Source tables, for example, or to clear an output table prior to a scan.
Examples
The following SQL table instructs DS to execute an SQL statement to clear the Output table prior to a scan :
N | Exec | Command |
---|---|---|
1 | X | delete from Output |
The following SQL table instructs DS to execute 2 SQL statements to prepare the ContributionsSources table for a URL scan :
N | Exec | Command |
---|---|---|
1 | X | update ContributionsSources set Parser='Contributions' |
2 | X | update ContributionsSources set Source='Contributions - '+zip |
In this case the ContributionsSources table has an extra field, "zip". Fields can be added to the DS tables if necessary to facilitate SQL manipulations.
Once again, an Exec field allows the user to switch tasks on and off without having to add / remove them from the table.
The Command field must contain a valid SQL statement.
Running DS
DS is run from the command line with 1 or 2 arguments :
ds <dss file name>
ds <DSN> <task table>
In the first, single-argument format, DS runs under the control of the specified solution (DSS) file.
In the second, 2-argument format, DS runs under database control using :
- the database identified by the ODBC Data Source Name (DSN),
- the specified task table within that database.
Enter these command lines at the Windows command prompt, or other places where command lines are used, such as :
- Windows shortcuts
- batch files
- the Windows task scheduler
For example, to run the census table scraper sample at the command prompt, enter :
"C:\Program Files\DataSplitter.com\DataSplitter\DS" DataSplitter-Census-01 Tasks
DSN configuration
DS uses ODBC, "Open Database Connectivity", to interact with user-defined databases. Under ODBC databases are assigned labels, or Data Source Names (DSNs). The DSNs for the sample databases are automatically configured during installation.
To use another database the user must configure a DSN for the database. DSN configuration means, essentially, applying a system-wide label to an existing database. Use the "Configure data sources" shortcut created during installation to start the ODBC Data Source Administrator, then :
- Select the User DSN tab
- Click the Add button
- Select the appropriate ODBC driver - for the sample Access database it will be "Microsoft Access Driver"
- Click Finish (though you're not quite finished ...)
- In the Data Source Name box enter the label, e.g. "CensusData"
- Enter a description for the database (optional)
- Click Select to locate the database in the file system, e.g. "C:\ProgramData\DataSplitter.com\DataSplitter"
- Select the database file/directory and click OK
- Keep clicking OK to get out of the ODBC Data Source Administrator
Visit datasplitter.com/database.html for more information about databases, DSNs and ODBC.
HTML parsing
Web pages are written using HTML, Hypertext Markup Language. DS parsers are, therefore, HTML parsers, so configuring DS requires some knowledge of HTML. For example, parsing an HTML table requires an understanding of the TABLE, TR, TH, and TD tags, among others.
In simple cases HTML tags can be ignored. For more sophisticated parsing the user should be familiar with the basics of HTML: many good books are available on the subject.
Most web browsers allow the user to view the HTML source of a web page (View / Source in Microsoft Internet Explorer 8, View / Page Source in Mozilla Firefox 3). The source is displayed with tags and content in different colors, making it easier to see the structure of the web page. DS supports a "noparse" option that allows saving the URL content to a local file.
Data Splitter's design allows creation of reusable parsing components : node groups. For example, the following HTML parsing node groups are installed with DS Census table scraper sample :
HTML-element | Generic HTML tag parser |
---|---|
HTML-entity | Generic HTML entity parser |
TR | HTML "TR" (table row) parser |
TD | HTML "TD" (table data) parser |
Extract-Digits | digit ('0' .. '9') extractor |
Node groups greatly simplify top-level .DSS solutions. See Data Splitter Help for more information.
DS options
DS supports several options that can be specified on a per-URL basis, in the Source table "options" field :
Option: | Purpose: |
---|---|
noparse | Writes the URL's content directly to a file without
parsing it ¹ |
trace | Increases the amount of diagnostic output produced |
retries | Sets the maximum number of retries if an HTTP redirect occurs ² |
decode | Parses the URL using the AfxParseURLEx ICU_DECODE option ³ |
noencode | " AfxParseURLEx ( ICU_NO_ENCODE ) |
nometa | " AfxParseURLEx ( ICU_NO_META ) |
encodespacesonly | " AfxParseURLEx ( ICU_ENCODE_SPACES_ONLY ) |
browsermode | " AfxParseURLEx ( ICU_BROWSER_MODE ) |
¹ The "noparse" option enables you to write the fetched URL's content directly to a local file, without modification. The output file path is specified in the sources table parser field. This option provides the unmodified results of the HTTP Get command. This can be very useful for parser development.
² The maximum number of HTTP redirects defaults to 3.
³ DS uses a Microsoft Foundation Class (MFC) function called "AfxParseURLEx" to parse the URL prior to fetching the data. Visit http://msdn.microsoft.com for more information on AfxParseURLEx and its options.
Example
Scan | Source | URL | Parser | Options |
---|---|---|---|---|
X | Sample data | http://datasplitter.com/sample.html | Census-01 | retries=0 |
X | Sample data | http://datasplitter.com/sample.html | c:\urldump\sample.html | noparse |
X | Sample data | http://datasplitter.com/sample.html | Census-01 | trace noencode |
The 1st example sets the number of HTTP redirects to zero.
The 2nd example writes the fetched URL's content, unmodified, to file c:\urldump\sample.html.
The 3rd example illustrates the combination of two options, separated by a blank.
Scanning web page sequences with NextURL
DS can scan a sequence of web pages if those web pages contain links to "next" pages. The Data Splitter parser must define a string variable called "NextURL" and set it appropriately. DS will continue fetching pages as long as "NextURL" contains a valid URL. The NextURL feature allows an unlimited sequence of web pages to be fetched for a single URL entry in the Sources table.
The parser will typically clear NextURL at first, then set it only if a "next" link is recognized. This is a powerful feature, but care must be taken to avoid infinite loops!
DS requirements
- Microsoft Windows 32-bit
- 10 megabytes of free disk space
- a database management system (DBMS), if running under database control or outputting to a database
- an Internet connection, if reading input from a URL list
DS samples
DS sample databases are available at the Data Splitter website :
The sample installers automatically configure their ODBC DSNs. The installers run the DSNConfig program, which use the *-DSN.txt files to configure the DSNs. The sample uninstallers remove the DSNs created at installation.
DS sample : Census table scraper
This DS sample extracts data from an HTML table at http://factfinder.census.gov.
It uses 4 database tables :
- Tasks
- Sources
- Output
- SQL1
Keep in mind that the word "table" is used two ways here :
- database table
- HTML table
Two tasks are configured in the Tasks table :
N | Exec | Action | Parameter |
---|---|---|---|
1 | X | ExecSQL | SQL1 |
2 | X | Scan | Sources |
The SQL1 database table contains a single SQL statement :
delete from Output
This clears the Output table prior to the scan.
The Sources database table contains the URL / parser list :
Scan | Source | URL | Parser |
---|---|---|---|
X | U.S. Census Bureau | http://factfinder.census.gov/ ... | Census-01.dss |
The Census-01.DSS parser starts by looking for a "TableLocator" (specified as an entry in a Data Splitter string set), in order to skip over page content preceding the HTML table to be scraped. The entries in the TableLocator string set contain keywords that distinguish the desired table from other tables that may precede it in the web page being parsed :
Sample output
Name | Population | Area | Grabbed |
---|---|---|---|
California | 33871648 | 163695.57 | 10/25/2009 12:03:28 AM |
Texas | 20851820 | 268580.82 | 10/25/2009 12:03:28 AM |
New York | 18976457 | 54556 | 10/25/2009 12:03:28 AM |
... | ... | ... | ... |
Home | Download | Help | Site index | FAQs | Support |