Guide to the sample solutions accompanying
the Data Splitter installation

EMail-Search
EMail-Search-Word-Pairs
EMail-To-Database
File-Count-Keywords
File-Count-Lines
File-CRLF-LF
File-Filter-Unprintables
File-Generate-Site-Index
File-HTML-Generate-Line-Breaks
File-LF-CRLF
File-Search
File-Search-Replace
Web-Extract-Number
Web-Extract-Title-Header
Web-Watch-Words

EMail-Search.dss

Finds messages containing any of the text strings listed in the "SearchText" string set. User must specify input folders and "SearchText" items. Produces HTML output.

Makes use of post-processor File-HTML-Generate-Line-Breaks.dss, which restores the line-by-line appearance of the original emails that is lost in the initial conversion to HTML format.

View the final results in output file "Results" (View | Results).
View the intermediate results without the line breaks in output file "TempFile" (View | TempFile).

Note: Prior to running the sample email parsers it may be advisable to set the message profile in Options | Message.

EMail-Search-Word-Pairs.dss

Finds messages with proximate text strings, i.e. two strings near each other. User must specify input folders and the "Word1/2" lists. Produces HTML output.

Makes use of 3 node groups :

HTML-Strip	converts HTML tags to text so they will display as-is in the HTML output
EMail-SearchWordPairsInner	does the actual formatting of an email that has been determined to have word pairs near each other
HTML-Add-Line-Breaks	restores line breaks in HTML, similar to the File-HTML-Generate-Line-Breaks post-processor

This example locates emails with the words "e-mail" OR "email" and variations on the word "parse" somewhere near each other in the message body, "near" being defined as within 200 characters. See definition of Pattern "(near)".
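
For readers who want to see the proximity idea outside Data Splitter, here is a minimal Python sketch. It is not how the "(near)" pattern is implemented; the function name words_near, the variant list for "parse", and the way the gap is measured (characters between the two matches) are all assumptions made for illustration.

```python
import re

def words_near(text, first_terms, second_pattern, max_gap=200):
    """Return True if any first term and any match of second_pattern
    lie within max_gap characters of each other, in either order."""
    firsts = [m for term in first_terms
              for m in re.finditer(re.escape(term), text, re.IGNORECASE)]
    seconds = list(re.finditer(second_pattern, text, re.IGNORECASE))
    for a in firsts:
        for b in seconds:
            # Characters between the end of the earlier match and the
            # start of the later one (negative if the matches overlap).
            gap = max(a.start(), b.start()) - min(a.end(), b.end())
            if gap <= max_gap:
                return True
    return False

# Hypothetical stand-ins for the "Word1/2" lists.
WORD1 = ["e-mail", "email"]
WORD2 = r"pars(?:e|es|ed|er|ing)"
```

Calling words_near(body, WORD1, WORD2) on a message body returns True only when a match from each list falls within the 200-character window.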

EMail-To-Database.dss

Sample email parser that transmits extracted fields to a database. This example parses eBay end-of-auction notification messages into database table "eauction". Can be customized for other generated email formats by modifying the string sets :

SubjectFilter	text that begins the subject field in the email header
TextFields	a list of text field labels and their respective database destinations
NumericFields	a list of numeric field labels and their respective database destinations
CurrencyFields	a list of currency field labels and their respective database destinations

Makes use of 3 node groups :

Name-Address	extracts the name and email address (RName + RMail) from the from/to fields in the email header
Decimal-Number	gets a decimal number: digits + decimal point + digits
Text	gets text from the input, strips leading / trailing blanks, stops at end of line
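
As a rough illustration of what the three node groups extract, the following Python sketch does the equivalent work with the standard library. The function names are hypothetical and the regular expression is only one reading of "digits + decimal point + digits"; the node groups themselves may behave differently at the edges.

```python
import re
from email.utils import parseaddr

DECIMAL = re.compile(r"\d+\.\d+")  # digits + decimal point + digits

def get_name_address(header_value):
    """Split a From/To header value into (name, email address)."""
    return parseaddr(header_value)

def get_decimal(text):
    """Return the first decimal number in text, or None if there is none."""
    match = DECIMAL.search(text)
    return match.group(0) if match else None

def get_text(text):
    """Return the first line of text without leading / trailing blanks."""
    first_line = text.split("\n", 1)[0]
    return first_line.strip(" \t\r")
```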

A single action group, "NewEMail", transmits the parsed fields to the database.

Also requires a database / ODBC connection. File SQL.txt accompanies the installation: it contains an SQL statement for creating the "eauction" table used by EMail-To-Database.dss.

File-Count-Keywords.dss

Counts words and keywords in a group of HTML files. In this example the keywords relate to "email" and "parsing"; see the "Keywords" string set definition. User must specify input files and keywords of interest. Produces HTML and text output.

Note: The link from the Start ("*") node to the "Word" node must always have the largest number (i.e. be the last link in the sequence). Links are tried in numeric order, and the general-case "Word" node matches every word, so keywords in nodes with link numbers greater than the "Word" node will never be found.
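
The ordering rule is the usual "first match wins" behaviour. The Python sketch below is an analogy, not Data Splitter code: the matcher lists stand in for numbered links, and all names are made up for the illustration.

```python
KEYWORDS = {"email", "parsing"}

def is_keyword(token):
    return token.lower() in KEYWORDS

def is_word(token):
    # General case: any alphabetic token counts as a word.
    return token.isalpha()

def classify(token, matchers):
    """Try each matcher in order and return the name of the first that matches."""
    for name, predicate in matchers:
        if predicate(token):
            return name
    return None

# Keyword link numbered before the general "Word" link: keywords are found.
GOOD_ORDER = [("Keyword", is_keyword), ("Word", is_word)]
# "Word" link numbered first: it swallows every keyword.
BAD_ORDER = [("Word", is_word), ("Keyword", is_keyword)]
```

With GOOD_ORDER the token "email" is classified as "Keyword"; with BAD_ORDER the same token is classified as "Word" and the keyword matcher is never reached.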

Uses node group HTML-element.dsss to skip over HTML tags.

File-Count-Lines.dss

Determines the number of lines (new line characters) in a group of text files. User must specify input files. Produces HTML output.
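
For comparison, the same count can be sketched in a few lines of Python. The function names are hypothetical, and the sketch counts line feed bytes only, which is an assumption about what "new line characters" means here.

```python
def count_newlines(data: bytes) -> int:
    """Count line feed characters in a block of bytes."""
    return data.count(b"\n")

def count_file_lines(path, chunk_size=65536):
    """Count line feeds in a file, reading in chunks so large files fit in memory."""
    total = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            total += count_newlines(chunk)
    return total
```

Note that a final line without a terminating newline is not counted under this definition.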

File-CRLF-LF.dss

Replaces carriage return / line feed (CRLF) sequences with single line feed characters (LF). User must specify input files and an existing output directory.
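
A minimal Python sketch of the same conversion, for readers who want to check results independently. The function names are hypothetical; the check for the output directory mirrors the requirement that it already exists.

```python
from pathlib import Path

def crlf_to_lf(data: bytes) -> bytes:
    """Replace every CRLF pair with a single LF; lone CR characters are left alone."""
    return data.replace(b"\r\n", b"\n")

def convert_file(src, out_dir):
    """Write a converted copy of src into out_dir, which must already exist."""
    out_dir = Path(out_dir)
    if not out_dir.is_dir():
        raise NotADirectoryError(f"output directory does not exist: {out_dir}")
    dest = out_dir / Path(src).name
    dest.write_bytes(crlf_to_lf(Path(src).read_bytes()))
    return dest
```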

File-Filter-Unprintables.dss

Extracts printable characters (ASCII 32-126) from the input, discards everything else. User must specify input files. Produces "cleaned up" output with newlines where the unprintable characters were.
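
The filter can be approximated in Python as shown below. This is a sketch under two assumptions: the printable range defaults to the standard 32-126 (space through tilde) and is adjustable through the low and high parameters, and each run of consecutive unprintable characters is replaced by a single newline. The sample itself may emit one newline per character instead.

```python
def filter_unprintables(data: bytes, low: int = 32, high: int = 126) -> bytes:
    """Keep bytes in [low, high]; replace each run of other bytes with one newline."""
    out = bytearray()
    in_run = False  # True while skipping a run of unprintable bytes
    for byte in data:
        if low <= byte <= high:
            out.append(byte)
            in_run = False
        elif not in_run:
            out.append(0x0A)  # newline marks where unprintables were dropped
            in_run = True
    return bytes(out)
```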

File-Generate-Site-Index.dss

Generates a website index from a group of HTML files. Extracts the content of the <TITLE> tag and the "description" META tag, and generates a single HTML file, siteindex.htm. User must specify the input files and may have to modify the hard-wired META tag search string :

		<META name="description" content=

... depending on how those tags are coded in the input files.
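
To show the kind of extraction involved, here is a hedged Python sketch that pulls the same two items with the standard library HTML parser. It is not the sample's method: the sample uses a hard-wired search string, whereas this sketch parses tags, so it tolerates differences in case and attribute order. The class and function names, and the layout of the generated index entry, are assumptions.

```python
from html import escape
from html.parser import HTMLParser

class TitleDescriptionParser(HTMLParser):
    """Collect the <TITLE> text and the "description" META content."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # tag and attribute names arrive lowercased
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def extract_title_description(html_text):
    parser = TitleDescriptionParser()
    parser.feed(html_text)
    parser.close()
    return parser.title.strip(), parser.description

def index_entry(filename, html_text):
    """Build one entry for the index page (hypothetical layout)."""
    title, description = extract_title_description(html_text)
    return '<p><a href="%s">%s</a><br>%s</p>' % (
        escape(filename), escape(title), escape(description))
```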

File-HTML-Generate-Line-Breaks.dss

Transforms line breaks (carriage return / line feed pairs) to HTML <BR> tags. Used to post-process the output of sample EMail-Search.dss (above).

There are two nodes in this solution to handle the possibility that the input contains a mixture of CRLF and LF newlines. The solution looks for CRLF first and LF second, and converts both to HTML line breaks.
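
The order of the two tests matters: if LF were tried first, every CRLF would leave a stray carriage return behind. A short Python sketch of the same idea follows, using a regular expression whose alternatives are tried left to right. The function name is hypothetical, and whether the sample keeps a newline after each <BR> is not specified here, so the sketch simply replaces the newline.

```python
import re

# Alternatives are tried in order: CRLF first, then a bare LF.
NEWLINE = re.compile(r"\r\n|\n")

def add_html_line_breaks(text: str) -> str:
    """Convert CRLF and LF newlines to HTML <BR> tags."""
    return NEWLINE.sub("<BR>", text)
```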

File-LF-CRLF.dss

Replaces line feed characters with carriage return / line feed sequences. User must specify input files and an existing output directory.
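
A Python sketch of the reverse conversion is given below. It assumes that line feeds already preceded by a carriage return should be left alone, so that running it on a file with mixed newlines does not produce CR CR LF; whether the sample makes the same choice is not stated here. The function name is hypothetical.

```python
import re

# A line feed that is not already preceded by a carriage return.
BARE_LF = re.compile(rb"(?<!\r)\n")

def lf_to_crlf(data: bytes) -> bytes:
    """Replace bare LF characters with CRLF, leaving existing CRLF pairs intact."""
    return BARE_LF.sub(b"\r\n", data)
```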

File-Search.dss

Searches for one or more text strings in the input files. User specifies :

the input files
the search items

The "search items" are defined as a string set. Specify the search text in the "Text" column of "search items".

Running File-Search.dss produces :

an HTML output file
a text output file
a list of the input files that contain "hits"
a list of "stats" (totals, etc.)

File-Search-Replace.dss

Searches for and replaces one or more text strings in the input files. Specify the search text in the "Text" column of the "new text" string set; specify the replacement text in the "Other text" column.
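
The two-column string set can be pictured as a list of (search, replacement) pairs. The Python sketch below applies such a list in a single pass, so that replacement text is never searched again. The function name is hypothetical, and case-sensitive matching is an assumption.

```python
import re

def replace_all(text, pairs):
    """Apply (search, replacement) pairs in one pass over text.

    pairs mirrors the "Text" / "Other text" columns of the string set.
    """
    if not pairs:
        return text
    lookup = dict(pairs)
    # Longest search strings first, so "category" is preferred over "cat".
    ordered = sorted(lookup, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(item) for item in ordered))
    return pattern.sub(lambda match: lookup[match.group(0)], text)
```

For example, replace_all("the cat sat on the mat", [("cat", "dog"), ("mat", "rug")]) gives "the dog sat on the rug".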

Web-Extract-Number.dss

This is an HTML parser that extracts two items from each web page in the input URL list :

the content of the first <h1> tag
the decimal number following the specified label

Specify the label by pressing the Define Label button. Press Run and wait for scanning to complete, then press View Results.

The output is an HTML file with a table containing, for each input URL :

the extracted items
the source URL
the date and time the web page was fetched

The HTML header for the output file is contained in text file Web-Extract-Number-Header.txt (file variable HTMLHeader). You can modify this text file to change the output's appearance (font, color, etc.).

This example extracts the content of an <h1> tag and a decimal number. It can be modified to extract from another identifying tag (<title>, for example), and to extract other data formats (numbers without decimal points, text, etc.).
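
The extraction itself can be sketched in Python as follows. This is an illustration only: it works on HTML text that has already been fetched, the function name and the sample label are hypothetical, and the regular expressions are a simplification that will not cope with every page.

```python
import re

H1 = re.compile(r"<h1\b[^>]*>(.*?)</h1>", re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")

def extract_h1_and_number(html_text, label):
    """Return (content of first <h1>, decimal number following label).

    Either item is None when it cannot be found.
    """
    h1 = H1.search(html_text)
    heading = TAG.sub("", h1.group(1)).strip() if h1 else None

    # Drop tags so markup between the label and the number is ignored.
    plain = TAG.sub(" ", html_text)
    number_match = re.search(re.escape(label) + r"\D*?(\d+\.\d+)", plain)
    number = number_match.group(1) if number_match else None
    return heading, number
```

Changing the first pattern to match <title> instead of <h1>, or loosening the number pattern to accept digits without a decimal point, corresponds to the modifications described above.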

Web-Extract-Title-Header.dss

This HTML parser extracts two items from each web page in the input URL list :

the content of the web page's <title> tag
the content of the first header tag as defined in string set HeaderTag

The output is a web page (HTML file) containing a brief listing for each input URL :

the extracted title tag
the extracted header tag
the source URL
the date and time the web page was fetched

The HTML header for the output file is contained in text file Web-Extract-Title-Header.txt (file variable HTMLHeader). You can modify this text file to change the output's appearance (font, color, etc.).

Web-Watch-Words.dss

This sample watches a list of websites for keyword occurrences. The user specifies :

the URLs to watch
the words to watch for
the HTML tags to watch
the timer start time
the timer interval

The output is a single HTML file (web page) displaying the time of the scan, the URLs of the pages containing the search words, and the text containing the search words.