Guide to the sample solutions accompanying
the Data Splitter installation
- EMail-Search
- EMail-Search-Word-Pairs
- EMail-To-Database
- File-Count-Keywords
- File-Count-Lines
- File-CRLF-LF
- File-Filter-Unprintables
- File-Generate-Site-Index
- File-HTML-Generate-Line-Breaks
- File-LF-CRLF
- File-Search
- File-Search-Replace
- Web-Extract-Number
- Web-Extract-Title-Header
- Web-Watch-Words
EMail-Search.dss
Finds messages containing any of the text strings listed in the "SearchText" string set. User must specify input folders and "SearchText" items. Produces HTML output.
Makes use of post-processor File-HTML-Generate-Line-Breaks.dss, which restores the line-by-line appearance of the original emails that is lost in the initial conversion to HTML format.
- View the final results in output file "Results" (View | Results).
- View the intermediate results without the line breaks in output file "TempFile" (View | TempFile).
Note: Prior to running the sample email parsers it may be advisable to set the message profile in Options | Message.
EMail-Search-Word-Pairs.dss
Finds messages with proximate text strings, i.e. two strings near each other. User must specify input folders and the "Word1/2" lists. Produces HTML output.
Makes use of 3 node groups :
- HTML-Strip: converts HTML tags to text so they will display as-is in the HTML output
- EMail-SearchWordPairsInner: does the actual formatting of an email that has been determined to have word pairs near each other
- HTML-Add-Line-Breaks: restores line breaks in HTML, similar to the File-HTML-Generate-Line-Breaks post-processor
This example locates emails with the words "e-mail" OR "email" and variations on the word "parse" somewhere near each other in the message body, "near" being defined as within 200 characters. See definition of Pattern "(near)".
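The proximity test described above can be illustrated outside Data Splitter. The following Python sketch is not the product's "(near)" pattern; it only shows the same idea with a regular expression. The function name, the word lists, and the use of the stem "pars" to cover variations on "parse" are assumptions made for illustration.

```python
import re

NEAR = 200  # maximum gap in characters, matching the "near" distance described above


def has_word_pair(body, words1, words2, near=NEAR):
    """Return True if a string from words1 occurs within `near` characters
    of a string from words2, in either order. Both lists must be non-empty."""
    alt1 = "|".join(re.escape(w) for w in words1)
    alt2 = "|".join(re.escape(w) for w in words2)
    pattern = re.compile(
        rf"(?:{alt1}).{{0,{near}}}?(?:{alt2})|(?:{alt2}).{{0,{near}}}?(?:{alt1})",
        re.IGNORECASE | re.DOTALL,  # DOTALL lets the gap span line breaks
    )
    return pattern.search(body) is not None


# Hypothetical word lists standing in for the "Word1/2" lists.
WORD1 = ["e-mail", "email"]
WORD2 = ["pars"]  # assumed stem: matches "parse", "parser", "parsing"
```

For example, has_word_pair("How do I parse this e-mail?", WORD1, WORD2) is true, while a message with the two words more than 200 characters apart is not reported.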
EMail-To-Database.dss
Sample email parser that transmits extracted fields to a database. This example parses eBay end-of-auction notification messages into database table "eauction". Can be customized for other generated email formats by modifying the string sets :
- SubjectFilter: text that begins the subject field in the email header
- TextFields: a list of text field labels and their respective database destinations
- NumericFields: a list of numeric field labels and their respective database destinations
- CurrencyFields: a list of currency field labels and their respective database destinations
Makes use of 3 node groups :
- Name-Address: extracts the name and email address (RName + RMail) from the from/to fields in the email header
- Decimal-Number: gets a decimal number: digits + decimal point + digits
- Text: gets text from the input, strips leading / trailing blanks, stops at end of line
A single action group, "NewEMail", transmits the parsed fields to the database.
Also requires a database / ODBC connection. File SQL.txt accompanies the installation: it contains an SQL statement for creating the "eauction" table used by EMail-To-Database.dss.
File-Count-Keywords.dss
Counts words and keywords in a group of HTML files (keywords relating to "email" and "parsing" in this example, see the "Keywords" string set definition). User must specify input files and keywords of interest. Produces HTML and text output.
Note: The link from the Start ("*") node to the "Word" node must always have the largest number (i.e. be the last link in the sequence). Links are tried in numeric order, and the general-case "Word" node matches any word, keywords included. A keyword node whose link number is greater than that of the "Word" node will therefore never be reached, and its keywords will never be found.
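The effect of link order can be shown with a small Python sketch. It does not reproduce Data Splitter's node engine; it only assumes a "first matching rule wins" scheme, which is how the note above reads. The rule names and patterns are hypothetical.

```python
import re


def count_matches(text, rules):
    """Try the rules in order at each position; the first rule that matches wins."""
    counts = {name: 0 for name, _ in rules}
    pos = 0
    while pos < len(text):
        for name, pattern in rules:
            m = pattern.match(text, pos)
            if m and m.end() > pos:
                counts[name] += 1
                pos = m.end()
                break
        else:
            pos += 1  # no rule matched here; skip one character
    return counts


KEYWORD = re.compile(r"(?:email|parsing)\b", re.IGNORECASE)  # specific case
WORD = re.compile(r"\w+")                                    # general case

keyword_first = [("keyword", KEYWORD), ("word", WORD)]  # general case last
word_first = [("word", WORD), ("keyword", KEYWORD)]     # general case first
```

With the text "email parsing made easy", keyword_first counts 2 keywords and 2 other words, while word_first counts 4 words and no keywords, because the general rule consumes every word before the keyword rule is tried.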
Uses node group HTML-element.dsss to skip over HTML tags.
File-Count-Lines.dss
Determines the number of lines (new line characters) in a group of text files. User must specify input files. Produces HTML output.
File-CRLF-LF.dss
Replaces carriage return / line feed (CRLF) sequences with single line feed characters (LF). User must specify input files and an existing output directory.
File-Filter-Unprintables.dss
Extracts printable characters (ASCII 32-126) from the input, discards everything else. User must specify input files. Produces "cleaned up" output with newlines where the unprintable characters were.
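As a rough illustration of this kind of filter, here is a Python sketch. It makes two assumptions that the description above does not confirm: "printable" is taken to be the standard ASCII range 32-126 (space through tilde), and each run of unprintable characters is collapsed into a single newline. The function name is hypothetical.

```python
import re

# Assumption: printable means ASCII 32-126 (space through tilde).
UNPRINTABLE_RUN = re.compile(r"[^\x20-\x7e]+")


def filter_unprintables(data: str) -> str:
    """Replace each run of unprintable characters with one newline (assumed behaviour)."""
    return UNPRINTABLE_RUN.sub("\n", data)
```

Applied to text with embedded control bytes, the sketch leaves the readable fragments on separate lines.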
File-Generate-Site-Index.dss
Generates a website index from a group of HTML files. Extracts the content of the <TITLE> tag and the "description" META tag, and generates a single HTML file, siteindex.htm. User must specify the input files and may have to modify the hard-wired META tag search string :
<META name="description" content=
... depending on how those tags are coded in the input files.
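To show why the hard-wired search string matters, here is a Python sketch of the same extraction. It is not the solution's implementation: the function name and the layout of the generated index entry are assumptions. Only the META search string is taken from the text above.

```python
import re

TITLE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
# Mirrors the hard-wired search string; pages that code the tag differently
# (attribute order, quoting) would need a different pattern.
META = re.compile(r'<META name="description" content="([^"]*)"', re.IGNORECASE)


def index_entry(filename: str, html: str) -> str:
    """Build one index entry (hypothetical layout) from a page's title and description."""
    title = TITLE.search(html)
    desc = META.search(html)
    title_text = title.group(1).strip() if title else filename
    desc_text = desc.group(1).strip() if desc else ""
    return f'<p><a href="{filename}">{title_text}</a><br>{desc_text}</p>'
```

A page whose META tag is written as content="..." name="description" would yield an empty description here, which is the situation in which the search string has to be modified.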
File-HTML-Generate-Line-Breaks.dss
Transforms line breaks (carriage return / line feed pairs) to HTML <BR> tags. Used to post-process the output of sample EMail-Search.dss (above).
There are two nodes in this solution to handle the possibility that the input contains a mixture of CRLF and LF newlines. It looks for CRLF first, then for a lone LF, and converts both to HTML line breaks.
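The order of the two checks matters, as the following Python sketch suggests. It is an illustration, not the solution itself; the function name is hypothetical, and it assumes each newline is replaced by the tag rather than kept alongside it.

```python
import re

# CRLF is listed first so that a carriage return is consumed together with its line feed.
LINE_BREAK = re.compile(r"\r\n|\n")


def to_html_breaks(text: str) -> str:
    """Convert CRLF and lone LF newlines to <BR> tags."""
    return LINE_BREAK.sub("<BR>", text)
```

If LF were tested first, every CRLF pair would leave a stray carriage return in the output.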
File-LF-CRLF.dss
Replaces line feed characters with carriage return / line feed sequences. User must specify input files and an existing output directory.
File-Search.dss
Searches for one or more text strings in the input files. User specifies :
- the input files
- the search items
The "search items" are defined as a string set. Specify the search text in the "Text" column of "search items".
Running File-Search.dss produces :
- an HTML output file
- a text output file
- a list of the input files that contain "hits"
- a list of "stats" (totals, etc.)
File-Search-Replace.dss
Searches for and replaces one or more text strings in the input files. Specify the search text in the "Text" column of the "new text" string set; specify the replacement text in the "Other text" column.
Web-Extract-Number.dss
This is an HTML parser that extracts two items from each web page in the input URL list :
- the content of the first <h1> tag
- the decimal number following the specified label
Specify the label by pressing the Define Label button. Press Run and wait for scanning to complete, then press View Results.
The output is an HTML file with a table containing, for each input URL :
- the extracted items
- the source URL
- the date and time the web page was fetched
The HTML header for the output file is contained in text file Web-Extract-Number-Header.txt (file variable HTMLHeader). You can modify this text file to change the output's appearance (font, color, etc.).
This example extracts the content of an <h1> tag and a decimal number. It can be modified to extract from another identifying tag (<title>, for example), and to extract other data formats (numbers without decimal points, text, etc.).
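For readers who want to see the two extractions in isolation, here is a Python sketch that works on an HTML string that has already been fetched. It is not the solution's node logic: the function name and the sample label are assumptions, and a decimal number is taken to mean digits + decimal point + digits, as in the Decimal-Number node group described earlier.

```python
import re

H1 = re.compile(r"<h1[^>]*>(.*?)</h1>", re.IGNORECASE | re.DOTALL)


def extract_items(html: str, label: str):
    """Return (content of first <h1>, decimal number following `label`), or None for a missing item."""
    heading = H1.search(html)
    # Skip any non-digit characters (currency sign, blanks) between label and number.
    number = re.search(re.escape(label) + r"\D*?(\d+\.\d+)", html)
    return (
        heading.group(1).strip() if heading else None,
        float(number.group(1)) if number else None,
    )
```

Swapping the H1 pattern for a <title> pattern, or the number pattern for one without a decimal point, corresponds to the modifications mentioned above.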
Web-Extract-Title-Header.dss
This HTML parser extracts two items from each web page in the input URL list :
- the content of the web page's <title> tag
- the content of the first header tag as defined in string set HeaderTag
The output is a web page (HTML file) containing a brief listing for each input URL :
- the extracted title tag
- the extracted header tag
- the source URL
- the date and time the web page was fetched
The HTML header for the output file is contained in text file Web-Extract-Title-Header.txt (file variable HTMLHeader). You can modify this text file to change the output's appearance (font, color, etc.).
Web-Watch-Words.dss
This sample watches a list of websites for keyword occurrences. The user specifies :
- the URLs to watch
- the words to watch for
- the HTML tags to watch
- the timer start time
- the timer interval
The output is a single HTML file (web page) displaying the time of the scan, the URLs of the pages containing the search words, and the text containing the search words.