Data Splitter email parsing

For each email in an input folder Data Splitter loads the header and body as a single input stream.

The header contains one line for each of the following fields:

Field         Field starts with
subject       "Subject:"
sender        "From:"
recipient     "To:"
date / time   "Date:"

The header and body are separated by a blank line, as required by the Internet Message Format specification (RFC 5322).
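The header/body split can be illustrated outside Data Splitter. The following Python sketch is not how Data Splitter works internally; it only shows the same idea under simple assumptions (one line per header field, no folded headers). The function name split_email and the sample message are made up for illustration.

```python
def split_email(raw):
    """Split a raw email into header fields and body at the first blank line."""
    # Normalise line endings so CRLF and LF messages behave the same.
    text = raw.replace("\r\n", "\n")
    header_text, _, body = text.partition("\n\n")

    # Field names and prefixes follow the list of header fields above.
    prefixes = {
        "subject": "Subject:",
        "sender": "From:",
        "recipient": "To:",
        "date": "Date:",
    }
    fields = {}
    for line in header_text.split("\n"):
        for name, prefix in prefixes.items():
            # Keep only the first occurrence of each field.
            if line.startswith(prefix) and name not in fields:
                fields[name] = line[len(prefix):].strip()
    return fields, body


# Hypothetical sample message, for illustration only.
sample = (
    "From: Seller <seller@example.com>\n"
    "To: Buyer <buyer@example.com>\n"
    "Subject: Item sold\n"
    "Date: Mon, 1 Jan 2024 10:00:00 +0000\n"
    "\n"
    "Item name:   Widget\n"
    "Subject: this line is body text\n"
)
fields, body = split_email(sample)
```

Because the split happens at the first blank line, the "Subject:" line inside the body is never mistaken for a header field.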

The techniques described in the Data Splitter tutorial can be used to parse emails.

Email parsing example

screen shot: eBay end-of-auction email parser

Notice that the header and the body are parsed separately. While parsing the header, the start node is set to "Header". Once "EndHeader" (the first blank line) is recognized, the start node is set to "Body" and stays there for the rest of the message. The start node must be reset to "Header" in the pre-stream actions so that each new email starts in the header. This practice is advisable for emails whose bodies may contain header-like text ("Subject:", "From:", "To:", etc.), for example in replies or forwarded messages.
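The effect of switching the start node can be approximated with a two-state loop. This Python sketch is only an analogy for the "Header" / "Body" start nodes, not Data Splitter's own mechanism; the names parse_stream and HEADER_PREFIXES and the forwarded message are hypothetical.

```python
# Header prefixes recognised only while in the "Header" state.
HEADER_PREFIXES = {"Subject:": "subject", "From:": "sender", "To:": "recipient"}


def parse_stream(lines):
    """Parse one message, switching from "Header" to "Body" at the first blank line."""
    state = "Header"   # reset at the start of every message (the pre-stream step)
    record = {}
    body_lines = []
    for line in lines:
        if state == "Header":
            if line.strip() == "":
                state = "Body"   # the first blank line plays the role of "EndHeader"
                continue
            for prefix, field in HEADER_PREFIXES.items():
                if line.startswith(prefix):
                    record[field] = line[len(prefix):].strip()
        else:
            # In the "Body" state, header-like lines are treated as ordinary text.
            body_lines.append(line)
    record["body"] = "\n".join(body_lines)
    return record


# Hypothetical forwarded message whose body repeats header-like lines.
forwarded = [
    "From: alice@example.com",
    "To: bob@example.com",
    "Subject: Fwd: auction ended",
    "",
    "---- Forwarded message ----",
    "From: ebay@example.com",
    "Subject: auction ended",
]
record = parse_stream(forwarded)
```

The quoted "From:" and "Subject:" lines end up in the body text and do not overwrite the values taken from the real header.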

Note the use of a null node, "EndMessage". The null node is recognized at the end of the input stream (the end of the email body). At that point, action group "NewEMail" sends the values collected from the email to the database.
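The "collect while parsing, write once at end of input" pattern can be sketched as follows. This is an illustrative Python example using an in-memory SQLite database, not the ODBC connection or the "NewEMail" action group themselves; the table eauction_demo and its three columns are assumptions made up for the example and do not reflect the real "eauction" table.

```python
import sqlite3

# Hypothetical demo table; the real table definition ships in SQL.txt.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE eauction_demo (subject TEXT, sender TEXT, item_name TEXT)")


def process_message(conn, lines):
    """Collect values while reading, then write one row when input is exhausted."""
    values = {"subject": None, "sender": None, "item_name": None}
    for line in lines:
        if line.startswith("Subject:") and values["subject"] is None:
            values["subject"] = line[len("Subject:"):].strip()
        elif line.startswith("From:") and values["sender"] is None:
            values["sender"] = line[len("From:"):].strip()
        elif line.startswith("Item name:"):
            values["item_name"] = line[len("Item name:"):].strip()
    # Reaching this point corresponds to the null node: the stream has ended,
    # so the collected values are sent to the database as a single row.
    conn.execute(
        "INSERT INTO eauction_demo VALUES (:subject, :sender, :item_name)", values
    )
    conn.commit()


process_message(
    conn,
    ["From: seller@example.com", "Subject: Item sold", "", "Item name:   Widget"],
)
rows = conn.execute("SELECT subject, sender, item_name FROM eauction_demo").fetchall()
```

Writing the row only at the end of the stream means a partially parsed email never produces a partial record.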

See sample EMail-To-Database.dss.

In the bodies of these emails, the data we're interested in has the general format:

   label ... spaces ... data ... end-of-line.

A label is descriptive text, typically followed by a colon, for example "Item name:".
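For readers who think in regular expressions, the "label ... spaces ... data ... end-of-line" format corresponds roughly to the pattern below. This is a Python sketch for comparison only, not Data Splitter syntax. "Item name" comes from the example above; the other labels and the sample body are hypothetical.

```python
import re

# "Item name" is from the text above; the other labels are made up.
LABELS = ["Item name", "Item number", "Final price"]

# label, colon, optional blanks/tabs, data, optional trailing blanks, end of line
LINE_RE = re.compile(
    r"^(?P<label>" + "|".join(re.escape(label) for label in LABELS) + r")"
    r":[ \t]*(?P<data>.*?)[ \t]*$",
    re.MULTILINE,
)


def extract_fields(body):
    """Return a dict mapping each recognised label to its data."""
    return {m.group("label"): m.group("data") for m in LINE_RE.finditer(body)}


# Hypothetical email body.
body = (
    "Thanks for bidding.\n"
    "Item name:    Vintage camera  \n"
    "Item number:  12345\n"
    "Final price:  $25.50\n"
)
result = extract_fields(body)
```

Lines that do not start with a known label, such as the greeting, are simply skipped.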

This solution makes use of the following patterns:

*            zero or more of any character
Num          one or more numeric digits
WS0+         zero or more whitespace characters (blank, tab, etc.)
EndHeader    a blank line (two line feeds)
EndMessage   null pattern indicating end of email body

It also makes use of string sets:

SubjectFilter    text that begins the subject field in the email header
TextFields       a list of text fields expected in the email body, and their destinations in the database
NumericFields    a list of numeric field labels and their respective database destinations
CurrencyFields   a list of currency field labels and their respective database destinations

and node groups:

Name-Address     extracts the name and email address (RName + RMail) from the "From:" / "To:" fields in the email header
Decimal-Number   gets a decimal number: digits + decimal point + digits
Text             gets text from the input, strips leading / trailing blanks, stops at end of line
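As a rough comparison, the three node groups behave much like the small Python helpers below. The actual node definitions are in the sample file; these regular expressions are only approximations, and the function names and the accepted address formats are assumptions.

```python
import re

# Optional quoted or bare display name followed by <address>.
NAME_ADDRESS = re.compile(r'^\s*"?(?P<name>[^"<]*?)"?\s*<(?P<mail>[^<>\s]+)>\s*$')
# Digits + decimal point + digits.
DECIMAL_NUMBER = re.compile(r"\d+\.\d+")


def name_address(value):
    """Return (name, address) from a From:/To: value."""
    m = NAME_ADDRESS.match(value)
    if m:
        return m.group("name").strip(), m.group("mail")
    return "", value.strip()   # bare address with no display name


def decimal_number(text):
    """Return the first decimal number in text as a string, or None."""
    m = DECIMAL_NUMBER.search(text)
    return m.group(0) if m else None


def text_value(line):
    """Strip leading and trailing blanks and tabs from a single line."""
    return line.strip(" \t")


quoted = name_address('"Jane Doe" <jane@example.com>')
plain = name_address("Jane Doe <jane@example.com>")
bare = name_address("jane@example.com")
price = decimal_number("Final price: $25.50")
text = text_value("  Vintage camera \t")
```

Real-world address headers can be more varied (comments, multiple recipients), so a production parser would need more cases than this sketch covers.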

Here's the definition of the TextFields string set:

screen shot: TextFields string set, maps email fields to database

This solution also makes use of a database / ODBC connection. File "SQL.txt" accompanies the installation; it contains an SQL statement for creating the "eauction" table used by this sample.