Extracting data from emails

eBay notification parsing example

EMail-To-Database.dss separates an eBay end-of-auction notification email into its components and loads the fields into a database table called "eauction" :

Sample database output generated from parsed email

(MS Visual FoxPro sample).

To obtain similar results :

SQL statement for creating database table "eauction":  sqlebay.txt

It's possible to get results with Data Splitter using ready-made solutions such as "EMail-To-Database.dss" with minor modifications, and without even understanding how they work.   However, with the help of the Data Splitter graphical interface you can modify and create your own solutions/.DSS files and extract from stored emails virtually any data that interests you.   Consult Data Splitter help and this website's tutorial.

EMail-To-Database.dss makes use of string sets.   The string sets used in this solution are :

The SubjectFilter simply contains the text string "eBay End of Auction".

Here's the "TextFields" definition :

screen shot: TextFields string set, maps email fields to database

This maps the email fields to their respective database destinations.

The effect of this mapping is to transmit the data that follows the specified labels ("Item name:", for example) to the corresponding fields in a new record in the "eauction" database table.   There are other string sets that perform similar functions for the various data types (NumericFields, CurrencyFields).

This solution can easily be adapted to parse a wide variety of emails that contain information in the

    label: <data>

format by modifying the string sets :

... modifying the action group (menu option: Definitions | Action groups), and modifying the comment (menu option: Solution | Comment).

Solution "EMail-To-Database.dss" accompanies the Data Splitter installation.

Contact Data Splitter support for assistance in creating other email parsers.

Here's the EMail-To-Database.dss parser :

screen shot: eBay end-of-auction email parser

This parser has two major sections:  Header and Body.   Parsing starts with the header.   When the end-of-header indicator is encountered the start node is changed to the Body node.   This keeps message header and body parsing distinct (this is not always necessary, but is good practice nonetheless).

Header parsing looks for "Subject:", "From:", "To:" and "Date:" fields at the beginning of the message.   Filtering is performed on the contents of the Subject field; the other fields are simply parsed and sent to their respective "targets" (database fields in this case).   "Subject:" parsing considers two cases :

  1. the subject begins with one of the entries in the SubjectFilter string set,
  2. the subject begins with something else.

In the first case the subject field is constructed from the SubjectFilter and the rest of the line.   In the latter case parsing is terminated with the "Halt" action.

When the end-of-header marker is encountered in the input (two "newlines", per SMTP specifications) the start node is switched to the Body node. The parser then searches for three types of data: text fields, numeric fields and currency fields.   Each data type has its own characteristics and parsing rules.   Each data type therefore has its own string set.   Data that is not specified in one of the string sets is ignored.

When there is no more email data to parse the EndMessage ("null") node is recognized.   At this point the NewEMail action group is executed.   The NewEMail action group sets a few additional database fields and writes a new "row" (record) to the database.