How it works ...
What's the idea?
Data Splitter's design is based on a patented invention: Configurable Pattern Recognition and Filtering Tool.
The fundamental building block is called a subpattern. It consists of :
- A set designator
- The minimum number of occurrences of members of that set
- The maximum number of occurrences of members of that set
Examples of sets :
- letters (a-z, A-Z)
- digits (0-9)
- "anything" - the set including all possible values
The minimum can be zero or more occurrences; the maximum is greater than or equal to the minimum.
Examples of subpatterns :
- one or more digits
- exactly ten printable characters
- zero or more of any value
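As a rough illustration only (this is not Data Splitter's solution format), a subpattern can be sketched in Python as a set-membership test plus a minimum and a maximum. The class name `Subpattern`, the `match_length` helper, and the use of `None` for "no upper limit" are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Subpattern:
    """A set designator plus minimum and maximum occurrence counts."""
    name: str
    in_set: Callable[[str], bool]   # set designator, expressed as a membership test
    minimum: int                    # zero or more
    maximum: Optional[int] = None   # assumption: None stands for "no upper limit"

    def __post_init__(self):
        if self.minimum < 0:
            raise ValueError("minimum must be zero or more")
        if self.maximum is not None and self.maximum < self.minimum:
            raise ValueError("maximum must be >= minimum")

    def match_length(self, text: str, start: int = 0) -> Optional[int]:
        """Count set members at `start` (up to maximum); None if below minimum."""
        count = 0
        while (start + count < len(text)
               and (self.maximum is None or count < self.maximum)
               and self.in_set(text[start + count])):
            count += 1
        return count if count >= self.minimum else None


# The three example subpatterns from the list above.
one_or_more_digits = Subpattern("digits", str.isdigit, 1)
ten_printable = Subpattern("printable", str.isprintable, 10, 10)
zero_or_more_any = Subpattern("anything", lambda c: True, 0)
```

With these definitions, `one_or_more_digits.match_length("123abc")` returns 3, while `ten_printable.match_length("short")` returns `None` because fewer than ten characters are available.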
These subpatterns can be linked together in any order. One of the subpatterns is designated as the "start node". As the input is scanned, the "machine" moves from one subpattern to the next, deciding at certain points that subpatterns have been recognized in the input. When this recognition occurs, actions can be performed.
Examples of actions :
- transmit one of the recognized subpatterns to a file or a database
- transmit a piece of predefined text to a database table
- execute SQL
- execute a user-defined function
This scheme makes it possible to perform a wide variety of useful data transformation tasks.
Data Splitter's design is based on the idea that many data transformation tasks (searches, conversions, extractions, parsing, and so on) involve the same fundamental repetitive process :
- recognition of a pattern in the input,
- transition to another "state" based on recognition of the next pattern in the input.
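The recognize-then-transition loop can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Data Splitter's actual engine: the node layout (a tuple of membership test, minimum, maximum, next node name, action), the `run` function, and the greedy matching without backtracking are all choices made for this sketch.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical node layout:
# (membership test, minimum, maximum or None, next node name or None, action or None)
Node = Tuple[Callable[[str], bool], int, Optional[int], Optional[str],
             Optional[Callable[[str], None]]]


def run(nodes: Dict[str, Node], start: str, data: str) -> bool:
    """Walk linked subpatterns from the start node; True if all are recognized."""
    pos, name = 0, start
    while name is not None:
        in_set, minimum, maximum, next_name, action = nodes[name]
        begin = pos
        # Greedily consume members of this node's set, up to its maximum.
        while (pos < len(data)
               and (maximum is None or pos - begin < maximum)
               and in_set(data[pos])):
            pos += 1
        if pos - begin < minimum:
            return False                 # subpattern not recognized
        if action is not None:
            action(data[begin:pos])      # recognition occurred: perform the action
        name = next_name                 # transition to the next state
    return True


captured: List[str] = []                 # stands in for "transmit to a file or database"
nodes = {
    # Start node: zero or more non-digits, then move on to "number".
    "skip": (lambda c: not c.isdigit(), 0, None, "number", None),
    # One or more digits; the action records what was recognized.
    "number": (str.isdigit, 1, None, None, captured.append),
}
ok = run(nodes, "skip", "Order no. 4471 shipped")
```

After this runs, `ok` is `True` and `captured` holds `["4471"]`. Because the sketch never backtracks, the start node uses the set "non-digits" rather than "anything"; a real engine may resolve such overlaps differently.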
Data Splitter is very general. It knows about sets and patterns, states and transitions, and views the input as a stream of values (e.g. bytes or characters). It has no internal knowledge of XML, HTML, RTF, or even text files; it can be configured to work with all of them. The details of the transformation task are specified in the solution (.DSS file).
Another way to look at it: the task-specific logic is contained in the configuration instead of the program itself. The user has several options :
- using a ready-made Data Splitter solution (.DSS file),
- using a ready-made solution with some modification,
- creating a Data Splitter solution from scratch.
Of course, there are limits to what Data Splitter can do. Developers can extend Data Splitter's capability with user-defined functions in a custom DLL. See Data Splitter help and the development tools topic for more information.
Parsing
Parsing a stream of data means breaking it down into component parts according to a set of rules.
Parsing programs typically check each character in a data stream and group the characters into units known as "tokens". What constitutes a token can differ from one program to the next, or from one set of grammatical rules to the next. With Data Splitter the tokens are entirely user-defined.
In a web page, for example, the tokens would typically be HTML tags (<TABLE>, for example), and the data between the tags.
In an email the tokens are typically labels ("Subject:", for example) and their associated data.
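To make the idea of user-defined tokens concrete, here is a small Python sketch that splits email header lines into "label" and "data" tokens. The function name `tokenize_header` and the token names are hypothetical, and the sample lines are invented for illustration.

```python
from typing import List, Tuple


def tokenize_header(lines: List[str]) -> List[Tuple[str, str]]:
    """Split each 'Label: data' line into a label token and a data token."""
    tokens = []
    for line in lines:
        label, sep, data = line.partition(":")
        if sep:  # only lines that actually contain a colon yield tokens
            tokens.append(("label", label + sep))
            tokens.append(("data", data.strip()))
    return tokens


sample = ["Subject: Hello", "From: alice@example.com"]
tokens = tokenize_header(sample)
```

Here `tokens` contains four entries: the label `Subject:` with its data `Hello`, followed by the label `From:` with its data `alice@example.com`.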
Regular expressions
Many programming systems use regular expressions to parse data. Here's a regular expression for parsing an email address :
'^[a-zA-Z0-9_\.\-]+@[a-zA-Z0-9\-]+\.[a-zA-Z0-9\-\.]+$'
This means, in a nutshell, characters with an "at" sign (@) in the middle. This part :
a-zA-Z0-9
means alphanumeric characters: A-Z upper- and lower-case, plus digits.
With Data Splitter you would define the alphanumeric set one time, calling it "alphanum" or something similar.
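The "define the set once" idea can be illustrated in Python. This is only a sketch of the concept, not Data Splitter configuration; the names `ALPHANUM`, `run_length`, and `looks_like_email` are assumptions. It mirrors the structure of the regular expression above (a run of set members, an "at" sign, another run, a period, a final run) using named sets in place of repeated character classes.

```python
import string

# The alphanumeric set is defined one time and reused.
ALPHANUM = frozenset(string.ascii_letters + string.digits)
LOCAL_SET = ALPHANUM | frozenset("_.-")    # characters allowed before the @
HOST_SET = ALPHANUM | frozenset("-")       # characters allowed right after the @
DOMAIN_SET = ALPHANUM | frozenset("-.")    # characters allowed after the first period


def run_length(text: str, start: int, allowed: frozenset) -> int:
    """Number of consecutive characters from `start` that belong to `allowed`."""
    end = start
    while end < len(text) and text[end] in allowed:
        end += 1
    return end - start


def looks_like_email(text: str) -> bool:
    """One or more LOCAL_SET, '@', one or more HOST_SET, '.', one or more DOMAIN_SET."""
    n = run_length(text, 0, LOCAL_SET)
    if n == 0 or n >= len(text) or text[n] != "@":
        return False
    pos = n + 1
    n = run_length(text, pos, HOST_SET)
    if n == 0:
        return False
    pos += n
    if pos >= len(text) or text[pos] != ".":
        return False
    pos += 1
    n = run_length(text, pos, DOMAIN_SET)
    return n > 0 and pos + n == len(text)
```

For example, `looks_like_email("user.name@example.co.uk")` returns `True`, while `looks_like_email("a@b")` returns `False` because no period follows the host part. Since the sets are plain collections of characters, the period and hyphen need no escaping here.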
Data Splitter has no special characters to worry about. With regular expressions some characters have special meanings, so if you need to handle these special characters in your input you have to "escape" them with the backslash ('\') character. The period is a special character; it can also appear in email addresses - that's why you see it, escaped :
\.
three times in the above regular expression. Data Splitter's design avoids the special-character issue entirely.
Data Splitter provides an alternative to regular expressions. As such, support for regular expressions is not planned for any future release.
State machines, state awareness
This is a little technical, but it is helpful to understand the general idea: a state machine can consider the overall structure of an "input stream", for example a file or an email message. Put simply, a state machine knows where it's been.
Consider email messages. In general, an email is composed of a header followed by a body. So, the first state entered when scanning an email can be the "header" state, followed by the "body" state. Within the header state there can be a state for each component, for example the "subject" state, the "from" state, the "date" state, and so on.
To illustrate the importance of state-awareness, consider a message format with "From" and "To" addresses :
From:
Address:
…
To:
Address:
Simply looking for "Address:" isn't enough; you have to know which address you're dealing with, i.e. whether it's "From" or "To". Data Splitter's state-aware design can handle this type of format easily. Some parsing utilities can't handle this situation or have to be specially "rigged" to do so.
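A minimal Python sketch shows why the state matters. It assumes a simplified version of the format above in which each address follows "Address:" on the same line; the function name `parse_addresses` and the sample addresses are hypothetical.

```python
from typing import Dict, List


def parse_addresses(lines: List[str]) -> Dict[str, str]:
    """Attribute each 'Address:' line to the most recent 'From:' or 'To:' state."""
    state = None            # which section we are in: None, "from", or "to"
    result = {}
    for raw in lines:
        line = raw.strip()
        if line == "From:":
            state = "from"  # transition: entering the From section
        elif line == "To:":
            state = "to"    # transition: entering the To section
        elif line.startswith("Address:") and state is not None:
            # The same label means different things depending on the state.
            result[state] = line[len("Address:"):].strip()
    return result


message = [
    "From:",
    "Address: alice@example.com",
    "To:",
    "Address: bob@example.com",
]
addresses = parse_addresses(message)
```

Here `addresses` maps `"from"` to `alice@example.com` and `"to"` to `bob@example.com`. A stateless search for "Address:" would find both lines but could not tell them apart.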
Data Splitter allows configuration of a state machine. All computer programs are themselves state machines, but only a few parsing utilities (Yacc, for example) allow programming of overall state machine behavior. Data Splitter is unique in providing (requiring!) a pictorial representation of the state machine that does the job.