Data Splitter Overview
Data Splitter is a data extraction and organization tool: search, extract, and transform.
Data Splitter scans input from :
- web pages
- email messages
- the Windows clipboard
The user can define actions to send output to :
- database records
- the Windows clipboard
The basic sequence of operations is :
- create or open a solution,
The user has several options regarding Data Splitter solutions, also referred to as DSS files :
- use an existing DSS file,
- use an existing DSS file with some modification,
- create a DSS file from scratch.
Sample .DSS files accompany this program installation. Others are available on the Web at http://datasplitter.com.
Consult the Data Splitter Tutorial for additional assistance.
The remainder of this overview discusses creation and modification of Data Splitter solutions / DSS files.
Remember: wherever you are in Data Splitter, you can always press the F1 key for help.
Data Splitter solutions
A Data Splitter solution is expressed as :
- tables defining sets, patterns, input/output streams, actions, variables and other data,
- graphs (PGraphs) composed of nodes and links.
Patterns are defined in terms of sets. A set is simply a user-defined set of values (for example: whitespace characters, alphanumeric characters). A pattern is a sequence of subpatterns; a subpattern is a sequence of members of a single set, having a minimum and a maximum length. Sample pattern: one to five whitespace characters, followed by one to ten digits, and so on.
A pattern therefore has a (possibly variable) length (which can be zero) and can be compared to a section of the input data to determine whether or not the data matches the pattern.
A string is a special case of a pattern. Because defining a string as a pattern would be cumbersome, the user interface accommodates two pattern entry modes: Pattern and String.
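The relationship between sets, subpatterns, patterns and strings can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Data Splitter's internal representation: the names Subpattern, match_pattern and string_as_pattern are hypothetical, and the matcher is deliberately simple (each subpattern takes as many set members as it can, up to its maximum, with no lookahead).

```python
from dataclasses import dataclass
import string


@dataclass(frozen=True)
class Subpattern:
    """A run of members of a single set, with a minimum and maximum length."""
    members: frozenset
    min_len: int
    max_len: int


def match_pattern(subpatterns, data, pos=0):
    """Return the index just past the match at pos, or None if there is no match.

    Hypothetical helper: each subpattern consumes set members until the
    value is not in the set or the maximum length has been reached.
    """
    for sp in subpatterns:
        count = 0
        while (count < sp.max_len
               and pos + count < len(data)
               and data[pos + count] in sp.members):
            count += 1
        if count < sp.min_len:
            return None
        pos += count
    return pos


def string_as_pattern(text):
    """Express a string as a pattern: one single-member subpattern per character."""
    return [Subpattern(frozenset(ch), 1, 1) for ch in text]


WHITESPACE = frozenset(" \t\r\n")
DIGITS = frozenset(string.digits)

# Sample pattern: one to five whitespace characters followed by one to ten digits.
sample = [Subpattern(WHITESPACE, 1, 5), Subpattern(DIGITS, 1, 10)]
```

Here string_as_pattern("fool") produces four subpatterns, each with a one-member set and a length of exactly one, which shows why a separate String entry mode is more convenient.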
A PGraph is a group of nodes connected to each other by directional links. "Directional" means that a link travels from one node to another. Nodes are processed one at a time; the node currently being processed is referred to as the current node. Each node has associated with it the following types of information :
- A textual identifier, or "tag" (optional)
- One pattern
- Zero or more actions
- Zero or more links (to "next nodes")
When a node's pattern is recognized in the input stream its actions (if any) are performed.
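The information carried by a node can be pictured as a small record. The sketch below is an assumption made for illustration only: the Node class and its field names are hypothetical, and the pattern is reduced to a plain string for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    pattern: str                       # exactly one pattern (a plain string here)
    tag: Optional[str] = None          # optional textual identifier
    # zero or more actions, each called with the matched text
    actions: List[Callable[[str], None]] = field(default_factory=list)
    # zero or more directional links, kept in link order
    links: List["Node"] = field(default_factory=list)


seen = []
label = Node(pattern="Name:", tag="Label")
value = Node(pattern="Smith", tag="Value", actions=[seen.append])

# A link is directional: it travels from "Label" to "Value", not back.
label.links.append(value)

# When the node's pattern is recognized, its actions (if any) are performed.
for action in value.actions:
    action(value.pattern)
```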
Three run commands are available :
- file input
- email input
- URL input
When file input is chosen the files listed in the input Files dialog are scanned. When email input is chosen all messages in the folders listed in the input Folders dialog are scanned. When URL input is chosen the URLs listed in the input URLs dialog are scanned.
The run terminates when the end of the input is encountered or a user-defined halt action is performed.
The start of a run
A run starts by scanning the input stream until the start node's pattern is recognized. If / when the start node is recognized it becomes the current node. Its actions are performed and an attempt is then made to select, or "make a transition to", one of the "next nodes" (i.e. the nodes to which it is linked).
The start node is special in more than one way. First of all, as just noted, it is the node whose pattern is sought at the start of the scan. Also, the start node's pattern is sought whenever there is no other node to transition to: 1) after a "terminal node" (a node from which no links emanate) has been processed or 2) when a "next node" cannot be recognized. Another way to think of it: when the graph traversal gets "lost" it tries to get back on track by searching for the start node's pattern.
Selection of the next node
After a node's actions have been performed an attempt is made to make a transition to the next node. The next node is selected from those (if any) that the current node is linked to. The selection of the next node is based on :
- Link order (1..N),
- The subsequent contents of the input stream.
Subsequent contents: The input stream is scanned using an advancing "pointer", or index. When a pattern is matched (recognized) this index points to the beginning of the pattern in the input stream. After that pattern is processed the pointer advances to a position in the input stream immediately following the end of the pattern. It is this "subsequent" part of the input stream that is used to select the next node.
A node has zero or more links emanating from it to other nodes. The links identify the potential next nodes. The next nodes are examined in link order (1..N). If a next node's pattern matches the "subsequent contents" of the input stream the next node becomes the current node, which is then processed as the preceding node was. If no next node's pattern can be matched scanning resumes at the start node.
A node with no links emanating from it is referred to as a terminal node. After a terminal node is processed the scan resumes by again seeking the start node.
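The traversal rules described above (seek the start node, try the next nodes in link order at the position following the match, and fall back to the start node when no next node matches or a terminal node has been processed) can be sketched as follows. This is a simplified model, not Data Splitter's implementation: the function run and the dictionary layout are hypothetical, every pattern is assumed to be a non-empty literal string, and actions are left out.

```python
def run(graph, start_tag, data):
    """Return the tags of the nodes recognized, in order.

    graph maps a node tag to (literal pattern, [next-node tags in link order]).
    Patterns are assumed to be non-empty so the pointer always advances.
    """
    recognized = []
    pos = 0
    while True:
        # Scan forward until the start node's pattern is recognized.
        start_pattern = graph[start_tag][0]
        found = data.find(start_pattern, pos)
        if found < 0:
            break  # end of input reached
        current = start_tag
        pos = found + len(start_pattern)  # pointer moves past the match
        recognized.append(current)

        # Follow links until a terminal node or an unmatched next node.
        while True:
            next_tag = None
            for tag in graph[current][1]:  # examine next nodes in link order
                if data.startswith(graph[tag][0], pos):
                    next_tag = tag
                    break
            if next_tag is None:
                break  # "lost": resume by seeking the start node again
            current = next_tag
            pos += len(graph[current][0])
            recognized.append(current)
    return recognized


graph = {
    "Key": ("id=", ["Seven", "Eight"]),
    "Seven": ("7", []),   # terminal node
    "Eight": ("8", []),   # terminal node
}
trace = run(graph, "Key", "id=7;id=9;id=8")
```

In this input the second "id=" is followed by "9", which matches neither next node, so the scan goes back to seeking the start node and picks up the third "id=".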
In order for Data Splitter to produce any results actions must be defined. Actions can be run :
- Before / after a run,
- Before / after processing input streams (files/emails/URLs),
- When a node's pattern is recognized in the input.
Actions manipulate and output data. For example :
Send SearchText (to) MainOutput
In this example, "Send to" is the action descriptor, "SearchText" is the node tag, and "MainOutput" is the output stream tag. This action would cause a portion of the input stream matching the "SearchText" node's pattern to be output to the "MainOutput" stream.
Actions can also be defined to execute user-defined functions (groups of actions, SQL, or custom code in a Dynamic Link Library). Press F1 in the "Action" column of an "Actions" grid to view the available actions.
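As a rough illustration of what the Send action in the example does, the following sketch copies the text matched by a node to a named output stream. The function send and both dictionaries are hypothetical stand-ins chosen for this example; they are not part of Data Splitter.

```python
def send(matches, outputs, node_tag, stream_tag):
    """Append the text matched by node_tag to the output stream stream_tag."""
    outputs.setdefault(stream_tag, []).append(matches[node_tag])


# Text most recently matched by each node, keyed by node tag (assumed layout).
matches = {"SearchText": "fool"}
# Output streams, keyed by stream tag; each stream is modeled as a list.
outputs = {}

# Equivalent in spirit to: Send SearchText (to) MainOutput
send(matches, outputs, "SearchText", "MainOutput")
```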
Data Splitter pattern definitions allow for uncertainty, or ambiguity, in the input by requiring the user to specify a minimum and maximum length for each subpattern. A phone number, for example, might be defined as one to five blanks (or "whitespaces") followed by seven to ten digits. In this simple case, the "PhoneNumber" pattern is composed of two subpatterns: subpattern 1 is "one to five whitespaces" and subpattern 2 is "seven to ten digits". This allows for variations in the input in 1) the number of leading blanks and 2) the number of digits in the phone number itself.
The above example is simple, especially since the two sets that make up the pattern are mutually exclusive, i.e. "whitespace" contains no digits and "digits" contains no whitespace.
Now for something slightly more complex. It is possible to define a set including all possible values. For bytes this set would be the range from 0 to 255. Let's give this set the tag "anything" and use it to define a pattern whose tag is "Anything". (Notice the use of upper and lower case "A" to differentiate the two tags: Data Splitter tags must be unique, but because they are case-sensitive, "anything" and "Anything" count as different tags.) Pattern "Anything" will be defined as 0 to 99999 occurrences of "anything".
Now consider a solution with two nodes, one whose pattern is "Anything", the other whose pattern is a search text string, for example, "fool". The "Anything" node is the start node and is linked to the "fool" node. This is equivalent to instructing Data Splitter to recognize 0 to 99999 "anything"s followed by the text "fool".
Think about this ... if Data Splitter simply examines the current input stream value (character in this case) to determine whether it is a member of the current set, and the set is "anything", it is going to pass over the "fool", since "f", "o" and "l" are all members of "anything"! This is not what Data Splitter does, however: Data Splitter looks ahead to the next node(s) in an ambiguous situation.
This behavior allows Data Splitter to search for data in the input stream. Continuing the above example, a third node whose pattern is the text string "wiseguy" can be defined. A link can be created from the "Anything" node to the "wiseguy" node. With the "Anything" node connected to the other two nodes Data Splitter will search for both "fool" and "wiseguy".
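The lookahead idea can be sketched as a loop that, before consuming each further "anything" value, checks whether one of the linked nodes' strings begins at the current position. How Data Splitter actually performs its lookahead is not described here, so the function below (scan_with_lookahead, a hypothetical name) is only one plausible reading.

```python
def scan_with_lookahead(data, targets, pos=0, max_len=99999):
    """Consume 0..max_len "anything" values, stopping at the first target.

    targets lists the next nodes' strings in link order. Returns
    (index, target) for the first target found, or None if no target
    starts within max_len values of pos.
    """
    limit = min(len(data), pos + max_len)
    for i in range(pos, limit + 1):
        for target in targets:  # look ahead to each next node in link order
            if data.startswith(target, i):
                return i, target
    return None


hit = scan_with_lookahead("a wiseguy and a fool", ["fool", "wiseguy"])
```

With both "fool" and "wiseguy" linked from the "Anything" node, the scan stops at whichever occurs first in the input, here "wiseguy" at index 2.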
There are useful variations on the "anything" theme. For example, a set called "notNewLine" can be defined, containing all values except carriage return and line feed (0-9, 11-12, 14-255, i.e. everything but 10 and 13). This set can be used to define a pattern that can be used to view the input as a sequence of text lines: If the start node's pattern is "NotNewLine" (e.g. 0 to 99999 "notNewLine"s), and other text recognition nodes use the "NotNewLine" pattern, scanning will resume at the start node each time a carriage return or line feed is encountered!
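The "notNewLine" set is easy to write down explicitly. The sketch below builds it from byte values and uses it to break input into lines; the helper name runs is hypothetical, and for simplicity it reports only non-empty runs, whereas a 0 to 99999 pattern could also match an empty line.

```python
CR, LF = 13, 10

# Every byte value except carriage return and line feed:
# 0-9, 11-12, 14-255.
NOT_NEW_LINE = frozenset(range(256)) - {CR, LF}


def runs(data, members):
    """Return each maximal non-empty run of bytes whose values are in members."""
    out = []
    start = None
    for i, value in enumerate(data):  # iterating bytes yields integers
        if value in members:
            if start is None:
                start = i
        elif start is not None:
            out.append(data[start:i])
            start = None
    if start is not None:
        out.append(data[start:])
    return out


lines = runs(b"one\r\ntwo\nthree", NOT_NEW_LINE)
```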
Technically, ambiguity occurs when the minimum length of the last subpattern of a pattern is less than its maximum length. The "Anything" pattern, described above, is ambiguous because its min is zero, its max is 99999, and it contains only one subpattern, which is therefore the last subpattern. If a subpattern's min and max are unequal but it is not the last subpattern, no lookahead occurs (i.e. the following subpattern is not considered). In this case, Data Splitter examines the current input stream value and considers only whether the value is in the current set and whether the maximum number of occurrences for the current subpattern has been attained. For more information see the Ambiguity topic.
The ambiguity of a subpattern can be removed by setting the maximum and minimum length equal.
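The ambiguity rule stated above reduces to a one-line test on the last subpattern. The sketch below assumes a pattern is written as a list of (set tag, min, max) tuples; that layout and the name is_ambiguous are chosen for this example only.

```python
def is_ambiguous(subpatterns):
    """True when the last subpattern's minimum is less than its maximum."""
    if not subpatterns:
        return False
    _set_tag, min_len, max_len = subpatterns[-1]
    return min_len < max_len


anything = [("anything", 0, 99999)]
phone_number = [("whitespace", 1, 5), ("digits", 7, 10)]
# Setting the last subpattern's min and max equal removes the ambiguity.
fixed_phone_number = [("whitespace", 1, 5), ("digits", 7, 7)]
```

Note that only the last subpattern matters for this test: the variable-length "whitespace" subpattern in fixed_phone_number does not make the pattern ambiguous, because it is not the last one.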