Description of Data Splitter
Data Splitter is a data extraction and organization tool. It consists of two principal software components :
- A data transformation engine,
- A configuration facility for defining the data required by the engine.
The Data Splitter configuration facility
Data Splitter enables the user to define the data required by the engine :
- PGraphs (composed of Nodes and Links) - see also the Node Group topic
- Input streams
- Output streams
- ... along with other data.
A solution created in Data Splitter can be saved to a file (.DSS file) for later use. A free-format textual comment can be added to the solution (menu option Solution | Comment).
When the appropriate data items have been defined, the engine can be started with the Run command.
A set is a collection of values. It is defined by the user as a sequence of individual values and value ranges. Data Splitter currently supports sets of byte values (valid range: 0-255) and double-byte values (valid range: 0-65535). Every set is identified by a unique user-defined textual identifier, or tag.
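As an illustration of how a set definition made of individual values and value ranges expands into its member values, here is a minimal Python sketch. It does not show Data Splitter's own format or code: the function name build_set, the use of tuples for ranges, and the max_value parameter are all assumptions made for the example.

```python
def build_set(items, max_value=255):
    """Expand single values and inclusive (low, high) ranges into a frozenset.

    Hypothetical helper: max_value=255 models a byte set,
    max_value=65535 a double-byte set.
    """
    members = set()
    for item in items:
        if isinstance(item, tuple):
            low, high = item
        else:
            low = high = item
        if not (0 <= low <= high <= max_value):
            raise ValueError(f"value or range out of bounds: {item!r}")
        members.update(range(low, high + 1))
    return frozenset(members)


# A byte set holding the ASCII digits plus the codes for '+' and '-'.
digits_and_signs = build_set([(0x30, 0x39), 0x2B, 0x2D])
```

Passing max_value=65535 gives the double-byte variant; any value outside the valid range is rejected rather than silently clipped.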
A pattern is a sequence of subpatterns. A subpattern is defined by :
- an associated set
- a minimum length
- a maximum length
A subpattern is said to be recognized if a run of consecutive input stream values, starting at the current input stream location and at least as long as the subpattern's minimum length, consists entirely of elements of the subpattern's associated set. Input stream values following that run, that are also elements of the subpattern's associated set, may be included as part of the recognized subpattern up to and including the subpattern's maximum length. The "may be" in the preceding sentence is elaborated upon in the Ambiguity topic.
A pattern is said to be recognized if all of its subpatterns are recognized, sequentially, in the input stream. See the Pattern topic for more information.
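The recognition rules for subpatterns and patterns can be sketched as follows. This is a simplified illustration, not the engine's implementation: it assumes each subpattern is a (value_set, min_len, max_len) tuple, and it always lets a subpattern take as many values as it can, which is only one possible resolution of the "may be" discussed in the Ambiguity topic. The name match_pattern is hypothetical.

```python
def match_pattern(data, pos, subpatterns):
    """Return the position just past the pattern if it is recognized at pos, else None.

    Each subpattern is a (value_set, min_len, max_len) tuple. Every subpattern
    takes as many set members as are available, up to max_len, and fails if
    fewer than min_len are found.
    """
    for value_set, min_len, max_len in subpatterns:
        count = 0
        while (count < max_len
               and pos + count < len(data)
               and data[pos + count] in value_set):
            count += 1
        if count < min_len:
            return None
        pos += count
    return pos


letters = frozenset(range(ord("A"), ord("Z") + 1))
digits = frozenset(range(ord("0"), ord("9") + 1))

# Pattern: two or three uppercase letters followed by one to four digits.
pattern = [(letters, 2, 3), (digits, 1, 4)]
end = match_pattern(b"AB123;", 0, pattern)   # the pattern covers "AB123"
```

With the input "A123" the same pattern is not recognized, because only one letter is available and the first subpattern needs at least two.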
A text string is a special case of a pattern - a sequence of characters that can be typed in at a computer keyboard. Data Splitter allows strings to be used as patterns. This special support of string definition is provided so that a simple character string pattern need not be defined (cumbersomely) as a sequence of subpatterns. A string is recognized if it occurs at the current input stream location. Case-sensitivity can be enabled or disabled for string recognition.
A variable can also be used as a subpattern's associated set. The subpattern is recognized if the variable is matched at the current location in the input stream. The "Ignore case" setting for the variable is used in this situation.
A pattern of zero maximum length can be defined. This pattern matches either the beginning or end of the input stream, depending on the input stream location. A zero-length pattern doesn't match any input stream values; it is considered recognized before any input stream values have been examined and after all input stream values have been examined.
Every pattern is identified by a unique user-defined tag.
Engine processing is driven by PGraphs, which are composed of one or more user-defined nodes connected to each other by user-defined links. Any two nodes in the graph can be connected by a link. A link connects exactly two nodes and is directional, so that for each link one node is the "from" node and the other is the "to" node. A pair of nodes A and B can be linked "circularly", i.e. there can be a link from A to B, and from B to A. There is no requirement that all nodes be linked somehow to other nodes. Single nodes or connected groups of nodes can be left disconnected from other nodes/groups ("parked"), awaiting activation by the creation of a link or by the "assign start node" action.
At any time during a run exactly one node is the current node. At the start of a run the current node is set to the user-defined start node. During a run the PGraph is traversed: one of the nodes the current node is linked to may become the new current node, depending on the next node pattern recognized in the input stream.
Exactly one node in the PGraph is defined as the start node, which identifies the starting point of the PGraph traversal. The start node can be assigned / reassigned by the "assign start node" action during a run. There is always a start node: if the user does not define it the default is the first node created.
A Data Splitter solution consists of a single "main graph" and zero or more optional Node Groups. The main graph is where processing starts, i.e. it is the controlling PGraph. The node groups are used by the main graph and each other.
A node has associated with it exactly one pattern (or string). This node-pattern association is user-defined. A node is said to be recognized if its associated pattern/string has been recognized in the input stream.
A node can be identified by a user-defined tag. This tag is used to refer to the node elsewhere in the solution. The tag on a node is optional, however, and need not be unique. If a group of two or more nodes shares the same tag, that tag refers to the node in that group that was most recently recognized in the input stream. If none of the nodes grouped by a non-unique tag has been recognized in the current input stream, that tag refers to a null node whose pattern length is zero.
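The behaviour of a tag shared by several nodes can be pictured with a small Python sketch. It covers only the shared-tag case described above, and everything in it is an assumption made for illustration: the class name SharedTagResolver, the use of node numbers to stand for nodes, None as the stand-in for the null node, and the choice to forget recognition history at the start of each input stream.

```python
NULL_NODE = None  # stand-in for the null node whose pattern length is zero


class SharedTagResolver:
    """Tracks which node a shared (non-unique) tag currently refers to."""

    def __init__(self):
        self._latest = {}  # tag -> node most recently recognized under that tag

    def begin_stream(self):
        # Assumption: "recognized in the current input stream" means the
        # history starts empty for every new stream.
        self._latest.clear()

    def node_recognized(self, tag, node_number):
        self._latest[tag] = node_number

    def resolve(self, tag):
        return self._latest.get(tag, NULL_NODE)


resolver = SharedTagResolver()
resolver.begin_stream()
before = resolver.resolve("field")      # nothing recognized yet: null node
resolver.node_recognized("field", 3)
resolver.node_recognized("field", 7)
after = resolver.resolve("field")       # the most recently recognized node
```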
A node has associated with it zero or more user-defined actions, which determine processing to be performed when the node is recognized in the input stream.
Every node has an ordinal number (node number). This number is unique for each node, ranging from one to the number of nodes in the PGraph. Data Splitter provides defaults for the node numbers (order of node creation), but the user can also define the node numbers. Node numbers can be used to set the "tabbing order" of the nodes. The node numbers currently have no effect on the operation of the engine - they are used solely for user-entry convenience (tabbing).
A link connects two nodes. It is directional, establishing a "from" and a "to" node. This direction determines the order of PGraph node traversal during a run.
Every link has a number. Links emanating from a given node are numbered uniquely. Link numbers establish the order in which the "to" nodes linked from a given node are considered when the engine chooses the next current node during PGraph traversal.
Data Splitter provides defaults for the link numbers (order of link creation), but the link number can also be defined by the user. If the user specifies link number N for a link, and N has already been assigned to another link emanating from the same node, Data Splitter will increment by one the number of all links emanating from that node whose link number is greater than or equal to N before assigning N to the newly-numbered link, thereby preserving link number uniqueness.
Data Splitter prevents the creation of duplicate links. A duplicate link is one that has the same "from" and "to" nodes as an existing link.
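The renumbering and duplicate-link rules can be sketched in Python as shown here. This is an illustration under stated assumptions, not Data Splitter's code: the class NodeLinks and its methods are hypothetical, it models the links leaving a single "from" node, and the default number is taken to be one more than the highest number in use.

```python
class NodeLinks:
    """Outgoing links of one "from" node, keyed by link number."""

    def __init__(self):
        self.by_number = {}  # link number -> "to" node

    def add(self, to_node, number=None):
        # A link with the same "from" and "to" nodes is a duplicate.
        if to_node in self.by_number.values():
            raise ValueError(f"duplicate link to {to_node!r}")
        if number is None:
            # Assumed default: next number in order of creation.
            number = max(self.by_number, default=0) + 1
        elif number in self.by_number:
            # N is taken: shift every link numbered >= N up by one.
            self.by_number = {
                (n + 1 if n >= number else n): target
                for n, target in self.by_number.items()
            }
        self.by_number[number] = to_node

    def in_order(self):
        """The "to" nodes in link-number order."""
        return [self.by_number[n] for n in sorted(self.by_number)]


links = NodeLinks()
links.add("B")            # becomes link 1
links.add("C")            # becomes link 2
links.add("D", number=1)  # B and C shift to 2 and 3, D takes 1
```

After these three calls the link-number order is D, B, C, and a second attempt to link to B is rejected.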
A variable is a data item that can be assigned an initial value and reassigned values during a run. Currently supported variable types include :
- ASCII strings
- Unicode (wide) strings
- Byte sequences
Every variable is identified by a unique user-defined tag.
A counter is a number (integer) that can be initialized, incremented and used in actions.
Actions can be invoked at various points during a run : before and/or after the run itself, before and/or after scanning an input stream, and when a node is recognized. Actions are user-defined but are derived from a predefined list of available actions, including :
- Transmit data to a file or database field
- Set or clear a variable
- Set or increment a counter
- Execute a user-defined group of actions (a procedure)
- Execute a user-defined action (DLL function or SQL statement)
- Assign/reassign the start node
- Halt processing of the current input stream
- Halt the run
Data that can be transmitted include :
- recognized data from the input stream
- special data (e.g. current file name, date/time)
- ASCII or Unicode strings
The data transmission actions will be enhanced in future releases to allow sending output to, for example, the Windows clipboard and other destinations.
The engine processes (scans) a single input stream at a time. A run scans a sequence of input streams. Data Splitter supports definition of an input stream sequence, containing one or more stream specifications ("Input/Output" menu, "Input" dialogs).
The engine can output data to any number of streams. Every output stream is identified by a unique user-defined tag and has an access mode (e.g. overwrite or append).
PGraph graphical definition
Data Splitter supports graphical definition of the PGraph.
Note: in the following discussion, the term "click" refers to a single click of the left mouse button.
Nodes can be placed on the Data Splitter drawing surface (Windows client area) by double-clicking at the desired node location. A node can be selected for modification by single-clicking on its graphic. The node selection can also be changed by pressing the Tab key, which will cause the node selection to "traverse" the PGraph in node number order, ascending or descending number order for right or left tab, respectively. A node's property (attribute) menu can be accessed by selecting the node and pressing the Enter key, or by double-clicking on the node's graphic.
A node can be moved on the drawing surface by selecting it and dragging it with the left mouse button held down. A node can also be moved by selecting it and using the left/right/up/down arrow keys to perform fine adjustments of the node location. A node can be deleted by selecting it and pressing the Delete key, or by accessing the node's property menu and selecting the "Delete" menu option. Node properties (node tag, pattern, string, "links to", start node, actions, number, etc.) can be modified using dialogs accessed via options in the node's property menu.
Node links are defined by selecting the "Link to" option of the "from" node property menu. "Link to" initiates a drawing operation, creating a line originating at the selected node (the "from" node) and terminating at the mouse location. The line termination point follows mouse movements until a "to" node for the link is selected. The "to" node for the link is selected by single-clicking on any other node in the PGraph. The link drawing operation can be cancelled at any time by single-clicking anywhere but on another node. The user merely chooses the two nodes to connect by a link; Data Splitter determines how to draw the link once the "to" node is selected.
Every link has a graphical representation, drawn as connected line segments (3, currently) with an arrow adjoining and pointing to the "to" node. A small box is drawn near the midpoint of the link graphic; this box displays the link's number. Link selection is performed in a similar manner to node selection: clicking on the link's box graphic selects the link; a link's property (attribute) menu is accessed by selecting the link and pressing the Enter key, or by double-clicking on the link's box graphic. A link can be deleted by selecting it and pressing the Delete key, or by accessing the link's property menu and selecting the "Delete" menu option. If a node is deleted all links to and from that node are automatically deleted. Link properties (link number, etc.) can be modified using dialogs accessed via the link's property menu. Links are automatically redrawn when nodes are moved or deleted by the user.
User input checking
Data Splitter enforces, and facilitates, the correctness and consistency of user input. Data items that refer to other data items (via tags, usually) must be defined after the items they refer to. The user is not allowed to configure a reference to an undefined tag. Context-sensitive help is available wherever, for example, a tag must be chosen: upon requesting help (by pressing F1) the user is presented with a list displaying the appropriate tags available for use in the current data entry context.
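The "define before use" rule, and the list of tags offered by context-sensitive help, can be pictured with a short Python sketch. It is not how Data Splitter is implemented: the class TagRegistry, the kind names "set" and "pattern", and the method names are assumptions, and it models only item kinds whose tags must be unique.

```python
class TagRegistry:
    """Records defined tags per kind and refuses references to undefined ones."""

    def __init__(self):
        self._defined = {}  # kind -> set of tags defined so far

    def define(self, kind, tag, refers_to=()):
        # Every (kind, tag) reference must already be defined.
        for ref_kind, ref_tag in refers_to:
            if ref_tag not in self._defined.get(ref_kind, set()):
                raise KeyError(f"undefined {ref_kind} tag: {ref_tag!r}")
        tags = self._defined.setdefault(kind, set())
        if tag in tags:
            raise ValueError(f"{kind} tag already in use: {tag!r}")
        tags.add(tag)

    def available(self, kind):
        """Tags that could be listed when the user asks for help in this context."""
        return sorted(self._defined.get(kind, set()))


registry = TagRegistry()
registry.define("set", "digits")
registry.define("set", "letters")
registry.define("pattern", "number", refers_to=[("set", "digits")])
choices = registry.available("set")
```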
During a run the data transformation engine scans the user-defined sequence of input streams, one stream at a time. The current input stream is "opened" and the engine sets its input stream location (the current position within the input stream, referred to hereafter as the stream location) to the beginning of the input stream, i.e. position zero. The current node is also set to the start node.
At this point PGraph traversal begins. The stream location is advanced until the start node's pattern is recognized, or until the end of the input stream is encountered.
When the current node is recognized (i.e. when the pattern associated with the current node is recognized at the current input stream location) the engine does several things :
- It performs the actions associated with the current node,
- It attempts to determine the next node to become the current node in the graph traversal.
Performing the actions simply means executing in sequence the list of actions (if there are any) associated with the current node.
To select the next current node, the engine considers the "to" nodes of each link emanating "from" the current node. Each of these "to" nodes is a potential next node. The potential next nodes are examined in link number order. When one of these potential next nodes is recognized in the input stream immediately following the values recognized as the current node's pattern, that potential next node becomes the current node. If no potential next node is recognized the start node is selected as the next current node and the engine once again scans the input stream in an attempt to recognize the start node.
Graph traversal for the current input stream ends when the stream location can no longer be advanced within the input stream, i.e. when the end of the input stream is encountered. Graph traversal can also be terminated by user-defined "halt" actions.
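The traversal just described can be condensed into a short Python sketch. It is a deliberately reduced model and not the engine itself: every node recognizes a fixed, non-empty byte string rather than a general pattern, a node's actions are replaced by recording the node name and offset, and the names run_stream, nodes and links are hypothetical. It also assumes that, after a failed attempt to find a next node, scanning for the start node resumes at the current stream location.

```python
def run_stream(data, nodes, links, start):
    """Traverse the graph over data; return the (node, offset) recognitions.

    nodes: node name -> bytes literal recognized by that node.
    links: node name -> list of "to" node names in link-number order.
    """
    if any(len(literal) == 0 for literal in nodes.values()):
        raise ValueError("this sketch requires non-empty literals")
    recognized = []
    pos = 0
    while True:
        # Advance the stream location until the start node is recognized.
        hit = data.find(nodes[start], pos)
        if hit < 0:
            break  # end of stream reached without recognizing the start node
        current, pos = start, hit
        while current is not None:
            recognized.append((current, pos))  # stand-in for the node's actions
            pos += len(nodes[current])
            # First potential next node, in link order, recognized at pos.
            current = next(
                (to for to in links.get(current, ())
                 if data.startswith(nodes[to], pos)),
                None,
            )
    return recognized


nodes = {"key": b"k=", "num": b"42", "word": b"ab"}
links = {"key": ["num", "word"]}
trace = run_stream(b"xxk=42..k=ab.k=zz", nodes, links, "key")
```

In this example the third occurrence of "k=" is followed by neither "42" nor "ab", so only the start node is recorded for it and the engine goes back to scanning for the start node.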
Every subpattern of a pattern has a minimum and maximum length. If a subpattern's maximum length exceeds its minimum length a potential ambiguity arises. Beyond the minimum length, and up to and including the maximum length, input stream values contained in the current subpattern's set may also be contained in the next subpattern's set (if there is a next subpattern), or may be recognized as (the start of) one of the potential next nodes' patterns (if there are any potential next nodes). The question arises: should these ambiguous input stream values be "assigned to" the current subpattern, or to the next subpattern / potential next node to which they may also be assigned? In other words, when should the engine, in such an ambiguous situation, make the transition from the current subpattern to the next possible state? Among the possibilities are :
- Making the transition to the next state as soon as possible (ASAP),
- Making the transition as late as possible (ALAP).
The engine can support both "ASAP" and "ALAP" transition modes. The current version makes transitions ALAP if the next state is a subpattern (i.e. if the subpattern being examined is not the last one in its containing pattern). Transitions are ASAP if the next state is a potential next node (i.e. if the subpattern being examined is the last one in its containing pattern). Experimentation has shown this combination to be effective, but options may be provided to make these transition modes user-definable in a future release of Data Splitter.
"Transition ALAP", unlike "ASAP", regards only the value at the current input stream location, i.e. doesn't look ahead. Input stream values are associated with the current subpattern until one of the following becomes true :
- The input stream value is not a member of the subpattern's set,
- The subpattern's maximum length has been attained,
- The end of the input stream has been reached.
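The difference between the two transition modes can be made concrete with a small Python sketch. It is an approximation for illustration only: the function names consume_alap and consume_asap are hypothetical, the next state in the ASAP case is reduced to a fixed byte string, and consume_asap returns None when that byte string is never found, which simplifies whatever the engine actually does in that situation.

```python
def consume_alap(data, pos, value_set, min_len, max_len):
    """ALAP: keep values while they are in the set, up to max_len.

    Looks only at the value at the current location. Returns the end
    position, or None if fewer than min_len values were taken.
    """
    end = pos
    while end - pos < max_len and end < len(data) and data[end] in value_set:
        end += 1
    return end if end - pos >= min_len else None


def consume_asap(data, pos, value_set, min_len, max_len, next_literal):
    """ASAP: stop at the first offset, at least min_len in, where next_literal starts.

    Returns that offset, or None if next_literal is not found before the
    subpattern can no longer be extended.
    """
    end = pos
    while True:
        if end - pos >= min_len and data.startswith(next_literal, end):
            return end
        if end - pos >= max_len or end >= len(data) or data[end] not in value_set:
            return None
        end += 1


anything = frozenset(range(256))
data = b"abc<END>xyz<END>"
alap_end = consume_alap(data, 0, anything, 0, 100)            # runs to the end of the data
asap_end = consume_asap(data, 0, anything, 0, 100, b"<END>")  # stops before the first "<END>"
```

With a set that contains every value, ALAP swallows the whole input, while ASAP hands over to the next state at the first place where that state can be recognized.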
"Transition ASAP" between nodes allows, for example, definition of a subpattern of "anythings" (with a set containing all possible values) followed by a next state (potential next node) whose pattern is specific, thereby instructing the engine, effectively, to look ahead and seek the specific pattern.