| Set Machine home | |
| Download | |
| Register | |
| Tutorial | |
| Help | |
| Site map | |
| Contact info |
| What's the idea? |
| What's parsing? |
| State machines |
| Regular expressions |
Set Machine's design is based on a patented invention: "Configurable Pattern Recognition and Filtering Tool". The fundamental building block is called a subpattern. It consists of :
Examples of sets :
The minimum number of occurrences can be zero or more; the maximum must be greater than or equal to the minimum.
Examples of subpatterns :
These subpatterns can be linked together in any order. One of the subpatterns is designated as the "start node". As the input is scanned the machine moves from one subpattern to the next, deciding at certain points that subpatterns have been recognized in the input. When this recognition occurs actions can be performed.
Examples of actions :
It is possible using this scheme to perform a wide variety of useful data transformation tasks.
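To make the subpattern idea concrete, here is a minimal sketch in Python. It is not Set Machine's actual implementation or configuration format; the names `Subpattern`, `next_node` and `run` are hypothetical, and the sketch assumes a simple linear chain of subpatterns scanned greedily without backtracking, whereas Set Machine allows subpatterns to be linked in any order.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Subpattern:
    """A set of values plus a minimum and maximum number of occurrences."""
    name: str
    members: frozenset            # the set of accepted characters
    minimum: int                  # zero or more
    maximum: int                  # greater than or equal to minimum
    action: Optional[Callable[[str], None]] = None   # run on recognition
    next_node: Optional["Subpattern"] = None          # link to the next subpattern

    def __post_init__(self):
        if self.minimum < 0 or self.maximum < self.minimum:
            raise ValueError("need 0 <= minimum <= maximum")


def run(start: Subpattern, text: str) -> bool:
    """Scan text from the start node.

    Returns True if every subpattern in the chain is recognized and all
    of the input is consumed. Greedy, with no backtracking (an assumption
    made to keep the sketch short).
    """
    pos = 0
    node = start
    while node is not None:
        begin = pos
        # Consume members of the set, up to the maximum count.
        while (pos < len(text)
               and pos - begin < node.maximum
               and text[pos] in node.members):
            pos += 1
        if pos - begin < node.minimum:
            return False          # subpattern not recognized
        if node.action is not None:
            node.action(text[begin:pos])   # perform the action on recognition
        node = node.next_node
    return pos == len(text)


captured = []
digits = Subpattern("digits", frozenset("0123456789"), 2, 4,
                    action=captured.append)
letters = Subpattern("letters", frozenset("abcdefghijklmnopqrstuvwxyz"), 1, 8,
                     action=captured.append, next_node=digits)
print(run(letters, "abc123"), captured)   # True ['abc', '123']
```

Here `letters` plays the role of the "start node", and the action attached to each subpattern simply records the text that was recognized.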
Set Machine's design is based on the idea that many data transformation tasks (searches, conversions, extractions, parsing, and so on) involve the same fundamental repetitive process :
Set Machine is very general. It knows about sets and patterns, states and transitions, and views the input as a stream of values (e.g. bytes or characters). It has no internal knowledge of XML, HTML, RTF, or even text files; it can be configured to work with all of them. The details of the transformation task are specified in the configuration (definition).
Another way to look at it: the task-specific logic is contained in the configuration instead of the program itself. The user has several options :
Of course, there are limits to what Set Machine can do. The user has the option of extending Set Machine's capability with user-defined functions in a custom DLL. See Set Machine help and Development tools for more information.
Parsing a stream of data means breaking it down into component parts according to a set of rules.
Parsing programs typically check each character in a data stream and group the characters into units known as tokens. What constitutes a token can differ from one program to the next, or from one set of grammatical rules to the next. With Set Machine the tokens are entirely user-defined.
In a web page, the tokens would typically be HTML tags (<TABLE>, for example) and the data between the tags.
In an email the tokens are typically labels ("Subject:", for example) and their associated data.
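As a rough illustration of "checking each character and grouping the characters into tokens", here is a small Python sketch for the web page case. The function name `tokenize` and the token labels "tag" and "data" are assumptions made for this example; this is not how Set Machine is configured.

```python
def tokenize(stream):
    """Group characters into ("tag", text) and ("data", text) tokens."""
    tokens = []
    buf = ""
    in_tag = False
    for ch in stream:
        if ch == "<" and not in_tag:
            if buf:
                tokens.append(("data", buf))   # data seen before this tag
            buf = ch
            in_tag = True
        elif ch == ">" and in_tag:
            tokens.append(("tag", buf + ch))   # a complete tag
            buf = ""
            in_tag = False
        else:
            buf += ch
    if buf:
        # Leftover characters (including an unterminated tag) are kept as data.
        tokens.append(("data", buf))
    return tokens


for kind, text in tokenize("<TABLE><TR>cell</TR></TABLE>"):
    print(kind, text)
```

The sketch deliberately ignores real-world HTML details such as comments and quoted attribute values.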
Many programming systems use regular expressions to parse data. Here's a regular expression for parsing an email address :
'^[a-zA-Z0-9_\.\-]+@[a-zA-Z0-9\-]+\.[a-zA-Z0-9\-\.]+$'
This means - in a nutshell - characters with an at sign (@) in the middle. This part :
a-zA-Z0-9
means alphanumeric characters: A-Z upper- and lower-case, plus digits.
With Set Machine you would define the alphanumeric set one time, calling it "alphanum" or something similar.
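The difference can be sketched in Python. The first part uses the regular expression above with Python's standard `re` module. The second part defines the alphanumeric set once and reuses it; the names `ALPHANUM`, `LOCAL_PART`, `DOMAIN_LABEL`, `DOMAIN_REST` and `looks_like_email` are hypothetical, and this is plain Python rather than Set Machine's own definition format.

```python
import re
import string

# The regular expression from the text, unchanged.
EMAIL_RE = re.compile(r'^[a-zA-Z0-9_\.\-]+@[a-zA-Z0-9\-]+\.[a-zA-Z0-9\-\.]+$')

# The alphanumeric set, defined one time and reused below.
ALPHANUM = frozenset(string.ascii_letters + string.digits)
LOCAL_PART = ALPHANUM | {"_", ".", "-"}     # characters allowed before the @
DOMAIN_LABEL = ALPHANUM | {"-"}             # characters allowed before the first period
DOMAIN_REST = ALPHANUM | {"-", "."}         # characters allowed after it


def looks_like_email(text):
    """Set-based check intended to mirror the regular expression above."""
    local, at, domain = text.partition("@")
    first, dot, rest = domain.partition(".")
    return (bool(at) and bool(dot)
            and bool(local) and bool(first) and bool(rest)
            and set(local) <= LOCAL_PART
            and set(first) <= DOMAIN_LABEL
            and set(rest) <= DOMAIN_REST)


for candidate in ["john.doe@example.com", "no-at-sign.example.com"]:
    print(candidate, bool(EMAIL_RE.match(candidate)), looks_like_email(candidate))
```

Note that no backslashes are needed in the set definitions: the period and hyphen are ordinary members of a set.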
Set Machine has no special characters to worry about. With regular expressions some characters have special meanings, so if you need to handle these special characters in your input you have to "escape" them with the backslash ('\') character. The period is a special character; it can also appear in email addresses - that's why you see it, escaped :
\.
three times in the above regular expression. Set Machine's design avoids the special-character issue entirely.
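A short Python illustration of the special-character issue, using the standard `re` module. The helper name `contains_literal` is made up for this example.

```python
import re

# Unescaped, the period matches any single character.
print(bool(re.fullmatch(r"a.c", "abc")))     # True
# Escaped, it matches only a literal period.
print(bool(re.fullmatch(r"a\.c", "abc")))    # False
print(bool(re.fullmatch(r"a\.c", "a.c")))    # True


def contains_literal(needle, haystack):
    """Search for needle as plain text; re.escape adds the backslashes."""
    return re.search(re.escape(needle), haystack) is not None


print(contains_literal(".", "abc"))          # False
print(contains_literal(".", "a.c"))          # True
```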
Set Machine provides an alternative to regular expressions. For that reason, support for regular expressions is not planned for any future release.
This is a little technical, but it is helpful to understand the general idea: A state machine can consider the overall structure of an "input stream", for example a file or an email message :
A state machine -
It knows where it's been.
Consider email messages. In general, an email is composed of a header followed by a body. So, the first state entered when scanning an email can be the "header" state, followed by the "body" state. Within the header state there can be a state for each component: the "subject" state, the "from" state, the "date" state, and so on.
To illustrate the importance of state-awareness, consider a message format with "From" and "To" addresses :
From:
Address:
…
To:
Address:
Simply looking for "Address:" isn't enough; you have to know which address you're dealing with, i.e. whether it's "From" or "To". Set Machine's state-aware design can handle this type of format easily. Some parsing utilities can't handle this situation or have to be specially "rigged" to do so.
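Here is a minimal Python sketch of state-aware scanning for this kind of format. It is only an illustration of the general idea, not Set Machine's configuration; the function name `parse_addresses` and the sample addresses are hypothetical.

```python
def parse_addresses(lines):
    """Return a dict mapping "From" / "To" to the address that follows it."""
    state = None          # which section we are currently in
    result = {}
    for line in lines:
        line = line.strip()
        if line in ("From:", "To:"):
            state = line[:-1]              # remember where we've been
        elif line.startswith("Address:") and state is not None:
            # The same label means different things depending on the state.
            result[state] = line[len("Address:"):].strip()
    return result


message = [
    "From:",
    "Address: alice@example.com",   # hypothetical sample data
    "To:",
    "Address: bob@example.com",     # hypothetical sample data
]
print(parse_addresses(message))
```

Without the `state` variable, both "Address:" lines would look identical and there would be no way to tell the sender from the recipient.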
Set Machine allows configuration of a state machine. All computer programs are themselves state machines, but only a few parsing utilities (Yacc, for example) allow programming of overall state machine behavior. Set Machine is unique in that it provides (requires!) a pictorial representation of the state machine that does the job.