Set Machine configuration samples

Title               Source    Target    Purpose
Quotes01            www       database  Scrape stock quotes off the web
Tables01            www       database  Scrape HTML tables off the web
FundRace            www       database  Scrape political contribution data
EMail-Basic         email     database  Load all messages from folders
EMail-FromName      email     database  Identify messages with specified sender names
EMail-FromMail      email     database  Identify messages with specified sender email addresses
EMail-WordPairs     email     HTML      Identify messages with word pairs near each other
EMail-NearTextRTF   email     RTF       Identify messages with word pairs near each other
AmazonOrders        email     database  Parse Amazon order (sold/ship now) messages
AmazonRefunds       email     database  Parse Amazon refund messages
eBay-EndOfAuction   email     database  Parse eBay end-of-auction messages
PayPal-UGotCash     email     database  Parse PayPal "You've Got Cash!" messages
Bouncebacks         email     database  Parse undeliverable email notification messages
Removes             email     database  Parse mail list removal requests
CountKeywords       text      text      Count words / keywords in a group of text files
LineCount           text      text      Count the number of lines in a group of text files
HTMLLines           text      HTML      Replace newline characters with HTML <BR> tags
RTFLines            text      RTF       Replace newline characters with RTF "\line" tokens
NewLine-CRLF-LF     text      text      Replace carriage return / line feeds with line feeds
NewLine-LF-CRLF     text      text      Replace line feeds with carriage return / line feeds
TextSearch          text      text      Search for text in files
TextSearchWide      text      text      Search for text in Unicode / 16-bit text
WholeWordsOnly      text      text      Search for text, whole words only
CSV-DB              CSV text  database  Import comma-separated data to database
ParseXML            XML       database  Parse XML files into a database
PADScan             XML       database  Parse PAD files into a database
RTF2HTML            RTF       HTML      Generate web pages from RTF (Rich Text Format)
SiteMapGen          HTML      HTML      Generate a site map from existing web pages
HTMLUnicodeMapsToC  HTML      C files   Convert ISO8859-to-Unicode maps into C include files
CFunctions          C files   text      Extract function headers from C/C++ source
SearchAndReplace    files     files     Locate and replace data in files
SwapBytes           files     files     Swap bytes in files
FilterJunk          files     files     Extract printable characters from files

See also the following topics :


CFunctions.pxd

Extracts function definition headers from C and C++ source files.   Even works on MFC source.

Contact WWWGrab.com for more information.


CountKeywords.pxd

Counts words and keywords in a group of HTML files.   User must configure the "Keywords" string set.
Included in distribution package.


CountLines.pxd

Computes the line count of a group of text files.   See the tutorial for more information.
Included in distribution package.


CSV-DB.pxd

Imports CSV (comma-separated values) data to database table "csvdata".   Transmits fields 1-13 in the input to fields A-M in the database.   Ignores the first line of input, which is assumed to contain layout information.   Can easily be adapted to import a different number of fields.

Contact WWWGrab.com for more information.
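
The field mapping described above can be pictured with a short Python sketch. This is not how CSV-DB.pxd itself is written; it only illustrates the same behavior under stated assumptions: an SQLite connection stands in for the ODBC database, all columns are stored as text, and the function name import_csv is hypothetical.

```python
import csv
import io
import sqlite3
import string

# Database fields A through M receive input fields 1 through 13.
FIELD_NAMES = list(string.ascii_uppercase[:13])


def import_csv(csv_text, conn):
    """Load rows from csv_text into table csvdata, skipping the first line."""
    columns = ", ".join(f"{name} TEXT" for name in FIELD_NAMES)
    conn.execute(f"CREATE TABLE IF NOT EXISTS csvdata ({columns})")
    placeholders = ", ".join("?" for _ in FIELD_NAMES)

    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # ignore the first line (layout information)

    count = 0
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Assumption: short rows are padded with NULL, extra fields are dropped.
        values = (row + [None] * len(FIELD_NAMES))[:len(FIELD_NAMES)]
        conn.execute(f"INSERT INTO csvdata VALUES ({placeholders})", values)
        count += 1
    conn.commit()
    return count
```

Changing the slice in FIELD_NAMES is all it takes to import a different number of fields in this sketch.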


EMail-Basic.pxd

Loads all messages from the selected folder into database table "message".   Loads basic message items: Sender name/email, recipient name/email, subject, date, the entire text body, and the source message store / folder names.

Contact WWWGrab.com for more information.


EMail-FromName.pxd

Loads messages with selected sender names into database table "message".   Modify the "BeginsWith" and "EndsWith" string sets to filter on sender names that begin with and end with the desired text.

Contact WWWGrab.com for more information.


EMail-FromMail.pxd

Loads messages with selected sender email addresses into database table "message".   Modify the "BeginsWith" and "EndsWith" string sets to filter on sender email addresses that begin with and end with the desired text.

Contact WWWGrab.com for more information.


EMail-NearTextRTF.pxd

Version of EMail-SearchWordPairs.pxd that outputs RTF text.

Contact WWWGrab.com for more information.


EMail-SearchText.pxd

Identifies messages containing any of the text entries listed in the SearchText string set.
Included in distribution package.


EMail-SearchWordPairs.pxd

Identifies messages in which two text strings (a word pair) appear near each other, and creates an HTML file listing them.   Check the screen shot.
Included in distribution package.


EMail-To-Database.pxd

Sample parser for generated emails.   Parses messages in selected folders and transmits selected information to the database.   This sample parses eBay "end of auction" messages and loads a table called "eauction".   See the extracting data from online correspondence topic.
Included in distribution package.


FileSearch.pxd

Searches for one or more text strings in the input files.   Mimics the output of grep.   Check the screen shot.   User must configure the "text to find" string set.   See the tutorial for more information.
Included in distribution package.
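
For readers unfamiliar with grep, the kind of output meant here can be sketched in a few lines of Python. The format shown (file name, line number, matching line) resembles "grep -n" and is an assumption about what the sample prints; search_lines is a hypothetical name.

```python
def search_lines(name, text, needles):
    """Yield 'name:line_number:line' for each line containing any needle."""
    for number, line in enumerate(text.splitlines(), start=1):
        if any(needle in line for needle in needles):
            yield f"{name}:{number}:{line}"


for hit in search_lines("notes.txt", "alpha\nbeta\nalphabet", ["alpha"]):
    print(hit)
```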


FileSearchAndReplace.pxd

Searches for and replaces one or more patterns in the input files.   User must configure the "new text" string set.   See the tutorial for more information.
Included in distribution package.


FilterJunk.pxd

Extracts printable characters (ASCII 32-126) from the input and discards everything else.
Included in distribution package.
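
The filtering rule is simple enough to state in one line of Python. The sketch below assumes the conventional printable ASCII range, codes 32 (space) through 126 (tilde); printable_only is a hypothetical name, and the sample may treat boundary characters differently.

```python
def printable_only(data: bytes) -> bytes:
    """Keep bytes in the printable ASCII range 32..126 and drop the rest."""
    return bytes(b for b in data if 32 <= b <= 126)
```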


GenerateSiteMap.pxd

Generates a site map (web page) from web pages (HTML files) in a directory.   Extracts the TITLE and description META HTML tags for each page.   Used to generate this site's map.
Included in distribution package.
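
A minimal Python sketch of the same idea follows: read each page, pick out the TITLE text and the description META tag, and emit one list item per page. It uses the standard html.parser module rather than Set Machine patterns. The output layout and the names PageInfo and site_map are assumptions made for illustration.

```python
from html import escape
from html.parser import HTMLParser


class PageInfo(HTMLParser):
    """Collects the TITLE text and the description META content of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # tag and attribute names arrive lowercased
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def site_map(pages):
    """pages maps file name to HTML text; returns the site map as an HTML list."""
    items = []
    for name in sorted(pages):
        info = PageInfo()
        info.feed(pages[name])
        items.append(
            f'<li><a href="{escape(name)}">{escape(info.title.strip())}</a> '
            f"{escape(info.description)}</li>"
        )
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```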


HTMLLines.pxd

Replaces newline characters with HTML <BR> tags.   Can be used as a post-processor to preserve newlines when converting to HTML.
Included in distribution package.


HTMLUnicodeMapsToC.pxd

Converts ISO8859-to-Unicode maps to C include files.   Reads the HTML files, filters out the hexadecimal ISO8859-to-Unicode mapping values and formats them so that they can be included and compiled in a C program.   The input maps can be found at :  ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859.

Contact WWWGrab.com for more information.


NewLine-LF-CRLF.pxd

"Fixes" line feed characters: replaces line feeds (0x0A) with carriage return / line feeds (0x0D0A).
Included in distribution package.


NewLine-CRLF-LF.pxd

Replaces carriage return / line feeds (0x0D0A) with line feeds (0x0A).
Included in distribution package.
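
The two newline conversions can be sketched in Python as follows. One detail here is an assumption rather than documented behavior: the LF-to-CRLF direction leaves line feeds that already follow a carriage return alone, so running it on a file that is already in CRLF form does not double the carriage returns. The function names are hypothetical.

```python
import re


def lf_to_crlf(data: bytes) -> bytes:
    """Replace each lone line feed (0x0A) with carriage return / line feed."""
    # The lookbehind skips line feeds that are already part of a CRLF pair.
    return re.sub(rb"(?<!\r)\n", b"\r\n", data)


def crlf_to_lf(data: bytes) -> bytes:
    """Replace each carriage return / line feed pair with a single line feed."""
    return data.replace(b"\r\n", b"\n")
```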


PADScan.pxd

Parses a PAD (Portable Application Description) XML file and loads selected values into a database table.

Contact WWWGrab.com for more information.
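
To give a sense of what "loads selected values into a database table" involves, here is a small Python sketch using the standard XML parser and SQLite in place of Set Machine and ODBC. The three element paths are taken from the commonly published PAD layout and should be checked against the PAD file at hand; the table name "pad", its column names, and the function load_pad are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Assumed PAD element paths, relative to the document root.
FIELDS = {
    "company": "Company_Info/Company_Name",
    "program": "Program_Info/Program_Name",
    "version": "Program_Info/Program_Version",
}


def load_pad(xml_text, conn):
    """Insert the selected PAD values into table pad and return them."""
    root = ET.fromstring(xml_text)
    # Missing elements are stored as empty strings.
    values = [root.findtext(path, default="") for path in FIELDS.values()]
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pad (company TEXT, program TEXT, version TEXT)"
    )
    conn.execute("INSERT INTO pad VALUES (?, ?, ?)", values)
    conn.commit()
    return values
```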


ParseXML.pxd

XML parsing sample.   See the XML Parsing topic for more information.


RTF2HTML.pxd

Generates web content (HTML) from RTF (rich text format) files.   Does a decent job of converting the old Set Machine RTF help file to HTML.   May require modification for other RTF files!   Builds links and a separate index file too.

Contact WWWGrab.com for more information.


RTFLines.pxd

Replaces newline characters with RTF "\line" tokens.

Contact WWWGrab.com for more information.


SwapBytes.pxd

Swaps consecutive bytes in a group of files.   Big-endian to little-endian and vice-versa.

This application of Set Machine is trivial (and a little silly), but it works.   It is also inefficient, because Set Machine examines the value of every byte, which this task does not require.

Note that it handles the leftover byte if, for some reason, this transformation is performed on a file with an odd byte count.   Processing the odd last byte allows for round-trip conversions that return the input files to their starting states.

Contact WWWGrab.com for more information.
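
The byte swap and its round-trip property can be shown with a short Python sketch. It assumes that the leftover byte of an odd-length file is copied through unchanged, which is one way to make a second pass restore the original data; swap_bytes is a hypothetical name.

```python
def swap_bytes(data: bytes) -> bytes:
    """Swap each pair of consecutive bytes; an odd last byte is left in place."""
    out = bytearray(data)
    even = len(out) - (len(out) % 2)  # length of the part that forms whole pairs
    # Exchange the bytes at even offsets with the bytes at odd offsets.
    out[0:even:2], out[1:even:2] = out[1:even:2], out[0:even:2]
    return bytes(out)
```

Applying swap_bytes twice returns the input, whether the length is even or odd.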


TextSearchWide.pxd

Performs wide-string (Unicode) text search.

Contact WWWGrab.com for more information.
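
A wide-string search differs from a plain search in that each character occupies two bytes. The Python sketch below assumes little-endian UTF-16 data (the usual layout on Windows) that starts on a 2-byte boundary, and it rejects matches at odd offsets, which would straddle two characters. find_wide is a hypothetical name.

```python
def find_wide(data: bytes, text: str) -> list:
    """Return byte offsets where text occurs in data as UTF-16LE."""
    needle = text.encode("utf-16-le")
    offsets = []
    start = data.find(needle)
    while start != -1:
        if start % 2 == 0:  # keep only matches aligned on a character boundary
            offsets.append(start)
        start = data.find(needle, start + 1)
    return offsets
```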


WholeWordsOnly.pxd

Performs a "whole-word-only" search for a string.

Contact WWWGrab.com for more information.
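
In regular-expression terms, a whole-word search amounts to requiring a word boundary on both sides of the search string. The Python sketch below shows that idea; it is an illustration, not the pattern set used by the sample, and find_whole_word is a hypothetical name.

```python
import re


def find_whole_word(text, word):
    """Return the offsets at which word occurs as a whole word in text."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return [match.start() for match in pattern.finditer(text)]
```

With this definition, searching "cat concat cat." for "cat" finds the first and last words but not the "cat" inside "concat".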


Legacy data conversion

Set Machine can be configured to transform many legacy data formats into databases, XML, or other formats.

Examples

Set Machine has been used to :

Contact WWWGrab.com for more information.


Set Machine technical notes

Set Machine can be configured to perform a practically infinite variety of transformation / parsing / filtering tasks, on both stored emails and files.   A number of capabilities not currently included :

... will be implemented in future releases.

Set Machine has been thoroughly tested on Windows XP and Windows Vista.   Preliminary checks indicate that it functions correctly on other 32-bit Windows systems.

View the PAD file:  setmachine.htm   (XML version: setmachine.xml)


System requirements

32-bit Windows, a Pentium processor, and 5 MB of available hard disk space.   WWWGrab requires an Internet connection.   The WWWGrab samples require Microsoft Access.

To use the database output actions, a DBMS with a suitable ODBC driver is required.   Most DBMSs available on MS Windows platforms meet this requirement.

To read messages (emails), Set Machine requires that MAPI (Messaging Application Programming Interface) be installed.   Most Windows platforms with installed email clients (e.g. MS Outlook) meet this requirement.


Debugging Set Machine

Pressing Ctrl-D activates debug mode; pressing Ctrl-D again deactivates it.   Activating debug mode calls the extension DLL with UserIndex = 99 :

The Development Tools sample PXX.CPP responds to UserIndex 99 by displaying a dialog box that shows recognized patterns and other information.   Build your own extension DLL with the Set Machine Development Tools ...


Set Machine development tools

Download the development tools   (self-installing executable, 1.3 MB)

Development tools files :

smc.exe   The command line version of Set Machine
setmachine.dll   Set Machine library
smlib.h   setmachine.dll function interface definition
pxx.h   Set Machine User-Defined Function (UDF) interface definition
ipx.h   IPX2 interface definition (required by smlib.h)
ixxinbuf.h   IXXINBUF (input buffer) interface definition (required by pxx.h)
xxdefs.h   Basic #defines, typedefs, etcetera (required by ixxinbuf.h)
pxx.cpp   Sample UDF implementation
smc.cpp   Source code for smc.exe (illustrates use of IPX2 interface to setmachine.dll)
smclient.cpp   Sample setmachine.dll command-line client
pxdb.dll   Database interface DLL (required by setmachine.dll)
fsel.dll   Message system interface DLL (required by setmachine.dll)
ReadMe.txt   Description of the Development Tools files

SetMachine.DLL exports two functions :

SetMachine.DLL requires prior installation of SetMachine.EXE.   SetMachine.DLL also requires registration after the 30-day evaluation period.

SMC.EXE is the command line version of SetMachine.   It accepts a single command line argument, the .PXD file to run, and calls SetMachine.DLL.




Copyright © 2002-2008 WWWGrab.com.   All Rights Reserved.