| Set Machine home | |
| Download | |
| Register | |
| Tutorial | |
| Help | |
| Site map | |
| Contact info |
| Overview |
| Setup |
| Running WWWGrab |
| DSN configuration |
| HTML parsing |
| Options |
| Stock quote scraper |
| Rental listing scraper |
| HTML table scraper |
WWWGrab is a Windows program that scans lists of web page URLs (Uniform Resource Locators), fetching the data at each one and parsing the data with the Set Machine data transformer. What's required :
WWWGrab's processing is controlled by three interrelated types of database table : task tables, Source tables and SQL tables.
The table names are user-defined. The fields indicated below must be present, but the user can add fields to the tables if needed.
WWWGrab is run from the command line with 2 arguments :
wwwgrab <DSN> <task table>
For example, this command runs the stock quote scraper sample :
"C:\Program Files\WWWGrab\WWWGrab" WWWGrab-Samples quotesget
"WWWGrab-Samples" is the DSN, "quotesget" is the task table name. See the Running WWWGrab topic for more information.
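A single task table controls the process, instructing WWWGrab to scan the URLs listed in Source tables and to execute the statements listed in SQL tables. It contains the following fields :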
| Task table | |
|---|---|
| Field: | Description: |
| N | The action index (1..N) |
| Exec | Execute this task (yes/no)? |
| Action | The type of task to perform - "Scan" or "ExecSQL" |
| Parameter | Data to be used by the task (a Source or SQL table) |
WWWGrab scans the task table executing the user-defined actions, which can be :
| Action: | Description: |
|---|---|
| Scan | Scan the URLs in the Source table whose name is specified in the Parameter field |
| ExecSQL | Execute the statements in the SQL table whose name is specified in the Parameter field |
The Scan and ExecSQL tasks can be run in any order. For example, you might want to execute SQL before and after a URL scan.
The following task table executes SQL before and after a URL scan :
| N | Exec | Action | Parameter |
|---|---|---|---|
| 1 | X | ExecSQL | PreSQL |
| 2 | X | Scan | WebSources |
| 3 | X | ExecSQL | PostSQL |
The Exec field allows the user to switch tasks on and off without having to add / remove them from the table.
This design is flexible enough to allow multiple passes, where, for example, a list of URLs generated by one step is scanned by the next step. Data from multi-level websites can be extracted using this feature.
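The control flow described above can be sketched in a few lines of Python. This is only an illustrative model, not WWWGrab's implementation: it uses an in-memory SQLite database instead of ODBC, stores the Exec flag as 1/0, and the `run_tasks` function and the two handlers are hypothetical names invented for the example.

```python
import sqlite3

def run_tasks(conn, task_table, handlers):
    """Walk a task table in N order, dispatching rows whose Exec flag is set."""
    # Table names cannot be bound as SQL parameters, so the identifier is quoted.
    rows = conn.execute(
        f'SELECT N, Exec, Action, Parameter FROM "{task_table}" ORDER BY N'
    ).fetchall()
    log = []
    for n, exec_flag, action, parameter in rows:
        if not exec_flag:
            continue  # task switched off; the row stays in the table
        handlers[action](parameter)
        log.append((n, action, parameter))
    return log

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Tasks (N INTEGER, Exec INTEGER, Action TEXT, Parameter TEXT)")
conn.executemany(
    "INSERT INTO Tasks VALUES (?, ?, ?, ?)",
    [
        (1, 1, "ExecSQL", "PreSQL"),
        (2, 1, "Scan", "WebSources"),
        (3, 0, "ExecSQL", "PostSQL"),  # Exec off: skipped
    ],
)

performed = []
handlers = {
    "Scan": lambda name: performed.append(f"scan {name}"),
    "ExecSQL": lambda name: performed.append(f"sql {name}"),
}
log = run_tasks(conn, "Tasks", handlers)
print(performed)
```

Switching task 3 back on only requires changing its Exec value; no rows are added or removed.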
The URLs and associated data are specified in a "source" database table containing the following information :
| Source table | |
|---|---|
| Field: | Description: |
| URL | The Uniform Resource Locator |
| Scan | Scan this URL (yes/no)? |
| Source | User's description of the URL; passed to Set Machine variable "Source" |
| Parser | The Set Machine .PXD file to be used to parse the URL's data |
| Options | Option string |
| UserName | The user ID (if any) required to access the URL |
| Password | The password (if any) required to access the URL |
| Notes | User notes, free-format, ignored by WWWGrab |
WWWGrab scans the Sources from beginning to end. Each URL is parsed per the associated Set Machine parser (.PXD file).
The following Source table instructs WWWGrab to fetch and parse data from 3 URLs :
| Scan | Source | URL | Parser |
|---|---|---|---|
| X | Con Ed | http://finance.yahoo.com/q?s=ED | Quotes01 |
| X | Borland | http://finance.yahoo.com/q?s=BORL | Quotes01 |
| X | GMOT | http://finance.yahoo.com/q?s=GMA | Quotes01 |
All three URLs are parsed with the Quotes01.PXD parser. The Options, UserName and Password fields, which are not typically used, are not shown in this example.
The task table can trigger the execution of SQL (database Structured Query Language) statements in tables containing the following information :
| SQL table | |
|---|---|
| Field: | Description: |
| N | An index (1..N) controlling order of statement execution |
| Exec | Execute this SQL statement (yes/no)? |
| Command | A (valid!) SQL statement |
SQL tables can be used to prepare data in the Source tables, for example.
The following SQL table instructs WWWGrab to execute 2 SQL statements to prepare the ContributionsSources table for a URL scan :
| N | Exec | Command |
|---|---|---|
| 1 | X | update ContributionsSources set Parser='Contributions' |
| 2 | X | update ContributionsSources set Source='Contributions - '+zip |
In this case the ContributionsSources table has an extra field, "zip". Fields can be added to the WWWGrab tables if necessary to facilitate SQL manipulations.
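To try out this kind of preparation step away from the real database, the two statements can be replayed against a scratch table. The sketch below is an assumption-laden stand-in: it uses Python's built-in SQLite rather than the ODBC database, the example.com URLs and zip values are made up, and the second statement is rewritten with `||` because SQLite does not concatenate strings with `+` the way Microsoft Access SQL does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ContributionsSources "
    "(URL TEXT, Scan INTEGER, Source TEXT, Parser TEXT, zip TEXT)"
)
conn.executemany(
    "INSERT INTO ContributionsSources (URL, Scan, zip) VALUES (?, 1, ?)",
    [("http://example.com/a", "10001"), ("http://example.com/b", "94105")],
)

# (N, Exec, Command) rows, mirroring the SQL table layout.
commands = [
    (1, True, "update ContributionsSources set Parser='Contributions'"),
    (2, True, "update ContributionsSources set Source='Contributions - ' || zip"),
]
for n, exec_flag, command in sorted(commands):
    if exec_flag:
        conn.execute(command)

rows = conn.execute(
    "SELECT Source, Parser FROM ContributionsSources ORDER BY zip"
).fetchall()
print(rows)
```

String concatenation syntax varies between DBMSs, so a Command that works with one ODBC driver may need adjusting for another.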
Once again, an Exec field allows the user to switch tasks on and off without having to add / remove them from the table.
The Command field must contain a valid SQL statement.
WWWGrab version 1.3 fetches and parses static web pages. More powerful web data acquisition methods will be available in future releases.
Before you can run WWWGrab you have to set it up :
[*] The DSN for the samples is configured automatically during installation.
WWWGrab is a Windows console (command line) program. It accepts two command line arguments : the DSN and the name of the task table.
Once WWWGrab is installed and configured enter :
wwwgrab <DSN> <task table>
at the Windows command prompt, or other places where command lines are used, such as Windows shortcuts, batch files or the task scheduler. For example :
wwwgrab yourdb getprices
... where "yourdb" is the DSN configured for the database and "getprices" is a task table in that database.
To run the stock quote scraper sample :
"C:\Program Files\WWWGrab\WWWGrab" WWWGrab-Samples quotesget
The "WWWGrab" shortcuts created by the installation are set up to run the samples :
wwwgrab WWWGrab-Samples QuotesGet
wwwgrab WWWGrab-Samples RentalsGet
wwwgrab WWWGrab-Samples TablesGet
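WWWGrab can also be started from a script. The following Python sketch builds the two-argument command line described above; the helper names `build_wwwgrab_command` and `run_wwwgrab` are hypothetical, and this help file does not document what exit code WWWGrab returns, so the return value of `run_wwwgrab` should be treated with caution.

```python
import subprocess

def build_wwwgrab_command(dsn, task_table, exe="wwwgrab"):
    """Return the argument list: the executable, the DSN, then the task table."""
    return [exe, dsn, task_table]

def run_wwwgrab(dsn, task_table, exe="wwwgrab"):
    """Run WWWGrab and return the process exit code (its meaning is not documented here)."""
    return subprocess.run(build_wwwgrab_command(dsn, task_table, exe)).returncode

cmd = build_wwwgrab_command(
    "WWWGrab-Samples", "quotesget", exe=r"C:\Program Files\WWWGrab\WWWGrab"
)
# Show the command as it would be typed at the Windows command prompt.
print(subprocess.list2cmdline(cmd))
```

Passing the arguments as a list avoids having to quote the space in "Program Files" by hand.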
WWWGrab works in conjunction with Set Machine. Both products offer a 30-day trial period. Register WWWGrab by registering Set Machine!
WWWGrab uses ODBC, "Open Database Connectivity", to interact with user-defined databases. Under ODBC, databases are assigned labels, or Data Source Names ("DSNs"). The WWWGrab installer automatically configures the WWWGrab-Samples DSN for the sample database.
To use another database the user must configure a DSN for the database. DSN configuration involves little more than applying a system-wide label to an existing database. Use the "Configure DSN" shortcut created during installation to start the ODBC Data Source Administrator, then :
Visit wwwgrab.com/db.html for more information about databases, DSNs and ODBC.
Web pages are written using HTML, Hypertext Markup Language. WWWGrab parsers are, therefore, HTML parsers, so configuring WWWGrab requires some knowledge of HTML. For example, parsing an HTML table requires an understanding of the TABLE, TR, TH, and TD tags, among others.
In simple cases, the WWWGrab stock quote sample for instance, HTML tags can be ignored. But for more sophisticated parsing the user should be familiar with the basics of HTML - many good books are available on the subject.
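For readers new to the TABLE, TR, TH and TD tags, a short Python sketch may help show how they nest. It uses Python's standard `html.parser` module, not Set Machine, and the `TableRows` class and the header names in the sample markup are made up for illustration.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of each TH/TD cell, grouped by TR row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a TR
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

html = """
<table>
  <tr><th>Rank</th><th>State</th><th>Population</th></tr>
  <tr><td>01</td><td>California</td><td>36553215</td></tr>
  <tr><td>02</td><td>Texas</td><td>23904380</td></tr>
</table>
"""
parser = TableRows()
parser.feed(html)
print(parser.rows)
```

Each TR becomes one list, and each TH or TD inside it becomes one entry in that list.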
Most web browsers allow the user to view the HTML source of a web page. The View/Source menu option in Microsoft Internet Explorer displays HTML source with the Notepad program. Netscape Navigator displays the source with tags and content in different colors, making it easier to see the structure of the web page. WWWGrab supports a "noparse" option that allows saving the URL content to a local file.
Set Machine's design allows creation of reusable parsing components: "node groups". The following HTML parsing node groups are installed with WWWGrab:
| HTMLElement.PXG | - Generic HTML element parser |
| HTMLTag.PXG | - Generic HTML tag parser |
| TR.PXG | - HTML "TR" (table row) parser |
| TD.PXG | - HTML "TD" (table data) parser |
Use of these node groups allows for a relatively clean top-level .PXD file. Consult Set Machine's help for more information.
The WWWGrab installation comes with a few sample configurations :
The Quotes sample is a very basic illustration of WWWGrab, extracting a single number (stock quote) from each requested web page.
The Rentals sample extracts rental listings from the CraigsList website.
The Tables sample parses an HTML table. It makes extensive use of Set Machine node groups. See Set Machine help for more information.
Note: The WWWGrab installer automatically configures the WWWGrab-Samples ODBC Data Source Name (DSN). The installer runs the DSNConfig.EXE program, which reads file SampleDSN.txt to configure the sample DSN. The WWWGrab uninstaller removes this DSN.
This WWWGrab sample extracts stock quote data from http://finance.yahoo.com/ and transmits it to the QuotesOutput table in the sample database. A sample Microsoft Access database (Samples.MDB) accompanies the distribution package, but you can use any database (DBMS) that provides a suitable ODBC driver. This sample should work on Windows systems with :
The WWWGrab Quotes sample uses three tables : QuotesGet (the task table), QuotesSources (the Source table) and QuotesOutput (the output table).
The QuotesGet task table simply directs WWWGrab to scan the QuotesSources table :
| N | Exec | Action | Parameter |
|---|---|---|---|
| 1 | X | Scan | QuotesSources |
The QuotesSources table lists the stock quotes of interest :
| Scan | Source | URL | Parser |
|---|---|---|---|
| X | First Solar, Inc. | http://finance.yahoo.com/q?s=FSLR | Quotes01.pxd |
| X | Evergreen Solar | http://finance.yahoo.com/q?s=ESLR | Quotes01.pxd |
| X | Energy Conversion Devices | http://finance.yahoo.com/q?s=ENER | Quotes01.pxd |
The "Source" field is user-defined text passed into Set Machine, where it can be transmitted to the output (via Set Machine variable "Source"). This example uses the same .PXD file (Quotes01) for each entry; different .PXD files can be used if needed. After a run, the QuotesOutput table contains records like these :
| Source | Quote | Grabbed |
|---|---|---|
| First Solar, Inc. | $122.00 | 12/5/2008 12:06:04 AM |
| Evergreen Solar | $2.27 | 12/5/2008 12:06:04 AM |
| Energy Conversion Devices | $25.81 | 12/5/2008 12:06:05 AM |
The quote is extracted from the website. The date/time (Grabbed) is generated by WWWGrab/Set Machine. The Source text is passed from the input table into the output. Open the sample configuration file, Quotes01.PXD, with Set Machine to see how this simple HTML parser extracts data from web pages fetched by WWWGrab.
You can modify the QuotesSources table to extract any number of quotes. The URL should end with a valid symbol :
http://finance.yahoo.com/q?s=GOOG
... using Google as an example.
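When many symbols are involved, the QuotesSources rows can be generated rather than typed. The Python sketch below only builds the row values; the function names are hypothetical, and storing the Scan flag as the text "X" is an assumption based on how the tables are displayed in this help file.

```python
from urllib.parse import urlencode

BASE = "http://finance.yahoo.com/q"

def quote_url(symbol):
    """Build the quote page URL, ending with the given stock symbol."""
    return f"{BASE}?{urlencode({'s': symbol})}"

def source_rows(stocks, parser="Quotes01.pxd"):
    """Return (Scan, Source, URL, Parser) tuples for a list of (name, symbol) pairs."""
    return [("X", name, quote_url(symbol), parser) for name, symbol in stocks]

rows = source_rows([("Google", "GOOG"), ("First Solar, Inc.", "FSLR")])
for row in rows:
    print(row)
```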
This WWWGrab sample extracts rental listing data from CraigsList and transmits it to the RentalsOutput table in the sample database.
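The WWWGrab Rentals sample uses four tables : RentalsGet (the task table), RentalsPreSQL (the SQL table), RentalsSources (the Source table) and RentalsOutput (the output table).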
The RentalsGet task table first runs the SQL in the RentalsPreSQL table to delete all the records in the RentalsOutput table. It then starts the scan of the RentalsSources table, which populates RentalsOutput with the desired listings.
This WWWGrab sample extracts data from a web page table.
This sample uses three database tables :
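Keep in mind that the word "table" is used two ways here : for the database tables that configure WWWGrab and receive its output, and for the HTML table on the web page being parsed.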
TablesSources contains the URL / parser list :
| Scan | Source | URL | Parser |
|---|---|---|---|
| X | Wikipedia | http://en.wikipedia.org/wiki/List_of_U.S._states_by_population | Tables01.pxd |
The Tables01.PXD parser starts by looking for "Locators" (a Set Machine string set) in order to skip over tables that we're not interested in. The entries in the Locators string set contain keywords that distinguish the desired table from other tables that may precede it in the web page being parsed. Click on the "Configure parsers" shortcut and open Tables01.PXD to see how this works.
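The idea of using locator keywords to skip unwanted tables can be illustrated outside Set Machine. The Python sketch below is a loose analogy under stated assumptions, not a description of how Tables01.PXD works internally: `find_table` is a hypothetical helper, the sample page is invented, and the simple regular expression does not handle nested tables.

```python
import re

# Matches one <table>...</table> block; nested tables are not supported.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.IGNORECASE | re.DOTALL)

def find_table(html, locators):
    """Return the first table block containing any locator keyword, else None."""
    for match in TABLE_RE.finditer(html):
        block = match.group(0)
        lowered = block.lower()
        if any(keyword.lower() in lowered for keyword in locators):
            return block
    return None

page = """
<table><tr><td>Home</td><td>Search</td></tr></table>
<table><tr><th>Rank</th><th>State</th></tr>
<tr><td>01</td><td>California</td></tr></table>
"""
block = find_table(page, ["State", "Population"])
print(block)
```

The navigation table that comes first on the page contains none of the keywords, so it is skipped.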
| Source | TD1 | TD2 | TD3 | TD4 |
|---|---|---|---|---|
| Wikipedia | 01 | 01 | California | 36553215 |
| Wikipedia | 02 | 02 | Texas | 23904380 |
| Wikipedia | 03 | 03 | New York | 19297729 |
| Wikipedia | ... | ... | ... | ... |
Open the sample configuration file, Tables01.PXD, with Set Machine to see how this HTML parser extracts data from web pages fetched by WWWGrab.
Note the choice of generic field names "TD1" .. "TD4" in the database (Samples.MDB) and the parser (Tables01.PXD). If you adapt this sample you can make those names more meaningful.
WWWGrab supports several options that can be specified on a per-URL basis, in the Source table "Options" field :
| Option: | Purpose: |
|---|---|
| noparse | Writes the URL's content directly to a file without parsing it [*] |
| trace | Increases the amount of diagnostic output produced |
| retries | Sets the maximum number of retries if an HTTP redirect occurs [**] |
| decode | Parses the URL using the AfxParseURLEx ICU_DECODE option [***] |
| noencode | " AfxParseURLEx ( ICU_NO_ENCODE ) |
| nometa | " AfxParseURLEx ( ICU_NO_META ) |
| encodespacesonly | " AfxParseURLEx ( ICU_ENCODE_SPACES_ONLY ) |
| browsermode | " AfxParseURLEx ( ICU_BROWSER_MODE ) |
[*] The "noparse" option enables you to write the fetched URL's content directly to a local file, without modification. The output file path is specified in the Source table "Parser" field. This option provides the unmodified results of the HTTP Get command. This can be very useful for parser development.
[**] The maximum number of HTTP redirects defaults to 3.
[***] WWWGrab uses a function called "AfxParseURLEx" to parse the URL prior to fetching the data. Visit http://msdn.microsoft.com/ for more information on AfxParseURLEx and its options.
| Scan | Source | URL | Parser | Options |
|---|---|---|---|---|
| X | Sample data | http://www.wwwgrab.com/sample.html | Tables01 | retries=0 |
| X | Sample data | http://www.wwwgrab.com/sample.html | c:\urldump\sample1.html | noparse |
| X | Sample data | http://www.wwwgrab.com/sample.html | Tables01 | trace noencode |
The 1st example sets the maximum number of HTTP redirect retries to zero.
The 2nd example writes the fetched URL's content, unmodified, to file c:\urldump\sample1.html.
The 3rd example illustrates the combination of two options, separated by a blank.
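The option string format shown in these examples (blank-separated names, with "retries" taking a value after "=") can be modelled as follows. This Python sketch is a guess at reasonable behavior, not WWWGrab's own parsing code: `parse_options` is a hypothetical function, and the case-insensitive matching and the error raised for unknown tokens are assumptions.

```python
KNOWN_FLAGS = {
    "noparse", "trace", "decode", "noencode",
    "nometa", "encodespacesonly", "browsermode",
}

def parse_options(option_string):
    """Split a blank-separated option string into a dict of settings."""
    options = {"retries": 3}  # documented default for the redirect limit
    for token in (option_string or "").split():
        name, sep, value = token.partition("=")
        name = name.lower()
        if name == "retries" and sep:
            options["retries"] = int(value)
        elif name in KNOWN_FLAGS and not sep:
            options[name] = True
        else:
            raise ValueError(f"unrecognized option: {token!r}")
    return options

print(parse_options("retries=0"))
print(parse_options("noparse"))
print(parse_options("trace noencode"))
```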
| WWWGrab.exe | the WWWGrab executable |
| WWWGrab.html | WWWGrab help (this file) |
| SetMachine.dll | Set Machine component |
| SetMachine.exe | Set Machine GUI program |
| SetMachine.chm | Set Machine help file |
| cw3230mt.dll | library required by Set Machine |
| mfc71d.dll | library required by WWWGrab |
| msvcp71d.dll | " |
| msvcr71d.dll | " |
| pxdb.dll | " (Set Machine database access DLL) |
| pxx.dll | " (Set Machine extension DLL) |
| pxxx.dll | " (Set Machine extension DLL) |
| fsel.dll | " (Set Machine email folder selection DLL) |
| DSNConfig.exe | DSN configuration program |
| ReadMe.txt | General information |
| *.PXD | sample parsers |
| *.PXG | Set Machine node group parsers, used by *.PXD |
| Samples.MDB | sample MS Access database |
| SampleDSN.TXT | used by DSNConfig.exe to configure the sample DSN |
Questions, comments? Contact WWWGrab.com.