

Spider in .NET

Crawl Web Sites and Catalog Info to Any Data Store with ADO.NET and Visual Basic .NET

Mark Gerlach

Code download available at: SpiderInNet.exe (549 KB)

This article assumes you're familiar with Visual Basic, .NET, and ADO


SUMMARY

Visual Basic .NET comes loaded with features not available in previous versions, including a new threading model, custom class creation, and data streaming. Learn how to take advantage of these features with an application that is designed to extract information from Web pages for indexing purposes. This article also discusses basic database access, file I/O, extending classes for objects, and the use of opacity and transparency in forms.

Contents

Coding the Application
Creating the Threads
Searching the Stream
Handling the Errors
Error Logging
My Little Repository
What Happened to ItemData?
Transparency and Opacity
Conclusion

Whenever you begin to use a new language, or the language you're using gets a major update, all the new syntax, interface changes, new functions, and such can take some practice to master. To get some hands-on experience with Visual Basic® .NET, I decided to rewrite an application from my company's catalog to enhance its features, and at the same time, showcase some of the features of the language.

This article is based on the reconstruction of an HTML spider application built in Visual Basic 6.0. The purpose of the original app was to pull raw HTML from a number of different Web sources, parse the data into a set of database tables, and then figure out what information a user might want to see. It read and followed the directions in the robots.txt file on a site, would save complete HTML code for local site building on a user's hard drive, and would log information about major site structure changes and any errors it encountered. For the reconstructed product, which I named Spider, I retained as much of the previous version's functionality as possible. In addition, I wanted to really put the Microsoft® .NET Framework to the test. Note that Spider was first built using the Beta 2 release of Visual Studio® .NET and the new version has been tested with the final release of Visual Studio.

While the product was originally well designed, there were certain features I wanted to include that couldn't be built easily in Visual Basic 6.0. One such feature is a true multithreaded object that can crawl pages independent of a parent application (regardless of what the other threads are doing), and report back to the parent application periodically with its status. This feature is the perfect application of .NET threading models.

In this article I will only discuss the key features that showcase techniques specific to Visual Basic .NET: crawling through multithreading, adhering to robots.txt, eliminating unknown file extensions, Internet streaming, UI modifications, error logging, and database operations. Obviously, a final release product would do much more.

In addition, the screens and functions offered by the program depend on the spider's applied use. For instance, it would be useless to have a parsing module if writing a copy of the HTML stream to the local drive is all that's needed. If you plan to run the product on a multiprocessor box, the limitation of five threads might seem a bit confining. Because apps like this spider frequently need to be created in several versions to meet a variety of requirements, I decided to build more configuration features into it. The goal was to allow it to interact with a number of different file formats, HTML types, and parsing instructions.

Coding the Application

I began coding the project by opening Visual Studio .NET and creating a new Visual Basic project using the Windows Application template. I typed in the name of my new project, Spider, and clicked OK. Next, I created the UI. I did this first so I'd have a clear-cut picture of what the spider would look like to the user. Besides, it was a good place to start.

For the most part, the form creation process was quite similar to previous versions of Visual Basic. However, since Visual Basic .NET shows you the source code it uses to build the controls on a form (previously hidden at the top of the .frm file), I found that I could copy the source code for FormA and paste it into FormB, and the controls from FormA would be instantly recreated on FormB. In Visual Basic 6.0, the code would be transferred, but controls would have to be moved using a separate copy/paste process.

Figure 1 Main Screen


Once the forms were created, I added controls to each of them. I configured the main application screen to look like Figure 1 and the Options screen to look like Figure 2.

Figure 2 Options Screen


Finally, I added the classes and modules that I would need. By right-clicking the project name in the Solution Explorer window, I created a module named modUtils and two classes, clsBoxOverride and clsSpider. With all of the structures created, I could start adding the code behind them.

Creating the Threads

As I mentioned earlier, threading was a major factor in the decision to rebuild this application, so let's take a look at it now. There are two main components of this application involved in threading—the spider application itself and the class object that's fired by the application to retrieve pages from a Web site. Look at Figure 3 to see the overall architecture of the spider application and how the different pieces interact.

Figure 3 Spider Thread Model


Spider allows the user to choose the number of active threads to fire during processing. In this application, the number is limited to five, since most readers won't be running quad-processor boxes to beta test spider code. The application fires as many individual clsSpider objects as the user requests, and keeps track of each thread and its current state. These thread processes, in turn, grab HTML source from the page they are directed to access, write the HTML source (parsed if necessary) to the local database, and pass back to the spider application any new received links that need to be processed. As the individual threads proceed, they periodically update the main application with their current status through the ehThreadStatus event handler. This handler updates the main form in the application with thread information, the current progress for large file streams, and completed files. When complete, each thread automatically shuts itself down. There is no need to call the Thread.Abort method.

Thread-safe code is of the utmost importance for this type of application. Otherwise, the threads will step all over one another, causing the program to run more slowly than a single-threaded application—or much worse, crash altogether. To prevent this, the Visual Basic .NET design team provided the ability to insert a SyncLock block.

The SyncLock statement takes an expression that evaluates to the object to be locked. For an example, take a look at the code in the ehFinished event handler in frmMain (available in the code download listed at the top of this article). Here, the thread attempts to obtain a lock on the m_sPages array (which holds the pages that have yet to be retrieved) by executing the following code:

SyncLock (m_sPages)
    ' Insert code dealing with the SyncLock Block here
End SyncLock

Since all three of Spider's events are raised with the RaiseEvent statement, the event handlers run on the thread that raised them (in this case, the thread running the clsSpider class) and not on the thread of the main application. Because of this, it's extremely important to use the SyncLock statement to ensure that two (or more) threads do not modify the m_sPages array at the same time, as this can corrupt the array.

If the SyncLock cannot be obtained immediately, the application will wait until it can be obtained. If a SyncLock can be obtained, all other threads are kept from accessing the m_sPages array until the SyncLock block is exited. This block performs some utility functions, such as figuring out if the extension of the returned pages is on the list of accepted suffixes. But the most important function is the addition to the m_sPages array of new pages, found by the current thread, that need to be retrieved and parsed. Without the ability to SyncLock the threads, two threads might try to add new elements to the top of the array at the same time and try to write two duplicate values to the upper bound of the array. This could result in a system error or perhaps even the loss of pages that need to be parsed.
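The locking pattern described above can be sketched as a small console program. This is a minimal illustration rather than the code from the download: the names PageQueueDemo, m_oPages, and AddPages are hypothetical, and an ArrayList stands in for the m_sPages array.

```vbnet
Imports System
Imports System.Collections
Imports System.Threading

Module PageQueueDemo
    ' Hypothetical stand-in for the m_sPages array described in the article.
    Private m_oPages As New ArrayList()

    ' Each worker adds 1,000 entries; the SyncLock block lets only one
    ' thread at a time modify the shared list.
    Private Sub AddPages()
        Dim i As Integer
        For i = 1 To 1000
            SyncLock m_oPages
                m_oPages.Add("page" & i.ToString())
            End SyncLock
        Next i
    End Sub

    Sub Main()
        Dim oThreads(4) As Thread   ' five threads, indexes 0 to 4
        Dim i As Integer
        For i = 0 To oThreads.Length - 1
            oThreads(i) = New Thread(AddressOf AddPages)
            oThreads(i).Start()
        Next i
        ' Wait for every worker to finish before reading the count.
        For i = 0 To oThreads.Length - 1
            oThreads(i).Join()
        Next i
        ' With the lock in place, no additions are lost:
        ' 5 threads x 1,000 adds each.
        Console.WriteLine(m_oPages.Count)
    End Sub
End Module
```

Removing the SyncLock block from AddPages makes the outcome unpredictable, which is the kind of failure the paragraph above warns about.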

It is also important to note that at no time should SyncLock be used to lock threads or processes that make changes to a form or control. Forms and controls sometimes call back to a calling procedure. If this is the case, a deadlock will most likely occur. However, you will notice that the ehThreadStatus event handler uses a SyncLock block to lock the form before updating the progress bar controls. Because a deadlock might be caused by two or more threads trying to access that listbox simultaneously, it is important that the form be locked during update by a single thread. This is different from SyncLocking the thread that updates the form. By SyncLocking the form itself, I am ensuring that two threads will not step on each other while accessing it. There is no point to SyncLocking the thread during this process, since there are no other threads accessing it.

When the Process button is clicked, the application creates an array of Thread objects. The first thread in this array is used to pull back the first page on the target site. When the first page is returned, it is processed by the ehFinished event handler. In the meantime, Spider is in a loop waiting for the first thread to return or for the m_sPages array to have additional values inserted into it. If the first thread returns with no additional pages, the application exits the processing phase and proceeds to the next selected site.

As long as the m_sPages array contains values, the loop will assign one page from the array to an inactive clsSpider thread. At the same time this assignment is made, the page to be processed is removed from the m_sPages array and added to the m_sPagesDone array. The m_sPagesDone array is also checked before firing any new threads to make sure the page about to be crawled has not yet been done. This process is continued until there are no pages remaining in any site that's been crawled, and there are no sites remaining to be crawled.
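The bookkeeping between the pending list and the done list might look something like the following sketch. It assumes two ArrayList objects in place of the m_sPages and m_sPagesDone arrays, and the function name sNextPage is hypothetical; the actual loop in the download also assigns each page to an inactive thread.

```vbnet
Imports System
Imports System.Collections

Module PageDispatchDemo
    ' Hypothetical stand-ins for the m_sPages and m_sPagesDone arrays.
    Private m_oPending As New ArrayList()
    Private m_oDone As New ArrayList()

    ' Returns the next page that has not been crawled yet, or Nothing
    ' when the pending list is empty.
    Private Function sNextPage() As String
        SyncLock m_oPending
            Do While m_oPending.Count > 0
                Dim sPage As String = CStr(m_oPending(0))
                m_oPending.RemoveAt(0)
                ' Skip pages that were already handed out.
                If Not m_oDone.Contains(sPage) Then
                    m_oDone.Add(sPage)
                    Return sPage
                End If
            Loop
        End SyncLock
        Return Nothing
    End Function

    Sub Main()
        m_oPending.Add("/default.asp")
        m_oPending.Add("/about.asp")
        m_oPending.Add("/default.asp")   ' duplicate link found on another page

        Dim sPage As String = sNextPage()
        Do While Not sPage Is Nothing
            Console.WriteLine("Crawling " & sPage)
            sPage = sNextPage()
        Loop
    End Sub
End Module
```

Moving the page to the done list at the moment it is handed out, inside the same lock, means a second thread cannot pick up the same page between the check and the update.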

Searching the Stream

.NET introduces to Visual Basic a concept known as streaming. A stream is basically a sequence of bytes flowing to or from a source such as an Internet request, a file, or an I/O device. Spider deals with two stream types: Internet and file.

An Internet stream is a relatively simple concept. All Internet servers stream data to a client through TCP/IP packets in response to requests sent to that server. To add this functionality to the spider, I first imported the System.Net namespace by including the following line near the top of my class:

Imports System.Net

Next, I created a Web request:

Dim myReq As HttpWebRequest = CType(WebRequest.Create(m_sURLToProcess), HttpWebRequest)

The m_sURLToProcess is a string sent by the application that tells each thread which URL to process. Next, I created a response object as follows:

Dim myResponse As HttpWebResponse = CType(myReq.GetResponse(), HttpWebResponse)

At this point, I have a response object holding a stream of HTML data from the target site. But I also want to know what page the Web server sent back. For instance, in some cases I requested the root of the site (https://www.optigontech.net/) and the Web server returned the default page (https://www.optigontech.net/default.asp), which I need to know for logging to the database, and also to ensure that I do not reexamine that page if I find a link to it in another HTML stream. So, I call the following line:

m_sPageResponded = myResponse.ResponseUri.ToString()

Another issue that I need to address at this point is that the stream is not usable in its current state. The incoming data needs to be in the form of a byte array or string, something I can manipulate. To get it into this form, I created some holders for the stream along with a BinaryReader object, which will facilitate my reading from the stream (BinaryReader lives in the System.IO namespace, so that namespace must be imported as well):

Dim iContentLength As Integer, sTotalBuffer() As Byte
' ContentLength is a Long, so convert it explicitly
iContentLength = CInt(myResponse.ContentLength)
Dim br As New BinaryReader(myResponse.GetResponseStream())

Next, I used a loop based on content length to retrieve the stream block by block:

ReDim sTotalBuffer(iContentLength - 1)
Dim sBuffer() As Byte
m_iBytesRead = 1 : iTotalBytes = 0
Do Until m_iBytesRead = 0
    ReDim sBuffer(iContentLength - 1)
    m_iBytesRead = br.Read(sBuffer, 0, iContentLength)
    ReDim Preserve sBuffer(m_iBytesRead - 1)
    If m_iBytesRead > 0 Then Array.Copy(sBuffer, 0, sTotalBuffer, iTotalBytes, sBuffer.Length)
    iTotalBytes += m_iBytesRead
    RaiseEvent evThreadStatus(iTotalBytes, iContentLength, m_iThreadIndex)
Loop

Note the RaiseEvent line near the end of the loop. It tells the main program how much of the page has actually been converted from the stream. Remember, this event handler is called on the same thread that the clsSpider is running on. In addition, you should note the Array.Copy function:

Array.Copy(sBuffer, 0, sTotalBuffer, iTotalBytes, sBuffer.Length)

This line replaces the memory-copy API calls (such as CopyMemory) used in previous versions of Visual Basic. The Array.Copy method is much easier to use. By specifying a source array, starting element, target array, starting point, and number of elements to copy, I can copy chunks of memory quickly without having to make complex calls to external libraries.

In the actual program, you'll notice an If statement that determines the content length. There are two types of Internet streams: those that allow seeks and those that don't. I like to think of this in terms of scrolling. In a seekable stream, I can move back and forth within the stream object, retrieving bytes from any position. In a non-seekable stream I must move firehose-style in one direction through the stream. The code in the clsSpider class determines which type of stream has been retrieved and how to approach that stream for the retrieval of information.
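For a forward-only stream whose length is not known up front, one common approach is to read fixed-size blocks into a growable buffer until Read returns zero. The sketch below is an assumption about how that could be done, not the code from clsSpider; the names StreamReadDemo and ReadToEnd are hypothetical, and a MemoryStream stands in for the response stream so the example runs without a network connection.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module StreamReadDemo
    ' Reads a stream to its end in fixed-size blocks. This works even when
    ' the total length is not known in advance (for example, when
    ' HttpWebResponse.ContentLength reports -1).
    Private Function ReadToEnd(ByVal oSource As Stream) As Byte()
        Dim oCollected As New MemoryStream()
        Dim bBlock(4095) As Byte          ' 4 KB block; the size is arbitrary
        Dim iRead As Integer = oSource.Read(bBlock, 0, bBlock.Length)
        Do While iRead > 0
            ' Only the first iRead bytes of the block are valid.
            oCollected.Write(bBlock, 0, iRead)
            iRead = oSource.Read(bBlock, 0, bBlock.Length)
        Loop
        Return oCollected.ToArray()
    End Function

    Sub Main()
        ' Stand-in for myResponse.GetResponseStream().
        Dim bHtml() As Byte = Encoding.ASCII.GetBytes("<html><body>Hello</body></html>")
        Dim oSource As New MemoryStream(bHtml)
        Dim bResult() As Byte = ReadToEnd(oSource)
        Console.WriteLine(Encoding.ASCII.GetString(bResult))
    End Sub
End Module
```

Because the loop never seeks or asks for the stream's length, the same function works for seekable and non-seekable streams alike.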

When this code is finished, I have a byte array (sTotalBuffer) that contains the complete HTML listing from the page. From here, I pass this data to a number of different functions for processing.

Handling the Errors

Now that I have basic functionality in both the spider application and the clsSpider class, it is time to address error handling. Visual Basic .NET adds structured error handling that far surpasses the clunky On Error Goto handlers of previous releases. With Try, Catch, and Finally, Visual Basic .NET lets a developer write well-structured error-handling routines. I found this very useful in some areas of Spider.

The ProcessRobots function, for example, looks for a robots.txt file in the root of the target Web server and always returns an error code of 5 when it can't find it. It's nice to be able to put something like this in the code:

Try
    •••
Catch When Err.Number = 5
    'This is just it telling us it didn't find the Robots.txt file
    Debug.WriteLine("ProcessRobots: " & Err.Number & ": " & _
        Err.Description & " " & Err.Erl)
    WriteErrorLog("ProcessRobots: " & Err.Number & ": " & _
        Err.Description & " " & Err.Erl)
End Try

I can also nest Try/Catch blocks inside of other blocks, which is useful when I know that certain errors are going to occur (such as problems with Unix Web servers or certain types of HTML streams). The StartThreading method is a good example of this. The initial Try block is where I start my thread. If at any time during this crucial process, I encounter an error, I want to try to shut down all active threads for the application and exit the function. Inside my first Catch block I have the following code:

Debug.WriteLine("StartThreading: " & Err.Number & ": " & _
    Err.Description & " " & Err.Erl)
WriteErrorLog("StartThreading: " & Err.Number & ": " & _
    Err.Description & " " & Err.Erl)

First, I insert the error handling for the error that I ran into. Then, I insert a nested Try/Catch block to shut down the individual threads as shown here:

Try
    •••
    Try
        For i = LBound(m_oThread) To UBound(m_oThread)
            If Not IsNothing(m_oThread(i)) Then m_oThread(i).Abort()
        Next i
    Catch
        Debug.WriteLine("StartThreading: " & Err.Number & ": " & _
            Err.Description & " " & Err.Erl)
        WriteErrorLog("StartThreading: " & Err.Number & ": " _
            & Err.Description & " " & Err.Erl)
    End Try
End Try

This is something that was not available in previous versions of Visual Basic without employing spaghetti code.
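Besides filtering on Err.Number, a Catch block can also filter on the exception type. The following sketch is not taken from Spider; it assumes a hypothetical function, sReadRules, that reads a rules file from the local disk, and it shows separate Catch blocks for an expected and an unexpected failure along with a Finally block for cleanup.

```vbnet
Imports System
Imports System.IO

Module TypedCatchDemo
    ' Returns the text of a local file, or an empty string when the file
    ' does not exist. Other errors are reported and rethrown.
    Private Function sReadRules(ByVal sPath As String) As String
        Dim oReader As StreamReader = Nothing
        Try
            oReader = New StreamReader(sPath)
            Return oReader.ReadToEnd()
        Catch ex As FileNotFoundException
            ' Expected case: no rules file is present.
            Console.WriteLine("sReadRules: no file at " & sPath)
            Return ""
        Catch ex As Exception
            ' Unexpected case: report it and let the caller decide.
            Console.WriteLine("sReadRules: " & ex.Message)
            Throw
        Finally
            ' Runs whether or not an exception occurred.
            If Not oReader Is Nothing Then oReader.Close()
        End Try
    End Function

    Sub Main()
        Dim sRules As String = sReadRules("no-such-robots.txt")
        Console.WriteLine("Length: " & sRules.Length.ToString())
    End Sub
End Module
```

Catch blocks are tested in order, so the more specific exception type has to come before the general one.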

Error Logging

In addition to the Try/Catch blocks for error handling, I like having a log file for errors that tells me where a certain process might have failed. So, I constructed a function to place errors in a file in the application's root. This is just a basic .txt file, but each entry records the date and time, the error number and description, the function that threw the error, and any applicable line numbers referencing the failed line.

I placed the function called WriteErrorLog into the modUtils.vb file. This is a good example of file I/O. First, I created a name for the error file:

Dim sErrFileName As String = Application.StartupPath & _
    "\Spider_" & Format(Now(), "yyyyMMdd") & ".txt"

Note the Application.StartupPath property, which has replaced the App.Path property from earlier versions of Visual Basic.

Then I opened a TextWriter object that will allow me to append text to the file stream:

Dim wt As TextWriter = File.AppendText(sErrFileName)

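The next line wraps the TextWriter object in a thread-safe wrapper so that multiple threads can write through it without stepping on one another. Note that Synchronized returns the wrapper rather than modifying the original object, so the result must be assigned back to the variable:

wt = TextWriter.Synchronized(wt)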

Next, I write a line to the file and close the TextWriter object:

wt.WriteLine(Format(Now, "hh:mm:ss tt") & " " & sErrString)
wt.Close()

In the older version of the spider, numerous errors were generated when multiple processes tried to log errors to a file at the same time; this slowed the application to a dead halt. With the addition of the TextWriter.Synchronized method, I don't have to worry about those types of errors in the reconstructed .NET application.
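Since a logging routine like this opens a new writer on every call, another way to keep threads from colliding on the file is to serialize the whole open, write, and close sequence with a SyncLock block. The sketch below is an alternative under that assumption, not the WriteErrorLog function from the download; the names LogDemo, m_oLogLock, m_sLogFile, and WriteLogLine are hypothetical, and the file is written to the temporary folder rather than the application's startup path.

```vbnet
Imports System
Imports System.IO
Imports System.Threading

Module LogDemo
    ' Hypothetical lock object shared by every thread that logs.
    Private m_oLogLock As New Object()
    Private m_sLogFile As String = Path.Combine(Path.GetTempPath(), "Spider_demo.txt")

    ' Appends one time-stamped line; the SyncLock block ensures that only
    ' one thread has the file open at a time.
    Public Sub WriteLogLine(ByVal sText As String)
        SyncLock m_oLogLock
            Dim wt As StreamWriter = File.AppendText(m_sLogFile)
            Try
                wt.WriteLine(DateTime.Now.ToString("hh:mm:ss tt") & " " & sText)
            Finally
                ' Close the file even if the write fails.
                wt.Close()
            End Try
        End SyncLock
    End Sub

    Private Sub Worker()
        Dim i As Integer
        For i = 1 To 50
            WriteLogLine("entry " & i.ToString())
        Next i
    End Sub

    Sub Main()
        If File.Exists(m_sLogFile) Then File.Delete(m_sLogFile)
        Dim t1 As New Thread(AddressOf Worker)
        Dim t2 As New Thread(AddressOf Worker)
        t1.Start() : t2.Start()
        t1.Join() : t2.Join()
        Console.WriteLine("Log written to " & m_sLogFile)
    End Sub
End Module
```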

My Little Repository

The full version of Spider has the ability to write to .txt files, SQL Server™ databases, and myriad other systems. For this article, the application writes to a simple Microsoft Access database. To accomplish this, I imported the relevant namespaces so I didn't have to type fully qualified names over and over:

Imports System.Data
Imports System.Data.OleDb
Imports System.Data.SqlClient

Next, I created a variable that contains my database connection string as shown here:

Public g_sConn As String = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" & _
    Application.StartupPath & "\spider.mdb"

After this, I was ready to access the database. I really didn't need anything fancy, so I created two functions—one that returns a ReadOnly datareader object (drGetReadOnly) and one that executes a SQL string query against the database of my choice (lExecuteSQL) and returns to the calling function the number of affected rows.

The functions drGetReadOnly and lExecuteSQL are shown in Figure 4 and Figure 5, respectively. The differences between the two functions are slight; drGetReadOnly uses the ExecuteReader method of the OleDbCommand object, whereas lExecuteSQL uses the ExecuteNonQuery method.

Figure 4 drGetReadOnly

'Retrieve a datareader from the database
Public Function drGetReadOnly(ByVal sSQL As String) As OleDbDataReader
    Try
1:      Dim cn As New OleDbConnection(g_sConn)
2:      cn.Open()
3:      Dim cmd As OleDbCommand = New OleDbCommand(sSQL, cn)
        'CloseConnection closes cn when the caller closes the reader,
        'so the connection is not left open
4:      Dim dr As OleDbDataReader = _
            cmd.ExecuteReader(CommandBehavior.CloseConnection)
5:      Return dr
    Catch
        'MsgBox(Err.Number & ": " & Err.Description & " " & Err.Erl)
10:     Debug.WriteLine("drGetReadOnly: " & Err.Number & ": " & _
            Err.Description & " " & Err.Erl)
500:    WriteErrorLog("drGetReadOnly: " & Err.Number & ": " & _
            Err.Description & " " & Err.Erl)
    End Try
End Function

Figure 5 lExecuteSQL

'Execute a passed SQL statement and return the number of records affected
Public Function lExecuteSQL(ByVal sSQL As String) As Long
    Try
1:      Dim cn As New OleDbConnection(g_sConn)
2:      cn.Open()
3:      Dim cmd As OleDbCommand = New OleDbCommand(sSQL, cn)
4:      Dim lRecsAffected As Long = cmd.ExecuteNonQuery()
5:      cn.Close()
6:      Return lRecsAffected
    Catch
        'MsgBox(Err.Number & ": " & Err.Description & " " & Err.Erl)
10:     Debug.WriteLine("lExecuteSQL: " & Err.Number & ": " & _
            Err.Description & " " & Err.Erl & vbCrLf & sSQL)
500:    WriteErrorLog("lExecuteSQL: " & Err.Number & ": " & _
            Err.Description & " " & Err.Erl & vbCrLf & sSQL)
    End Try
End Function

What Happened to ItemData?

In Visual Basic 6.0, I used the ItemData property of listboxes and comboboxes to hold information about each line. Most often, I used it to hold record IDs for each entry. If I wanted to hold more information, I had to create a collection and associate each item in the collection with an item in my listbox or combobox. In Visual Basic .NET, these methods have been replaced with a much more useful technique.

In Visual Basic .NET, I can create a custom class with its own properties (see Figure 6). In the class definition of clsBoxOverride, I override the ToString method and tell it which value I'd like it to return. I also create a constructor (Sub New) in the class to allow me to pass arguments to the class when it's instantiated.

Figure 6 clsBoxOverride

'#################################################
'
' clsboxOverride
' This is the class that we use to pass
' information to the site's listbox on the
' main page. It allows us to use an
' itemdata-like function - which no longer
' exists in Visual Basic .NET, but add additional
' elements that we can store alongside the
' entries in a listbox. We used to have
' to use a collection or an array to maintain
' these.
'
' Entry Point: New
'
'#################################################
Option Explicit On

Public Class clsboxOverride
    'This class was constructed to override the fact that there no longer
    'exists an itemdata property for the listboxes in my project
    Private m_sSiteName As String
    Private m_sSiteURL As String
    Private m_iSiteID As Long

    Public Sub New(ByVal sSiteName As String, ByVal sSiteURL As String, _
                   ByVal iSiteID As Long)
        m_sSiteName = sSiteName
        m_sSiteURL = sSiteURL
        m_iSiteID = iSiteID
    End Sub

    Public Property SiteName() As String
        Set(ByVal Value As String)
            m_sSiteName = Value
        End Set
        Get
            Return m_sSiteName
        End Get
    End Property

    Public Property SiteURL() As String
        Set(ByVal Value As String)
            m_sSiteURL = Value
        End Set
        Get
            Return m_sSiteURL
        End Get
    End Property

    Public Property SiteID() As Long
        Set(ByVal Value As Long)
            m_iSiteID = Value
        End Set
        Get
            Return m_iSiteID
        End Get
    End Property

    Public Overrides Function ToString() As String
        Return m_sSiteName
    End Function
End Class

I then create an instance of that class and add it to the listbox or combobox using the Items.Add method, passing parameters for default class values. The listbox control queries the ToString method to get a value to display. Instead of using a collection to gain access to extra properties, I simply reference the property to retrieve:

lstSites.Items(iIndex).SiteURL.ToString()

Note the differences in the clsBoxOverride class when dealing with properties. Set and Get are now defined within each property:

Public Property SiteName() As String
    Set(ByVal Value As String)
        m_sSiteName = Value
    End Set
    Get
        Return m_sSiteName
    End Get
End Property

In Spider, I use this functionality in a MouseMove event to display the URL associated with a site description in the lstSites listbox.
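Reading an extra property back from a listbox item can also be done with an explicit cast, which keeps working when Option Strict is turned on. The following is a minimal sketch under that assumption; SiteItem is a hypothetical, trimmed-down class in the spirit of clsBoxOverride, and the listbox is created in code rather than on a form.

```vbnet
Imports System
Imports System.Windows.Forms

' Hypothetical, trimmed-down item class in the spirit of clsBoxOverride.
Public Class SiteItem
    Public SiteName As String
    Public SiteURL As String

    Public Sub New(ByVal sSiteName As String, ByVal sSiteURL As String)
        SiteName = sSiteName
        SiteURL = sSiteURL
    End Sub

    ' The listbox displays whatever ToString returns.
    Public Overrides Function ToString() As String
        Return SiteName
    End Function
End Class

Module ListItemDemo
    Sub Main()
        Dim lstSites As New ListBox()
        lstSites.Items.Add(New SiteItem("Example site", "http://www.example.com/"))

        ' Items(0) is typed As Object, so cast it back before reading SiteURL.
        Dim oItem As SiteItem = DirectCast(lstSites.Items(0), SiteItem)
        Console.WriteLine(oItem.ToString() & " -> " & oItem.SiteURL)
    End Sub
End Module
```

The same cast could be used inside a MouseMove handler once the index of the item under the pointer is known.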

Transparency and Opacity

And last (but surely not least) I can add a little bit of pizzazz to the application through two new features in Visual Basic .NET—transparency and opacity.

I've always wanted to put a cool splash screen on my application, but was limited to boxy-looking, two-dimensional screens. Once again, the Visual Basic .NET development team gave me an option: form transparency. Basically, form transparency allows me to make one color transparent on any form I create. So on the form properties for the splash screen where I've placed my color logo on a white background, I set the TransparencyKey property to white. The result is just what I was looking for (see Figure 7). Any white areas on my splash screen show through to the application underneath.

Figure 7 Form Transparency


While I was at it, I decided to fade my main screen by using the new opacity settings for Visual Basic .NET forms:

'Change the opacity on the form - fade it into view
Me.Opacity = 0
Me.Show()
For i = 1 To 100 Step 10
    Me.Opacity = i / 100
    Application.DoEvents()
Next i
Me.Opacity = 1

A form's Opacity property can be a fractional value from 0 (invisible) to 1 (fully opaque). The block of code just shown raises the value of Opacity in steps of one-tenth until the form is fully visible. You can change the Step value or the divisor to make the fade finer or coarser.

Note the Application.DoEvents line. When this line is removed, the form tends to jump from invisible to visible. This line gives the display time to catch up with the application.

Conclusion

My company's clients currently use Spider to retrieve information from local, state, and national sites. With some tweaking, the spider can be used to get address listings from public Yellow Pages, or articles related to certain topics from all over the Web. It could even be used to retrieve newsgroup postings from a news server. Any Web page can be crawled (as long as you obtain permission for copyrighted material first). Information obtained can be placed in a database for later analysis.

Visual Basic .NET comes with many new and exciting features. With time and a little perseverance, you'll learn when to use each. Improved array and error handling, GUI tricks, and stream support are just some of the tools that you have at your disposal when programming in Visual Basic .NET. Go out and discover some more for yourself.

For related articles see:
Creating Classes in Visual Basic .NET
.NET Development

For background information see:
Creating Transparent Windows Forms
Walkthrough: Multithreading
Setting ToolTips for Controls on a Windows Form

Mark Gerlach is Director of Product Development for Optigon Technical Associates, a California-based Microsoft Solution Provider. Mark holds both MCSD and MCDBA certifications. He can be reached at gerlachm@optigon.net or mgerlach@mostwantedsoftware.com.