I wrote a web site scraping tool recently; the tool’s very specific, but I learned some general tips in the process.
The WebBrowser control is handy for displaying an HTML report. Assign the HTML text to its DocumentText property. If the HTML contains links, you can have these open in your default browser by handling the Navigating event:
private void webResults_Navigating(object sender, WebBrowserNavigatingEventArgs e)
{
    // If the user clicks on a link, open it in the default browser.
    // Setting DocumentText directly will send a URL of about:blank.
    if (e.Url.ToString() != "about:blank")
    {
        // Cancel navigation inside the control, then hand the URL to the shell.
        e.Cancel = true;
        System.Diagnostics.Process.Start(e.Url.ToString());
    }
}
The WebClient class has no built-in cookie support: it does not expose a CookieContainer. If you need to send or receive cookies, use HttpWebRequest (created through WebRequest.Create) instead:
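// Requires System.IO and System.Net.
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.UserAgent = _UserAgent;
// Cookies in this container are sent with the request, and any cookies
// the server sets in its response are stored back into it.
request.CookieContainer = cookies;
using (WebResponse response = request.GetResponse())
{
    using (Stream responseStream = response.GetResponseStream())
    {
        using (StreamReader reader = new StreamReader(responseStream))
        {
            // Read the whole response body into a string.
            html = reader.ReadToEnd();
        }
    }
}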
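If you would rather keep WebClient's convenience methods, one common workaround is to subclass it and attach a CookieContainer to every request it creates. The following is only a sketch of that idea, not what the tool described here does; the class name CookieAwareWebClient is hypothetical.

```csharp
using System;
using System.Net;

// Hypothetical helper (not part of the framework): a WebClient that
// shares one CookieContainer across all of its requests.
public class CookieAwareWebClient : WebClient
{
    public CookieContainer Cookies { get; } = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);

        // Only HTTP(S) requests carry cookies; leave other schemes alone.
        HttpWebRequest httpRequest = request as HttpWebRequest;
        if (httpRequest != null)
        {
            httpRequest.CookieContainer = Cookies;
        }
        return request;
    }
}
```

With this in place, add cookies to client.Cookies before calling DownloadString, and cookies set by the server should be kept for later requests made through the same client.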
I copied the _UserAgent static string contents from Help/About in Firefox. Cookies to send can also be copied from Firefox (but note that they can expire); create ones to send as follows:
CookieContainer cookies = new CookieContainer();
Cookie c = new Cookie(_AuthenticationCookieName,
    _AuthenticationCookieValue, "/", _Domain);
cookies.Add(c);
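When a hand-built cookie does not seem to reach the server, it can help to ask the container which cookies it would actually send to a given URL, since the domain and path must match. A small sketch under assumptions: the class and method names are hypothetical, and example.com in the usage note stands in for the real site.

```csharp
using System;
using System.Net;

// Hypothetical debugging helper for a CookieContainer.
public static class CookieCheck
{
    // Returns true if the container would send a cookie with the given
    // name to the given URL (that is, domain and path both match).
    public static bool WouldSend(CookieContainer cookies, Uri url, string name)
    {
        foreach (Cookie cookie in cookies.GetCookies(url))
        {
            if (cookie.Name == name)
            {
                return true;
            }
        }
        return false;
    }
}
```

For example, CookieCheck.WouldSend(cookies, new Uri("http://example.com/page"), "auth") tells you whether a cookie named auth would accompany a request for that page.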
The HtmlAgilityPack saved me a huge amount of time parsing retrieved web pages. Sure, you could do this with regular expressions, but why? HtmlAgilityPack lets you navigate HTML like XML, using XPath. It is available for free at http://www.codeplex.com/htmlagilitypack. I find the license unclear; I think it’s non-viral for collective works, but I’m not positive.
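To give a flavor of the XPath approach, here is a minimal sketch that collects the href of every link on a page. It assumes the HtmlAgilityPack assembly is referenced; the LinkExtractor class and its method are hypothetical names for illustration.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical example: list the target of every <a href="..."> in a page.
public static class LinkExtractor
{
    public static List<string> GetLinkTargets(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var targets = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return targets;
        }

        foreach (HtmlNode anchor in anchors)
        {
            targets.Add(anchor.GetAttributeValue("href", ""));
        }
        return targets;
    }
}
```

The same pattern works for any XPath expression; only the query string and the attribute or InnerText you read from each node change.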
(I still used regular expressions in a few places; they’re just too handy.)
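A typical case where a regular expression is still the quickest tool is pulling a value out of the text of a node you have already located. The sketch below is purely illustrative: the price format, pattern, and names are assumptions, not taken from the actual tool.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical example: extract a dollar amount such as "$19.99" from text.
public static class PriceParser
{
    private static readonly Regex PricePattern =
        new Regex(@"\$([0-9]+(?:\.[0-9]{2})?)");

    // Returns the first dollar amount found, or null if there is none.
    public static decimal? TryParsePrice(string text)
    {
        Match match = PricePattern.Match(text);
        if (!match.Success)
        {
            return null;
        }
        return decimal.Parse(match.Groups[1].Value, CultureInfo.InvariantCulture);
    }
}
```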
The Firefox DOM Inspector add-on was indispensable in analyzing the pages to scrape.
This all works well, but it took considerable time to write and test. It made me wish for a “browser macro recorder” similar to MS Word’s: something that would watch my activities and record them as script code, which could then be edited and replayed. Suggestions are welcome.
Development Central is the blog of Bill Sorensen, a professional software developer. Much of this will relate to C#, .NET, and OOP in general.
Disclaimer
These postings are provided "AS IS" with no warranties and confer no rights.