Web scraping in .NET

12/30/07

I wrote a web site scraping tool recently; the tool’s very specific, but I learned some general tips in the process.

The WebBrowser control is handy for displaying an HTML report. Assign the HTML text to its DocumentText property. If the HTML contains links, you can have them open in your default browser by handling the Navigating event:


private void webResults_Navigating(object sender, WebBrowserNavigatingEventArgs e)
{
  // If the user clicks on a link, open it in the default browser.
  // Setting DocumentText directly will send a URL of about:blank.

  if (e.Url.ToString() != "about:blank")
  {
    System.Diagnostics.Process.Start(e.Url.ToString());
    e.Cancel = true;
  }
}
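
For completeness, here's how the pieces connect (a minimal sketch; reportHtml is just a stand-in for whatever HTML you've generated):

// Hook the event, then display the report.
// Assumes a WebBrowser control named webResults on the form.
webResults.Navigating += webResults_Navigating;
webResults.DocumentText = reportHtml;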

The WebClient class does not handle cookies; it has no CookieContainer property, so cookies are neither sent nor stored automatically. If you need to send or receive cookies, use the HttpWebRequest class instead:


HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

request.UserAgent = _UserAgent;
request.CookieContainer = cookies;

string html;

// Read the entire response body as a string.
using (WebResponse response = request.GetResponse())
{
  using (Stream responseStream = response.GetResponseStream())
  {
    using (StreamReader reader = new StreamReader(responseStream))
    {
      html = reader.ReadToEnd();
    }
  }
}

I copied the contents of the _UserAgent static string from Help/About in Firefox. Cookie values can also be copied from Firefox (but note that they can expire). Create the cookies to send as follows:


CookieContainer cookies = new CookieContainer();

Cookie c = new Cookie(_AuthenticationCookieName,
  _AuthenticationCookieValue, "/", _Domain);

cookies.Add(c);
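
One nice side effect: because the request and the container share state, any cookies the server sets in a response land back in the same CookieContainer, so reusing one container keeps a login session alive across requests. A quick sketch (FetchPage and the URLs are hypothetical; FetchPage stands in for the request code above):

// Reuse the same container so session cookies persist across requests.
// FetchPage is a hypothetical wrapper around the HttpWebRequest code above.
string listPage = FetchPage(listUrl, cookies);     // server may set more cookies
string detailPage = FetchPage(detailUrl, cookies); // they're sent back automatically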

The HtmlAgilityPack saved me a huge amount of time parsing retrieved web pages. Sure, you could do this with regular expressions, but why? HtmlAgilityPack lets you navigate HTML like XML, using XPath. Available for free at http://www.codeplex.com/htmlagilitypack. I find the license unclear; I think it’s non-viral for collective works, but I’m not positive.
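
To give a flavor of the XPath navigation, here's a minimal sketch (the //a[@href] expression is illustrative; substitute whatever your target pages need):

// Parse the fetched HTML and walk it with XPath.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

// Note: SelectNodes returns null (not an empty collection) when nothing matches.
HtmlAgilityPack.HtmlNodeCollection links =
  doc.DocumentNode.SelectNodes("//a[@href]");

if (links != null)
{
  foreach (HtmlAgilityPack.HtmlNode link in links)
  {
    Console.WriteLine("{0} -> {1}",
      link.InnerText, link.GetAttributeValue("href", ""));
  }
}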

(I still used regular expressions in a few places; they’re just too handy.)

The Firefox DOM Inspector add-on was indispensable for analyzing the pages to scrape.

This all works well, but it took considerable time to write and test. It made me wish for a “browser macro recorder” similar to MS Word’s; something that would watch my activities and record them as script code. The code could then be edited and replayed. Suggestions are welcome.

Update - September 14, 2009

I’m somewhat embarrassed about my “I still used regular expressions in a few places” comment. I’ve ranted on Stack Overflow and Meta Stack Overflow about the evils of using regular expressions to parse HTML and XML. Please use regular expressions with care! They are wonderful tools, but can cause maintenance nightmares when misused.

Update - September 19, 2009

I found a link to the Html Agility Pack to LINQ to XML Converter today. I have not tried it yet, but it's a great idea!

