I wrote a web site scraping tool recently; the tool’s very specific, but I learned some general tips in the process.
The WebBrowser control is handy for displaying an HTML report. Assign the HTML text to its DocumentText property. If the HTML contains links, you can have these open in your default browser by handling the Navigating event:
private void webResults_Navigating(object sender, WebBrowserNavigatingEventArgs e)
{
    // If the user clicks on a link, open it in the default browser.
    // Setting DocumentText directly will send a URL of about:blank.
    if (e.Url.ToString() != "about:blank")
    {
        System.Diagnostics.Process.Start(e.Url.ToString());
        e.Cancel = true;
    }
}
The WebClient class does not appear to work properly with cookies; it exposes no CookieContainer property. If you need to send or receive cookies, use HttpWebRequest (created through WebRequest.Create) instead:
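// Requires: using System.IO; using System.Net;
string html;
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.UserAgent = _UserAgent;
// Cookies in this container are sent with the request, and any
// cookies the server sets in its response are stored back into it.
request.CookieContainer = cookies;
using (WebResponse response = request.GetResponse())
using (Stream responseStream = response.GetResponseStream())
using (StreamReader reader = new StreamReader(responseStream))
{
    html = reader.ReadToEnd();
}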
I copied the _UserAgent static string contents from Help/About in Firefox. Cookie values can also be copied from Firefox (but note that they can expire). Create the cookies to send as follows:
CookieContainer cookies = new CookieContainer();
// Constructor arguments are name, value, path, domain.
Cookie c = new Cookie(_AuthenticationCookieName,
    _AuthenticationCookieValue, "/", _Domain);
cookies.Add(c);
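If WebClient's convenience methods (DownloadString and friends) are still wanted, one commonly used workaround is to subclass it and attach a CookieContainer in GetWebRequest. The sketch below is an assumption about how that could look, not part of the original tool; the class name CookieAwareWebClient is made up for illustration. Newer .NET versions mark WebClient and WebRequest as obsolete in favor of HttpClient, so treat this as a fit for the .NET Framework era of the code above.

```csharp
using System;
using System.Net;

// Hypothetical helper (not from the original tool): a WebClient that
// attaches one shared CookieContainer to every HTTP request it creates.
public class CookieAwareWebClient : WebClient
{
    private readonly CookieContainer _cookies = new CookieContainer();

    // Add cookies here before making a request; cookies set by the
    // server end up in the same container.
    public CookieContainer Cookies
    {
        get { return _cookies; }
    }

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);

        // Only HTTP(S) requests carry cookies; other request types
        // (file://, ftp://) are returned unchanged.
        HttpWebRequest httpRequest = request as HttpWebRequest;
        if (httpRequest != null)
        {
            httpRequest.CookieContainer = _cookies;
        }
        return request;
    }
}
```

Usage would follow the same pattern as above: add the authentication cookie to client.Cookies, then call client.DownloadString(url).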
The HtmlAgilityPack saved me a huge amount of time parsing retrieved web pages. Sure, you could do this with regular expressions, but why? HtmlAgilityPack lets you navigate HTML like XML, using XPath. It is available for free at http://www.codeplex.com/htmlagilitypack. I find the license unclear; I think it’s non-viral for collective works, but I’m not positive.
(I still used regular expressions in a few places; they’re just too handy.)
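To make the "navigate HTML like XML" point concrete, here is a minimal sketch that pulls every link target out of a page with a single XPath query. It assumes the HtmlAgilityPack assembly is referenced; the LinkExtractor class and GetLinks method are hypothetical names chosen for this example.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper for illustration only.
public static class LinkExtractor
{
    // Returns the href value of every <a> element that has one.
    public static List<string> GetLinks(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        List<string> links = new List<string>();

        // SelectNodes returns null (not an empty collection)
        // when the XPath expression matches nothing.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors != null)
        {
            foreach (HtmlNode anchor in anchors)
            {
                links.Add(anchor.GetAttributeValue("href", string.Empty));
            }
        }
        return links;
    }
}
```

The null check matters in practice: a page layout change that removes the matched elements would otherwise surface as a NullReferenceException rather than an empty result.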
The Firefox DOM Inspector add-on was indispensable in analyzing the pages to scrape.
This all works well, but it took considerable time to write and test. It made me wish for a “browser macro recorder” similar to MS Word’s; something that would watch my activities and record them as script code. The code could then be edited and replayed. Suggestions are welcome.
Update - September 14, 2009
I’m somewhat embarrassed about my “I still used regular expressions in a few places” comment. I’ve ranted on Stack Overflow and Meta Stack Overflow about the evils of using regular expressions to parse HTML and XML. Please use regular expressions with care! They are wonderful tools, but can cause maintenance nightmares when misused.
Update - September 19, 2009
I found a link to the Html Agility Pack to LINQ to XML Converter today. I have not tried this yet, but it’s a great idea!
I mentioned that I liked Google Reader recently; that’s still true. I think it’s one of the best Web-based applications I’ve used, and it works well for reading RSS feeds.
There’s some recent news about a privacy issue, though: Google Reader Becomes Holiday Snitch
Here’s Google’s response: Managing your shared items
By default, Reader items are not shared. My initial reaction was that people overreacted; shared=public. After a brief look at the documentation, though, it’s not clear just how public shared items are. I can see how this could upset some people.
I wouldn’t advise using Google Reader (or just about any public Web application) for any data that must be kept private. I intend to keep using the service, as I could safely share my feeds with anyone at work.
I read some amazing code today. From an OOP perspective, it was extremely well designed - small classes, short methods, little duplication, encapsulated, polymorphic, etc. It was also more sophisticated than anything that I’ve written recently; I learned some techniques from reading it.
There was only one problem with it - it had no comments.
Code can and should reveal intent through good design and careful naming. Comments in the body of a method are often a code smell. The same is not true for class and method header comments, though.
Especially with code that uses design patterns or advanced language features, comments are a must. Without them, we perpetuate the priesthood; only the guru can maintain his or her own code, as no one else can understand it.
Once martial artists reach a certain level, they are asked to teach less advanced students. Developers could benefit from this attitude. Those with other views should be forced to maintain their own code for eternity. ;-)
I stress quality over and over at work, but it struck me today that maintainability is the true goal.
I can write code of impeccable quality; every class carefully designed, every method refactored to the nth degree. If it’s hard to change when change is required, or if no one can understand it but me, it’s not very maintainable.
The two do go hand in hand; if you have to choose, though, pick maintainability.
Both are easy to sacrifice in the name of expedience. This leads to technical debt.
Apologies don’t pay down technical debt. Placing blame doesn’t help. If you’re working with the code, refactoring is your responsibility. Just do it.
(At least in a shared code environment.)
Karl Seguin keeps getting better. His latest in the Foundations of Programming series is on unit testing. The “Why wasn’t I unit testing 3 years ago?” section is a must-read.
Foundations of Programming - Part 5 – Unit Testing
I’m particularly pleased that he likes Rhino Mocks, too.
So what’s your excuse for not unit testing? :-)
Foundations of Programming on Karl Seguin’s blog
“I used to be confident in my programming skill, but only once I accepted that I knew very little, and likely always would, did I start to actually understand.” - Karl Seguin
This link told me a lot about what I was doing wrong; my XSL is much better now!
http://www.deez.info/sengelha/blog/2006/02/27/disabling-default-xslt-templates/
I’ve never had good luck finding an open source XSL editor for Windows - until now. I finally tried searching on SourceForge, and found XTrans. It’s intuitive, and seems to work as well as (or better than) the underlying MSXML engine. It includes a handy XPath query tester, too.
The program assumes you know XSL; it will give you a list of tags with short descriptions, but that’s about it. Still, it’s a major timesaver.
I know there are several expensive commercial offerings, but I don’t write XSL often enough to justify that.
Personally, I don’t care for XSL. I find it confusing and difficult to debug. I could write a C# program to parse the XML and create the output in less time than it takes to develop a transformation. XPath, on the other hand, is a wonderful timesaver.
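As a small illustration of XPath doing the heavy lifting from C#, the sketch below uses the System.Xml classes that ship with .NET. The XmlQuery class, the SelectText method, and the feed-list XML in the usage note are all made up for this example.

```csharp
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper for illustration only.
public static class XmlQuery
{
    // Returns the text content of every node matched by the XPath
    // expression, in document order.
    public static List<string> SelectText(string xml, string xpath)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);

        List<string> results = new List<string>();
        foreach (XmlNode node in doc.SelectNodes(xpath))
        {
            results.Add(node.InnerText);
        }
        return results;
    }
}
```

For a document such as `<feeds><feed category='tech'><title>A</title></feed>...</feeds>`, the query `/feeds/feed[@category='tech']/title` selects only the titles of the tech feeds; the filtering that would otherwise be a loop with conditionals lives in one expression.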
On The Meaning of “Coding Horror”
“You’re an amateur developer until you realize that everything you write sucks.” - Jeff Atwood
[Standing ovation]
I have tried a couple of RSS feed readers in the past, and was not overly impressed. I did not put much time into the search; the idea was to save time, after all.
Recently I tried Google Reader, and I’ve been favorably impressed.
Pros:
The only con I’ve found is the privacy concern. Mine are all tech feeds, so that’s not an issue for me.
Be sure to add the “Subscribe as you search” bookmark to your toolbar; it’s under Settings/Goodies in the Reader.
Free and worth a try, if you don’t have a favorite reader already.
Note - please see the update on 12/27/2007.
I was a big fan of Borland’s Delphi, but I used to rail against the company for placing features in their Enterprise and Architect editions seemingly at random. I felt they were pricing key features beyond the reach of individual developers and small companies.
Microsoft has done one better; they’ve priced features beyond the reach of mid-sized companies.
(Prices are retail for the full version.)
For a mid-sized corporation with 20 developers, the difference in initial investment between the Professional and Team Editions is $93,400 (about $4,670 per developer). Think you can get that in this year’s budget?
And what does Microsoft consider high-end features? Try integrated Unit Testing, Code Coverage, Static Code Analysis, and Profiling.
Although Microsoft only offers these options with the Team Editions, developers can obtain alternatives for all of them:
NUnit is particularly excellent; compare it to the latest JUnit if you don’t believe me. FxCop is great, too; there’s a new beta out that I haven’t tried.
That said, I’d love to have top-notch integrated tools. I’ve seen a Microsoft presenter demo the integration, and it’s slick. The metrics for 2008 look even better; see http://blogs.msdn.com/fxcop/.
But if I were a manager, I’d be hard-pressed to justify the expense.
http://www.eweek.com/article2/0,1895,2219167,00.asp
Thank you, Mr. Somasegar!
If only more managers understood the concept of Technical Debt …
I’ve been trying to come up with a good way to test classes in our domain layer at work. Standard unit testing dogma is that tests should be fast and isolated. Testing against a database is not a good way to meet those criteria.
The problem is that we use persistence frameworks (O/R mappers). For Delphi we rolled our own; for .NET, we use Gentle.NET. Persisting business objects becomes very easy; however, the objects depend on the persistence layer.
So how do you mock the database? One way is to use a local or in-memory database. We’ve successfully used SQL Server Express to set up local file-based databases for testing. These can be treated as test decks, checked into source control, and restored to an initial state as needed. The main problem is that they’re a lot of work to set up, and more work to maintain as the database schema evolves.
I’m not sure what the solution is. I’m familiar with mock objects and dependency injection, but I don’t know where the interfaces belong. Suggestions are very welcome.
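One commonly suggested arrangement, offered here as a sketch rather than an answer, is to declare the persistence interface next to the domain classes and let the persistence layer implement it. Domain logic then depends only on the interface, and tests can substitute an in-memory fake. Every name below (Customer, ICustomerRepository, CustomerService, InMemoryCustomerRepository) is hypothetical, and the sketch assumes persistence calls can be moved behind such an interface, which may not hold for objects that persist themselves through the O/R mapper.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical domain class; not taken from any real code base.
public class Customer
{
    public int Id;
    public string Name;
    public bool IsActive;
}

// Declared alongside the domain classes. A production implementation
// (for example, one wrapping the O/R mapper) would live in the
// persistence layer and implement this interface.
public interface ICustomerRepository
{
    Customer FindById(int id);
    void Save(Customer customer);
}

// Domain logic under test; it sees only the interface.
public class CustomerService
{
    private readonly ICustomerRepository _repository;

    public CustomerService(ICustomerRepository repository)
    {
        _repository = repository;
    }

    public void Deactivate(int id)
    {
        Customer customer = _repository.FindById(id);
        if (customer == null)
        {
            throw new ArgumentException("Unknown customer " + id);
        }
        customer.IsActive = false;
        _repository.Save(customer);
    }
}

// Test double: a dictionary stands in for the database.
public class InMemoryCustomerRepository : ICustomerRepository
{
    private readonly Dictionary<int, Customer> _rows =
        new Dictionary<int, Customer>();

    public Customer FindById(int id)
    {
        Customer customer;
        return _rows.TryGetValue(id, out customer) ? customer : null;
    }

    public void Save(Customer customer)
    {
        _rows[customer.Id] = customer;
    }
}
```

A test would construct CustomerService with the in-memory repository, call Deactivate, and assert on the stored customer, with no database involved. The trade-off is that the mapping itself goes untested, so a smaller set of database-backed tests is still likely to be needed.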
Bugs happen.
Bugs plague restaurants, too. But how often would you eat at one where you regularly found flies in your soup?
If the customer sees a bug, there’s a problem.
Update - December 5, 2007
One of my coworkers (Wally) mentioned that the restaurant workers don’t want the customer to eat the soup, either.
The Microsoft guidelines recommend allowing an application to terminate if it does not know how to handle an exception. When an unhandled exception bubbles all the way up the stack, you have no idea what the state of your application is.
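In practice this usually means recording what happened and then letting the process end, rather than swallowing the exception. A minimal sketch for a console-style application follows; the CrashReporter class and its method names are hypothetical, and writing to standard error stands in for whatever logging the application really uses.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class CrashReporter
{
    // Builds the text to log. ExceptionObject is typed as object
    // because the runtime does not guarantee it derives from Exception.
    public static string Describe(object exceptionObject)
    {
        Exception ex = exceptionObject as Exception;
        return ex != null
            ? ex.GetType().FullName + ": " + ex.Message
            : "Non-exception object thrown: " + exceptionObject;
    }

    // Call once at startup, before any real work begins.
    public static void Install()
    {
        AppDomain.CurrentDomain.UnhandledException +=
            delegate(object sender, UnhandledExceptionEventArgs e)
            {
                Console.Error.WriteLine("Fatal: " + Describe(e.ExceptionObject));
                // No attempt is made to resume. Returning from the
                // handler lets the runtime terminate the process.
            };
    }
}
```

The handler deliberately does as little as possible, since the application's state can no longer be trusted at this point.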
Three things I hate: bugs, debugging, and manual testing.
Years ago, I built a reputation for writing (relatively) bug-free code. My methodology was:
This worked. The problem was that changing a tested procedure required repeating the manual tests. I hated maintenance, and would argue against feature requests that affected the existing design. Unfortunately, maintenance and change requests are a large percentage of our jobs.
At a Borland Convention one year, I went to a seminar on reducing bugs in Delphi applications or some such. (I don’t remember who presented it; an Australian fellow, I think.) The speaker spent some time on Unit Testing and test frameworks; I was intrigued.
Eventually, I found that automated tests kept me out of the debugger. I no longer hated or feared changes.
Reading Beck (XP) and Fowler (Refactoring) taught me that complex code is hard to change and prone to bugs, while simple code is easy to change and debug.
I still write stable code. I just enjoy it more.
Development Central is the blog of Bill Sorensen, a professional software developer. Much of this will relate to C#, .NET, and OOP in general.