Parsing HTML in .NET
Parsing XML in .NET is well supported by the Framework’s XmlDocument and XPathDocument classes, but unfortunately there’s no similar built-in provision for parsing HTML. If you’ve ever tried to parse HTML by hand, maybe using string manipulation or regular expressions, then you’ll know just how frustrating and difficult it can be.
Luckily, Simon Mourier’s excellent Html Agility Pack solves this problem. It’s one of the most useful .NET libraries I’ve come across, and will parse even the sloppiest HTML into an orderly structure of objects, similar to the XML DOM.
The pack is bundled with full C# source code, API documentation, and samples showing how to convert HTML into plain text, RSS and XML.
Here’s a short code example to show it in action:
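// Requires a reference to the Html Agility Pack (using HtmlAgilityPack;).
// Runs inside an ASP.NET page, where Response is available.
HtmlWeb htmlWeb = new HtmlWeb();
HtmlDocument doc = htmlWeb.Load("http://www.bbc.co.uk/");
// Select every anchor element that has an href attribute.
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
// SelectNodes returns null, not an empty collection, when nothing matches.
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        Response.Write(link.Attributes["href"].Value + "<br>");
    }
}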
There are loads of great applications for HTML parsing. Here are just a few: