In this post, we are going to see how to parse an HTML web page with C# and read its content easily thanks to the HTML Agility Pack library.
Today, a large amount of information reaches users through web pages. Giving our programs the ability to read those pages is very useful for automating processes.
There are many cases where we might need to read HTML. For example, we can automatically check the status of an order, track a shipment, or watch a product's price for changes, among countless other applications.
Parsing HTML with C# is not too difficult, but we can make it even easier thanks to the HTML Agility Pack library, available at https://html-agility-pack.net/.
For now, the library is Open Source and the code is hosted at https://github.com/zzzprojects/html-agility-pack. And we say “for now” because in the past the author has converted some of his Open Source libraries into commercial ones.
With HTML Agility Pack, we can parse the HTML into a tree of nodes. The library includes functions to locate child nodes, or nodes that meet a set of properties, and we can even apply LINQ for searches.
We can load a web page with HTML Agility Pack from a file, from a string in memory, or directly from the Internet. The three snippets below are alternatives, so each one declares its own doc variable.
// From File
var doc = new HtmlDocument();
doc.Load(filePath);
// From String
var doc = new HtmlDocument();
doc.LoadHtml(html);
// From Web
var url = "https://html-agility-pack.net/";
var web = new HtmlWeb();
var doc = web.Load(url);
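Since the three snippets above each declare their own `doc`, they cannot be pasted into a single method as they stand. One way to keep them side by side is to wrap each in a small helper. This is only a sketch: the `HtmlLoader` class and its method names are hypothetical and not part of the library, and the HtmlAgilityPack NuGet package is assumed to be installed.

```csharp
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical helper class (not part of HTML Agility Pack) that wraps
// the three loading options so each lives in its own scope.
public static class HtmlLoader
{
    // Parse an HTML file stored on disk.
    public static HtmlDocument FromFile(string filePath)
    {
        var doc = new HtmlDocument();
        doc.Load(filePath);
        return doc;
    }

    // Parse HTML that is already in memory as a string.
    public static HtmlDocument FromString(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc;
    }

    // Download a page and parse it (performs a network request).
    public static HtmlDocument FromWeb(string url)
    {
        var web = new HtmlWeb();
        return web.Load(url);
    }
}
```

With this in place, a call such as `var doc = HtmlLoader.FromString("<p>Hi</p>");` gives us a document ready to be queried.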
Once the document is loaded, we can get one or several nodes using XPath expressions.
var node = doc.DocumentNode.SelectSingleNode("//head/title");
var nodes = doc.DocumentNode.SelectNodes("//article");
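To show how the two query styles fit together, here is a minimal sketch that runs an XPath query and an equivalent LINQ search over the same document. The sample markup and the `QueryExample` class are made up for illustration. One detail worth knowing is that `SelectNodes` returns null rather than an empty collection when nothing matches, so the sketch guards against that.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public class QueryExample
{
    public static void Main()
    {
        // Hypothetical sample markup, used only for illustration.
        const string html = @"<html><body>
            <article class=""post""><h2>First</h2></article>
            <article class=""post featured""><h2>Second</h2></article>
        </body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // XPath: SelectNodes returns null when no node matches.
        var articles = doc.DocumentNode.SelectNodes("//article");
        Console.WriteLine(articles?.Count ?? 0); // 2

        // LINQ: Descendants returns an IEnumerable<HtmlNode> we can filter.
        var featured = doc.DocumentNode
            .Descendants("article")
            .Where(a => a.GetAttributeValue("class", "").Split(' ').Contains("featured"))
            .Select(a => a.SelectSingleNode(".//h2")?.InnerText)
            .ToList();
        Console.WriteLine(string.Join(", ", featured)); // Second
    }
}
```

XPath tends to be more compact for structural queries, while LINQ is convenient when the filter depends on ordinary C# logic.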
Once we have a node, we can use HTML Agility Pack to find its child nodes or read its content, including attributes, name, class, text, etc.
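// First() requires 'using System.Linq;'
var price = node.Descendants("a").First().Attributes["data-price"].Value;
var name = node.Name;
var outerHtml = node.OuterHtml;
var innerText = node.InnerText;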
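The chained access above throws an exception if the node has no `a` descendant or if the attribute is missing. A more defensive variant might look like the following sketch, which uses `FirstOrDefault` and `GetAttributeValue` with a fallback value. The markup, the `data-stock` attribute, and the `AttributeExample` class are hypothetical.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public class AttributeExample
{
    public static void Main()
    {
        // Hypothetical sample markup, used only for illustration.
        const string html =
            "<div class=\"product\"><a href=\"/item/1\" data-price=\"19.99\">Blue mug</a></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Matches in this sample; on real pages, check for null first.
        var node = doc.DocumentNode.SelectSingleNode("//div");

        // FirstOrDefault returns null instead of throwing when there is no <a>.
        var link = node.Descendants("a").FirstOrDefault();

        // GetAttributeValue returns the fallback when the attribute is absent.
        var price = link?.GetAttributeValue("data-price", "n/a") ?? "n/a";
        var stock = link?.GetAttributeValue("data-stock", "unknown") ?? "unknown";

        Console.WriteLine($"{node.Name} / {link?.InnerText} / {price} / {stock}");
        // div / Blue mug / 19.99 / unknown
    }
}
```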
If the node's content contains HTML entities (which is normal), we can "clean" it and convert it into plain text with the help of the HtmlDecode method of the WebUtility class, found in the System.Net namespace.
var text = System.Net.WebUtility.HtmlDecode(node.Descendants("a").First().InnerText);
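As a quick illustration of what the decoding step does, this small sketch decodes a hard-coded string instead of a real node's `InnerText`. The sample string and the `DecodeExample` class are made up for the example, and only the base class library is needed.

```csharp
using System;
using System.Net;

public class DecodeExample
{
    public static void Main()
    {
        // Text as it might come out of InnerText, with entities still encoded.
        const string raw = "Fish &amp; chips &lt; &#36;10";

        // HtmlDecode turns named and numeric entities back into characters.
        string text = WebUtility.HtmlDecode(raw);

        Console.WriteLine(text); // Fish & chips < $10
    }
}
```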
However, HTML Agility Pack only reads the HTML code of the page; it does not execute the associated JavaScript. This is a problem with modern dynamic pages, where the initially loaded HTML (which is sometimes almost empty) is later modified by JavaScript.
A possible solution is to load the web page in an embedded browser control (such as WebBrowser or WebView2), which does execute the page's scripts, and then parse the rendered content with HTML Agility Pack.
In any case, it's a useful tool for many applications, and well worth keeping in our toolbox of favorite libraries.

