August 12, 2008
@ 10:17 PM

I ran across a really cool .NET library on a recent project I've been working on. We have an internal website where we post news, documentation, etc. - basically a Content Management System (CMS). We're working on a new set of documentation that is being written in a third-party help builder application. We need to import the HTML files it generates into our website so we get everything the site offers, like security, searching, revision tracking, view statistics, etc. So basically, I need to run through a lot of HTML files, build a tree of the documents (similar to the help file) and rewrite all of the URLs and image links to point to the correct URL inside of the site.

I initially started looking at various regular expressions that I might be able to use over at http://regexlib.com/. Almost every single one of them had some comment about it failing under some circumstances. The HTML is surprisingly clean, but I was still nervous about it. So I looked at using GOLD to parse the HTML. However, from some of the comments I found, it still didn't make everything as easy as I would have liked.

I finally ran across HtmlAgilityPack over on CodePlex. It's a .NET library which lets you read AND write changes to an HTML file via a simple API.

Here's a chunk of code from my importer so you can get a feel for how it works:

// Requires: using System; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load(content.FullDocumentPath);

// Select every <a> element that has an href attribute.
HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//a[@href]");
Content match = null;

// SelectNodes returns null when nothing matches, so run only if there are links.
if (linkNodes != null)
{
    // Fix up the URLs
    foreach (HtmlNode linkNode in linkNodes)
    {
        HtmlAttribute attrib = linkNode.Attributes["href"];

        // If it's an internal page anchor, ignore it
        if (attrib.Value.StartsWith("#"))
            continue;

        string path = this.GetAbsolutePath(content.DocumentLink, attrib.Value);
        match = this.m_contentList.Find(p => p.DocumentLink == path);

        if (match != null)
            attrib.Value = match.GetUrl();
        else if (!path.StartsWith("http://", StringComparison.OrdinalIgnoreCase)
              && !path.StartsWith("mailto:", StringComparison.OrdinalIgnoreCase))
            Console.WriteLine("Cannot find matching document, searched for " + path);
    }
}
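The GetAbsolutePath helper called in the loop above isn't shown in this post. For anyone trying something similar, here is a minimal sketch of what such a helper might look like. It is only a guess at the behavior, not the actual importer code: the LinkResolver class name and the import.invalid placeholder host are made up for illustration, and it assumes DocumentLink values are forward-slash paths relative to the root of the help output. It leans on System.Uri to resolve "../" segments and passes external links (other hosts, mailto:, and so on) through unchanged.

```csharp
using System;

static class LinkResolver
{
    // Placeholder root (hypothetical) used only so System.Uri can resolve relative paths.
    private static readonly Uri Root = new Uri("http://import.invalid/");

    // documentLink: path of the page containing the link, relative to the help root.
    // href: the raw href value found in that page.
    public static string GetAbsolutePath(string documentLink, string href)
    {
        Uri baseUri = new Uri(Root, documentLink);
        Uri resolved = new Uri(baseUri, href);

        // Anything that resolves outside the placeholder root is an external
        // link (another host, or a scheme such as mailto:), so return it as-is.
        if (resolved.Scheme != Root.Scheme || resolved.Host != Root.Host)
            return href;

        // Drop the leading '/' and decode escapes so the result can be
        // compared against a DocumentLink. Any #fragment is discarded.
        return Uri.UnescapeDataString(resolved.AbsolutePath.TrimStart('/'));
    }
}
```

With this sketch, a link to "../images/logo.png" inside "guide/setup/install.html" resolves to "guide/images/logo.png". Note that it throws away any #fragment on the link, which may or may not be what you want when matching against the document list.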

 

Basically, the SelectNodes() call in the importer code returns a collection of the links in the document (the selection string uses XPath syntax). From there, I just iterate through them, build the new URL, and save it back with code that boils down to: linkNode.Attributes["href"].Value = "New URL Here". I also needed to strip out all the script tags inside of the document, which uses similar syntax:

private void StripOutScripts(HtmlDocument doc)
{
    // Strip out the scripts
    HtmlNodeCollection scriptNodes = doc.DocumentNode.SelectNodes("//script");
    if (scriptNodes != null)
    {
        foreach (HtmlNode scriptNode in scriptNodes)
        {
            scriptNode.ParentNode.RemoveChild(scriptNode, false);
        }
    }
}

 

I do the same sort of thing here: iterate over the collection, except this time tell it to remove the nodes from the document. Note that I'm going through the parent node: each scriptNode is the <script> element itself (tags and contents), and a node is removed by asking its parent to drop it. Passing false as the second argument tells RemoveChild not to keep the script's child nodes, so the tags and everything inside them go away.

Each node has a WriteContentTo() method which can write the HTML for that section of the document to a TextWriter. What's really nice about this entire library (besides how simple it is to use) is that it doesn't seem to mangle the existing HTML when using WriteContentTo() (at least from what I've seen). Only one minor complaint: the docs are a bit weak. They just include the standard documentation of the classes, without much in the way of examples. However, the API is pretty consistent, so it doesn't take much to get started with it.

What a great library - it couldn't be simpler. It saved me a ton of time.

Links:

http://www.codeplex.com/htmlagilitypack
http://regexlib.com/
http://www.devincook.com/goldparser/


 