A day in the life (of a developer)
 Tuesday, August 12, 2008

I ran across a really cool .NET library on a recent project. We have an internal website where we post news, documentation, etc. - basically a Content Management System (CMS). We're working on a new set of documentation that is being written in a third-party help builder application. We need to import the HTML files it generates into our website so the docs get everything the site offers: security, searching, revision tracking, view statistics, etc. So basically, I need to run through a lot of HTML files, build a tree of the documents (similar to the help file), and rewrite all of the URLs and image links to point to the correct URL inside of the site.

I initially started looking at various regular expressions that I might be able to use over at http://regexlib.com/. Almost every single one of them had some comment about it failing under some circumstances. The HTML is surprisingly clean, but I was still nervous about it. So I looked at using GOLD to parse the HTML. However, from some of the comments I found, it still didn't make everything as easy as I would have liked. I finally ran across HtmlAgilityPack over on CodePlex. It's a .NET library which lets you read AND write changes to an HTML file via a simple API.

Here's a chunk of code from my importer so you can get a feel for how it works:

HtmlDocument doc = new HtmlDocument();
doc.Load(content.FullDocumentPath);
HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//a/@href");

Content match = null;

// Run only if there are links in the document.
if (linkNodes != null)
{
    // Fix up the URLs
    foreach (HtmlNode linkNode in linkNodes)
    {
        HtmlAttribute attrib = linkNode.Attributes["href"];

        // If it's an internal page anchor, ignore it
        if (attrib.Value.StartsWith("#"))
            continue;

        string path = this.GetAbsolutePath(content.DocumentLink, attrib.Value);
        match = this.m_contentList.Find(p => p.DocumentLink == path);

        if (match != null)
            attrib.Value = match.GetUrl();
        else if (!path.ToLower().StartsWith("http://") && !path.ToLower().StartsWith("mailto:"))
            Console.WriteLine("Cannot find matching document, searched for " + path);
    }
}
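
GetAbsolutePath() is a helper that isn't shown above. A rough sketch of what such a helper could look like follows. It is not the importer's actual code: the class name, the dummy root URL, and the assumption that DocumentLink is a forward-slash relative path such as "guide/setup/install.html" are all made up for illustration.

```csharp
using System;

public static class LinkPaths
{
    // Hypothetical stand-in for the importer's GetAbsolutePath() helper.
    // Resolves a link found in a page against that page's own path.
    public static string GetAbsolutePath(string documentLink, string href)
    {
        // Leave absolute links (http:, https:, mailto:, ...) untouched.
        // The IsFile check keeps rooted paths like "/foo" from being
        // mistaken for absolute file URIs on Unix-style platforms.
        Uri absolute;
        if (Uri.TryCreate(href, UriKind.Absolute, out absolute) && !absolute.IsFile)
            return href;

        // A dummy root (assumption) gives System.Uri something to resolve against.
        Uri root = new Uri("http://localhost/");
        Uri page = new Uri(root, documentLink);
        Uri target = new Uri(page, href);

        // AbsolutePath drops any "#fragment" or "?query" part, and trimming
        // the leading "/" makes the result look like a DocumentLink again.
        return Uri.UnescapeDataString(target.AbsolutePath).TrimStart('/');
    }
}
```

Under those assumptions, a link of "../images/logo.png" found in "guide/setup/install.html" would resolve to "guide/images/logo.png", which can then be compared against each DocumentLink in the content list.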

 

Basically, doc.DocumentNode.SelectNodes("//a/@href") returns a collection of the links in the document (the selection string uses XPath syntax). From there, I just iterate through them, build the new URL, then save the modified URL with code that just does: linkNode.Attributes["href"].Value = "New URL Here". I also needed to strip out all the script tags inside of the document, which uses similar syntax:

private void StripOutScripts(HtmlDocument doc)
{
    // Strip out the scripts
    HtmlNodeCollection scriptNodes = doc.DocumentNode.SelectNodes("//script");
    if (scriptNodes != null)
    {
        foreach (HtmlNode scriptNode in scriptNodes)
        {
            scriptNode.ParentNode.RemoveChild(scriptNode, false);
        }
    }
}

 

I do the same sort of thing - iterate over the collection - except this time I tell it to remove the nodes from the document. Note that the removal goes through the parent node: each node in the collection is the <script> element itself (tags and contents), and a node is removed by asking its parent to drop it via RemoveChild(). Passing false as the second argument means the script's own child nodes are thrown away too, rather than being kept in its place.

Each node has a WriteContentTo() method which can write the HTML for that section of the document to a TextWriter. What's really nice about this entire library (besides how simple it is to use) is that it doesn't seem to mangle the existing HTML when using WriteContentTo() (at least from what I've seen). Only one minor complaint - the docs are a bit weak. They just include the standard documentation of the classes, without much in the way of examples. However, the API is pretty consistent, so it doesn't take much to get started with it.
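
To round things out, here is a small sketch of the same pattern applied to image links, ending with WriteContentTo() to get the modified HTML back out. It is only an illustration: the class name, the RewriteImages method, and the "/content/" prefix are hypothetical, and it loads from a string via LoadHtml() rather than from a file so it can stand alone.

```csharp
using System;
using HtmlAgilityPack;

public static class ImageLinkSketch
{
    // Hypothetical prefix; the real site's URL scheme isn't shown in this post.
    private const string SitePrefix = "/content/";

    public static string RewriteImages(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select only <img> elements that actually carry a src attribute.
        HtmlNodeCollection imageNodes = doc.DocumentNode.SelectNodes("//img[@src]");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (imageNodes != null)
        {
            foreach (HtmlNode imageNode in imageNodes)
            {
                string src = imageNode.GetAttributeValue("src", "");

                // Leave images hosted elsewhere alone.
                if (src.StartsWith("http://", StringComparison.OrdinalIgnoreCase) ||
                    src.StartsWith("https://", StringComparison.OrdinalIgnoreCase))
                    continue;

                imageNode.SetAttributeValue("src", SitePrefix + src);
            }
        }

        // With no arguments, WriteContentTo() returns the node's inner HTML as a string.
        return doc.DocumentNode.WriteContentTo();
    }
}
```

Calling ImageLinkSketch.RewriteImages() on a fragment containing an img tag with src="images/logo.png" should give back the same markup with the src pointing at /content/images/logo.png. In a real importer the new value would come from a lookup like the one in the link code above, not from a fixed prefix.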

What a great library - it couldn't be simpler. It saved me a ton of time.

Links:

http://www.codeplex.com/htmlagilitypack
http://regexlib.com/
http://www.devincook.com/goldparser/

Tuesday, August 12, 2008 9:17:44 PM (Eastern Standard Time, UTC-05:00)

.NET | C# | Developer Tools
 Wednesday, October 17, 2007

I’ve been trying to get the (highly regarded) CopySourceAsHtml add-in to work under Visual Studio to make it easier to post code samples. Every time I tried loading it, it was failing with:

The Add-in 'CopySourceAsHtml' failed to load or caused an exception.
Would you like to remove this Add-in?
If you choose yes, the file it was loaded from, '\\wtz-srv-dc01\Users\paul.mrozowski\My Documents\Visual Studio 2005\Addins\CopySourceAsHtml.AddIn', will be renamed.
Error Message:
Error number: 8013150a

I tried downloading the source, recompiling, and making some suggested changes to the control references. Nothing seemed to help. As soon as I attempted to activate the add-in, it failed. Then it suddenly occurred to me: what if it’s related to the fact that “My Documents” is on a network drive? To test this idea, I created a new local folder containing a copy of the add-in, then went into Tools > Options > Environment > Add-in/Macros Security and added that folder to the list. I exited VS and restarted. Finally, I went to Tools > Add-ins and reselected the CopySourceAsHtml add-in. No errors!

[Screenshot: options]

I copied and pasted some code, selecting Override for the font to add Courier New as a secondary font. Some random code is shown below.

 

        /// <summary>
        /// Process events from the grid
        /// </summary>
        /// <param name="source"></param>
        /// <param name="e"></param>
        /// <developer>Paul Mrozowski</developer>
        /// <created>10/16/2007</created>
        protected void grdTemplates_ItemCommand(object source, Telerik.WebControls.GridCommandEventArgs e)
        {
            if (e.CommandName == "Edit")
                this.EditRecord(e.Item);
            else if (e.CommandName == "Cancel")
                this.grdTemplates.EditIndexes.Clear();
            else if (e.CommandName == "Add")
                this.AddRecord();
            else if (e.CommandName == "Test")
                this.TestQuery(e);
            else if (e.CommandName == "Update")
                this.UpdateRoles(e);
        }

 

Here's a screenshot as a comparison:

 

[Screenshot: htmlComparison]

 

 

Links:

CopySourceAsHtml: http://www.jtleigh.com/people/colin/software/CopySourceAsHtml/
Getting it to run under VS 2008 Beta 2: http://diditwith.net/2007/08/16/CopySourceAsHtmlInVisualStudio2008.aspx

Wednesday, October 17, 2007 9:49:58 PM (Eastern Standard Time, UTC-05:00)

Developer Tools | Visual Studio


Disclaimer
The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.

© Copyright 2008
Paul Mrozowski / RCS Solutions, Inc.