A day in the life (of a developer) RSS 2.0
 Tuesday, August 12, 2008

I ran across a really cool .NET library on a recent project I've been working on. We have an internal website where we post news, documentation, etc. - basically a Content Management System (CMS). We're working on a new set of documentation that is being done inside of a third party help builder application. We need to import the HTML files it generates into our website (so we get all the things it offers, like security, searching, revision tracking, view statistics, etc.). So basically, I need to run through a lot of HTML files, build a tree of the documents (similar to the help file) and rewrite all of the URL's and image links to point to the correct URL inside of the site. I initially started looking at various regular expressions that I might be able to use over at http://regexlib.com/. Almost every single one of them had some comment about it failing under some circumstances. The HTML is surprisingly clean, but I was still nervous about it. So I looked at using GOLD to parse the HTML. However, from some of the comments I found it still didn't make everything as easy I would have liked. I finally ran across HtmlAgilityPack over on CodePlex . It's a .NET library which lets you read AND write changes to an HTML file via a simple API.

Here's a chunk of code from my importer so you can get a feel for how it works:

HtmlDocument doc = new HtmlDocument();

doc.Load(content.FullDocumentPath);

HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//a/@href");

 

Content match = null;

// Run only if there are links in the document.

if (linkNodes != null)

{

    // Fix up the URL's

    foreach (HtmlNode linkNode in linkNodes)

    {

        HtmlAttribute attrib = linkNode.Attributes["href"];

        // If it's an internal page anchor, ignore it

        if (attrib.Value.StartsWith("#"))

            continue;

 

        string path = this.GetAbsolutePath(content.DocumentLink, attrib.Value);

        match = this.m_contentList.Find(p => p.DocumentLink == path);

 

        if (match != null)

            attrib.Value = match.GetUrl();

        else if (!path.ToLower().StartsWith("http://") && !path.ToLower().StartsWith("mailto:"))

            Console.WriteLine("Cannot find matching document, searched for " + path);                       

    }

}

 

Basically, doc.DocumentNode.SelectNodes("//a/@href") returns a collection of links in the document (it uses XPath syntax for the selection string). From there, I just iterate through them, build the new URL, then save the modified Url via code that just does: linkNode.Attributes["href"].Value = "New URL Here". I also needed to strip out all the script tags inside of the document, so it uses similar syntax:

private void StripOutScripts(HtmlDocument doc)

{

    // Strip out the scripts

    HtmlNodeCollection scriptNodes = doc.DocumentNode.SelectNodes("//script");

    if (scriptNodes != null)

    {

        foreach (HtmlNode scriptNode in scriptNodes)

        {

            scriptNode.ParentNode.RemoveChild(scriptNode, false);

        }

    }

}

 

I do the same sort of thing - iterate over the collection, except this time tell it to remove the nodes from the document (note that I'm grabbing the parent node, since the current node is everything contained within the script, excluding the <script> tags. By getting the parent, we get that and the tags themselves.

Each collection has a WriteContentTo() method which can write the HTML for that section of the document to a Stream. What's really nice about this entire library (besides how simple it was to use) was the fact that it doesn't seem to mangle the existing HTML when using WriteContentTo() (at least from what I've seen). Only one minor complaint - the docs are a bit weak. It just includes the standard documentation of the classes, not much in the way of examples. However, it's pretty consistent so it doesn't take much to get started with it.

What a great library - it couldn't be simpler. It saved me a ton of time.

Links:

http://www.codeplex.com/htmlagilitypack
http://regexlib.com/
http://www.devincook.com/goldparser/

Tuesday, August 12, 2008 9:17:44 PM (Eastern Standard Time, UTC-05:00)  #    Comments [0] -

.NET | C# | Developer Tools
 Wednesday, November 14, 2007

This falls into the "hey, I didn't realize this was available" category. In the 2.0 version of the .NET framework, support was added for nullable types, I'm guessing primarily for improved interaction with databases. Previously, nulls were a real pain to deal with; now they're only a bit more work (the pains not completely gone, but it's not as bad anymore). Lets say you've got a date time field in your database, it can have an actual date filled in or it can be null. In prior versions, you had to rely on "magic" dates to do translations when loading and saving your data (eg. 1/1/1900 is really null). Now you can just define a nullable type by prefixing the type with a question mark ?

 

DateTime? activeStart = (DateTime?)row["ActiveStart"];

 

That's not the point of this post though. They've added a nice operator ?? - so let's suppose I've got a calendar control that returns null if no date has been selected. I'd like to default to the current date if that's the case. I can write code like this:

 

DateTime start = System.DateTime.Now;

if (this.startCalendar.SelectedDate != null)
   start = this.startCalendar.SelectedDate;

 

Or I can use the ?? operator:

 

DateTime start = this.startCalendar.SelectedDate ?? DateTime.Now;

 

That's much simpler code to understand (at least to me). I like this a bit better than the normal ternary "if" support, which I always find a bit difficult to read:

string sample = a==b ? "A=B" : "A<>B" 

Maybe it will grow on me eventually...
Wednesday, November 14, 2007 10:11:38 PM (Eastern Standard Time, UTC-05:00)  #    Comments [0] -

C#


Navigation
Archive
<October 2008>
SunMonTueWedThuFriSat
2829301234
567891011
12131415161718
19202122232425
2627282930311
2345678
About the author/Disclaimer

Disclaimer
The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.

© Copyright 2008
Paul Mrozowski / RCS Solutions, Inc.
Sign In
Statistics
Total Posts: 57
This Year: 32
This Month: 0
This Week: 0
Comments: 21
All Content © 2008, Paul Mrozowski / RCS Solutions, Inc.
DasBlog theme 'Business' created by Christoph De Baene (delarou)