How do I parse HTML in C# using Regular Expressions?

I ran into an interesting problem yesterday. All I wanted to do was some simple tokenizing of an HTML fragment (article content from the website I work on), splitting it on certain elements. At first it looked like a simple string manipulation job, but it quickly blew out of proportion into something crazy.

In short: never, ever use regexes or any other ad hoc string matching to parse HTML. To do so is to descend into madness. HTML is not plain text; it has structure. Even simple fragments contain nested elements, and flat token matching cannot tell you which closing tag belongs to which opening tag.
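To make the problem concrete, here is a minimal sketch of the kind of thing that goes wrong. The fragment, the pattern, and the `FirstDivContent` helper are all made up for illustration; they are not the code from my actual project.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexNestingDemo
{
    // Hypothetical helper: a naive attempt to grab the content of the first <div>.
    public static string FirstDivContent(string html)
    {
        // The lazy quantifier stops at the FIRST closing tag it finds,
        // with no idea which opening tag that closing tag belongs to.
        Match match = Regex.Match(html, "<div>(.*?)</div>", RegexOptions.Singleline);
        return match.Success ? match.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        // Illustrative fragment with one <div> nested inside another.
        const string fragment = "<div>outer <div>inner</div> tail</div>";

        // Prints: outer <div>inner
        Console.WriteLine(FirstDivContent(fragment));
    }
}
```

The match ends at the inner `</div>`, so the "content" is cut off halfway through the outer element. Making the quantifier greedy only moves the problem: it then swallows everything up to the last `</div>` in the input, including any sibling elements.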

Always, always, always use an HTML parser, such as the Html Agility Pack. That is what I did. The parser built the document and tokenized everything for me, and from there I could walk and manipulate the node tree to meet our requirements.
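For comparison, here is a rough sketch of the parser approach. It assumes the HtmlAgilityPack NuGet package is referenced; the fragment and the `ParagraphTexts` helper are hypothetical, not the requirements we actually had.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

public static class FragmentTokenizer
{
    // Hypothetical helper: returns the text of every <p> element in a fragment,
    // however deeply it is nested.
    public static List<string> ParagraphTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // accepts fragments as well as full documents

        // Descendants walks the node tree, so nesting is handled by the parser.
        return doc.DocumentNode
                  .Descendants("p")
                  .Select(p => p.InnerText)
                  .ToList();
    }

    public static void Main()
    {
        // Illustrative fragment, not real article content.
        const string fragment =
            "<p>First <b>bold</b> paragraph.</p><div><p>Nested paragraph.</p></div>";

        foreach (string text in ParagraphTexts(fragment))
        {
            Console.WriteLine(text);
        }
    }
}
```

Because the code works on nodes rather than on raw text, it does not care whether a paragraph sits at the top level or inside a `<div>`, and inline markup such as `<b>` is simply part of the node's children.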

Never use anything else if you can avoid it.
