html - RegEx match open tags except XHTML self-contained tags

Wednesday, 23 November 2016


I agree that the right tool to parse XML, and especially HTML, is a parser and not a regular expression engine. However, as others have pointed out, a regex is sometimes quicker and easier, and it gets the job done if you know the data format.

Microsoft has a Best Practices for Regular Expressions in the .NET Framework article that specifically discusses Consider[ing] the Input Source.

Regular Expressions do have limitations, but have you considered the following?

The .NET framework is unique when it comes to regular expressions in that it supports Balancing Group Definitions.

For this reason, I believe you CAN parse XML using regular expressions. Note, however, that it must be valid XML (browsers are very forgiving of HTML and allow bad XML syntax inside HTML). This is possible because the "Balancing Group Definition" lets the regular expression engine act as a pushdown automaton (PDA).

Quote from article 1 cited above:

.NET Regular Expression Engine

As described above properly balanced constructs cannot be described by
a regular expression. However, the .NET regular expression engine
provides a few constructs that allow balanced constructs to be
recognized.

(?<group>) - pushes the captured result on the capture stack with
the name group.

(?<-group>) - pops the top most capture with the name group off the
capture stack.

(?(group)yes|no) - matches the yes part if there exists a group
with the name group otherwise matches no part.

These constructs allow for a .NET regular expression to emulate a
restricted PDA by essentially allowing simple versions of the stack
operations: push, pop and empty. The simple operations are pretty much
equivalent to increment, decrement and compare to zero respectively.
This allows for the .NET regular expression engine to recognize a
subset of the context-free languages, in particular the ones that only
require a simple counter. This in turn allows for the non-traditional
.NET regular expressions to recognize individual properly balanced
constructs.
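The counter analogy in the quote can be sketched outside of .NET. The Python below is only an illustration under my own assumptions: the function name is_balanced and the use of parentheses as the balanced construct are hypothetical and do not come from the cited article. It shows push, pop and the empty test reduced to increment, decrement and compare to zero.

```python
def is_balanced(text: str, open_ch: str = "(", close_ch: str = ")") -> bool:
    """Check one kind of bracket with a single counter instead of a stack."""
    depth = 0
    for ch in text:
        if ch == open_ch:
            depth += 1              # "push"
        elif ch == close_ch:
            if depth == 0:          # "pop" on an empty stack: not balanced
                return False
            depth -= 1              # "pop"
    return depth == 0               # "empty" test at the end
```

A single counter is enough here because only one kind of bracket is tracked, which is the restricted case the quote describes.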

Consider the following regular expression:

(?=)
(?>
   <!-- .*? -->                  |
   <[^>]*/>                      |
   (?<opentag><(?!/)[^>]*[^/]>)  |
   (?<-opentag></[^>]*[^/]>)     |
   [^<>]*
)*
(?(opentag)(?!))

Use the flags:

Singleline

IgnorePatternWhitespace (not necessary if you collapse regex and remove all whitespace)

IgnoreCase (not necessary)
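To see what the first two flags do, here is a small sketch in Python rather than .NET. Python's re.DOTALL and re.VERBOSE play the roles of Singleline and IgnorePatternWhitespace. The comment pattern and the sample text are my own assumptions, chosen only to show the effect of each flag.

```python
import re

# Assumed comment pattern, written with spaces for readability.
COMMENT = r"<!-- .*? -->"

# Hypothetical input: a comment that spans two lines.
text = "<!-- first line\nsecond line -->"

# VERBOSE ignores the spaces in the pattern; DOTALL lets '.' cross the newline.
both = re.compile(COMMENT, re.DOTALL | re.VERBOSE)

# Without DOTALL, '.' stops at the newline, so the comment is never closed.
verbose_only = re.compile(COMMENT, re.VERBOSE)

matched_with_both = both.fullmatch(text) is not None
matched_verbose_only = verbose_only.search(text) is not None
```

This is why Singleline matters for markup with multi-line comments, while the whitespace flag only matters as long as the pattern is kept in its spaced-out form.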

Regular Expression Explained (inline)

(?=)                                       # lookahead: match must start with the chosen opening tag
(?>                                        # atomic group / don't backtrack (faster)
   <!-- .*? -->                 |          # match xml / html comment
   <[^>]*/>                     |          # self closing tag
   (?<opentag><(?!/)[^>]*[^/]>) |          # push opening xml tag
   (?<-opentag></[^>]*[^/]>)    |          # pop closing xml tag
   [^<>]*                                  # something between tags
)*                                         # match as many xml tags as possible
(?(opentag)(?!))                           # ensure no 'opentag' groups are on stack
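For readers without a .NET engine at hand, the same logic can be approximated in Python, whose re module has no balancing groups. This is a sketch under my own assumptions, not the .NET behaviour itself: the names TOKEN and balanced_span are hypothetical, the stack is replaced by a plain counter, and the sample string in the usage note is made up. It returns None where the .NET pattern would produce an empty match.

```python
import re
from typing import Optional

# One alternative per branch of the pattern; the named group tells us which fired.
TOKEN = re.compile(
    r"""
    (?P<comment><!--.*?-->)        |   # xml / html comment
    (?P<selfclose><[^>]*/>)        |   # self closing tag
    (?P<open><(?!/)[^>]*[^/]>)     |   # opening tag: counts as a push
    (?P<close></[^>]*[^/]>)        |   # closing tag: counts as a pop
    (?P<text>[^<>]+)                   # something between tags
    """,
    re.DOTALL | re.VERBOSE,
)


def balanced_span(source: str, start: int) -> Optional[str]:
    """Return the longest balanced fragment beginning at `start`, or None."""
    depth = 0
    pos = start
    end_balanced = None
    while pos < len(source):
        m = TOKEN.match(source, pos)
        if m is None:
            break                       # stray '<' or '>': stop scanning
        if m.lastgroup == "open":
            depth += 1
        elif m.lastgroup == "close":
            if depth == 0:
                break                   # nothing to pop: the fragment ended earlier
            depth -= 1
        pos = m.end()
        if depth == 0:
            end_balanced = pos          # counter is zero, so this prefix is balanced
    return source[start:end_balanced] if end_balanced is not None else None
```

For example, with the made-up input '<div><ul id="x"><li>a</li></ul></div>' and start pointing at '<ul id="x">', the function stops at the closing '</div>' because there is nothing left to pop, and returns only the ul element.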

You can try this at A Better .NET Regular Expression Tester.

I used the sample source of:




   
   
      stuff...

      more stuff

      
          
               still more
               
                    Another >ul<, oh my!

                    ...

This found the match:

   
      stuff...

      more stuff

      
          
               still more
               
                    Another >ul<, oh my!

                    ...

although it actually came out like this:

           stuff...
           more stuff
                                              still more                                             Another >ul<, oh my!
                         ...

Lastly, I really enjoyed Jeff Atwood's article Parsing Html The Cthulhu Way. Funnily enough, it cites the answer to this question that currently has over 4k votes.
