I need to parse my HTML page to replace some links. This is the form of a link: Mauris nec
The problem is that my regex doesn't end properly. I think it's because of the ".
This is my regex:
Regex r= new Regex("(.*)");
That regex doesn't stop after each link: the third group doesn't contain just the title attribute, but almost all of the HTML up to the end of my page.
I tested it with this site:
http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
So, why doesn't the third group end directly after Bas-Rhin"?
Answer
Regex r= new Regex("(.*)");
doesn't work as expected because quantifiers such as * are greedy by default: they match as much as they possibly can, and give characters back only when that is needed for the rest of the pattern to match.
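To see the greedy behavior in isolation, here is a minimal sketch. The input string and the simplified pattern are made up for illustration (they are not the asker's real markup or regex), and the code assumes a .NET version that supports C# top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input: two links on one line (not the asker's real markup).
string html = "<a href=\"/one\" title=\"Bas-Rhin\">Mauris nec</a> and " +
              "<a href=\"/two\" title=\"Other\">Second</a>";

// Greedy: (.*) first runs to the end of the line, then backtracks
// only as far as the LAST double quote it can find.
Match greedy = Regex.Match(html, "title=\"(.*)\"");

// The capture therefore spills over into the second link.
Console.WriteLine(greedy.Groups[1].Value);
```

With this input the group does not stop after Bas-Rhin: it runs on to the last double quote on the line, which belongs to the second link.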
There are several ways to solve the problem:
1 the most obvious:
make your quantifiers lazy by adding a question mark: (.*?)
2 the most efficient:
don't use the dot and use a negated character class instead. Example:
Regex r= new Regex("(.*?)");
The last (.*?) can be replaced by:
((?>[^<]+|<(?!/a>))*)
3 the most reasonable:
use Html Agility Pack or another HTML parser to extract all "a" tags. You can then check whether the href is what you want. (Note that with XPath you can perform this check directly in one step.)
XPath query example:
//a[contains(@href, '{localLink:')]
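
The two regex-based options (1 and 2) can be compared side by side with a short sketch. The input string, the href format and both full patterns are assumptions made for illustration; only the lazy quantifier, the negated character class and the atomic group come from the answer. The code assumes a .NET version that supports C# top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input: two links on one line (not the asker's real markup).
string html = "<a href=\"/{localLink:1}\" title=\"Bas-Rhin\">Mauris <b>nec</b></a> " +
              "<a href=\"/{localLink:2}\" title=\"Other\">Second</a>";

// Option 1: lazy quantifiers stop at the first place where the rest can match.
var lazy = new Regex("<a href=\"([^\"]*)\" title=\"(.*?)\">(.*?)</a>");

// Option 2: negated character classes for the attribute values, and an atomic
// group for the link body that accepts any "<" not starting the closing </a>.
var negated = new Regex(
    "<a href=\"([^\"]*)\" title=\"([^\"]*)\">((?>[^<]+|<(?!/a>))*)</a>");

foreach (Regex r in new[] { lazy, negated })
{
    foreach (Match m in r.Matches(html))
    {
        // Groups: 1 = href, 2 = title, 3 = link body.
        Console.WriteLine($"{m.Groups[1].Value} | {m.Groups[2].Value} | {m.Groups[3].Value}");
    }
}
```

On this input both patterns should find two separate links, with Bas-Rhin as the title of the first one. The second pattern avoids most backtracking because each piece can only match one way, which is why the answer calls it the most efficient.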