Wednesday 19 October 2016

Python, regex and html: match final tag on line



I'm confused about Python's greedy/non-greedy quantifiers.



"Given multi-line html, return the final tag on each line."




I would think this would be correct:



re.findall('<.*?>$', html, re.MULTILINE)


I'm irked because I expected a list of single tags like:



"", "
    ", "".



My O'Reilly's Pocket Reference says that *? will "match 0 or more times, but as few times as possible."



So why am I getting 'greedier' matches, i.e., more than one tag in some (but not all) matches?


Answer



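Your problem stems from the fact that you have an end-of-line anchor ('$'). The engine always tries the leftmost possible starting point first, which in your case is the first '<' on the line. From there, the non-greedy '.*?' extends one character at a time until the rest of the pattern can match, and the rest of the pattern is a '>' that you have constrained, with the $ anchor, to be at the end of the line. Because '.' also matches '<' and '>', the match runs straight through any intermediate tags. So a non-greedy * is not any different from a greedy * in this situation: both return everything from the first '<' on the line to the '>' that ends it, which is a single tag only when the final tag is also where the first '<' appears.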
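To see this concretely, here is a small sketch. The sample html string is made up for illustration, since the question does not show its input; only the two patterns come from the discussion.

```python
import re

# Hypothetical sample input (not from the original question).
html = "<p><b>bold</b></p>\n<li>item</li>\nplain text <br>"

# The non-greedy pattern from the question.
lazy = re.findall(r'<.*?>$', html, re.MULTILINE)

# The same pattern with a greedy quantifier, for comparison.
greedy = re.findall(r'<.*>$', html, re.MULTILINE)

print(lazy)    # ['<p><b>bold</b></p>', '<li>item</li>', '<br>']
print(greedy)  # identical to the lazy result
```

Only the third line yields a single tag, because there the first '<' on the line happens to belong to the final tag.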



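Since you cannot remove the '$' from your RE (you are looking for the final tag on a line), you will need to take a different tack; see @Mark's answer. '<[^><]*>$' will work, because the character class cannot cross another '<' or '>', so the match has to start at the '<' of the last tag.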
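A minimal sketch of that pattern in use, again on a made-up sample string (the question's real input is not shown):

```python
import re

# Hypothetical sample input (not from the original question).
html = "<p><b>bold</b></p>\n<li>item</li>\nplain text <br>"

# [^><]* cannot cross another '<' or '>', so only the last tag
# on each line can satisfy the trailing '>$'.
last_tags = re.findall(r'<[^><]*>$', html, re.MULTILINE)

print(last_tags)  # ['</p>', '</li>', '<br>']
```

Two caveats: a line that does not end in '>' produces no match at all, and because a negated character class also matches newlines, a tag whose attributes wrap across lines would be matched as one multi-line tag.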

