Sunday 30 October 2016

python - Regex include line breaks




I have the following xml file




A




B
C




D




Picture number 3?




and I just want to get the text between

and
.
So I've tried this code:



import re

# Read the whole file into one string so the pattern can span lines.
with open("2.xml", "r") as xml_file:
    text = xml_file.read()
lon = re.compile(r'
\n(.+)\n
', re.MULTILINE)
lon = lon.search(text).group(1)
print(lon)


but it doesn't seem to work.


Answer



1) Don't parse XML with regex. It just doesn't work. Use an XML parser.




2) If you do use regex for this, you don't want re.MULTILINE, which controls how ^ and $ work in a multiple-line string. You want re.DOTALL, which controls whether . matches \n or not.



3) You probably also want your pattern to return the shortest possible match, using the non-greedy +? quantifier. A greedy .+ combined with re.DOTALL would run on to the last closing tag in the file.



lon = re.compile(r'
\n(.+?)\n
', re.DOTALL)
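
To make points 2 and 3 concrete, here is a minimal sketch comparing the three variants. The tag name `<text>` and the sample string are assumptions made up for illustration; they are not taken from the asker's file.

```python
import re

# Hypothetical sample: the tag name <text> is an assumption for illustration.
sample = "<text>\nB\nC\n</text>\n<text>\nD\nE\n</text>\n"

# re.MULTILINE only changes how ^ and $ behave; '.' still stops at '\n',
# so a body that spans two lines never matches.
multiline = re.search(r"<text>\n(.+)\n</text>", sample, re.MULTILINE)

# re.DOTALL lets '.' match '\n', but the greedy '.+' runs on to the
# last closing tag in the string.
greedy = re.search(r"<text>\n(.+)\n</text>", sample, re.DOTALL)

# The non-greedy '.+?' stops at the first closing tag.
lazy = re.search(r"<text>\n(.+?)\n</text>", sample, re.DOTALL)

print(multiline)              # None
print(repr(greedy.group(1)))  # 'B\nC\n</text>\n<text>\nD\nE'
print(repr(lazy.group(1)))    # 'B\nC'
```

With the non-greedy pattern, `re.findall` would return each body separately instead of one oversized match.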

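For point 1, a parser-based version might look like the sketch below. It uses the standard library's `xml.etree.ElementTree`; the element names `<items>` and `<text>` and the inline document are hypothetical stand-ins, since the real structure of 2.xml is not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the element names <items> and <text> are
# assumptions, not the structure of the asker's 2.xml.
xml_source = """<items>
  <text>A</text>
  <text>
B
C
</text>
  <text>D</text>
</items>"""

root = ET.fromstring(xml_source)

# iter() walks every <text> element in document order;
# strip() drops the newlines surrounding each body.
texts = [(element.text or "").strip() for element in root.iter("text")]
print(texts)  # ['A', 'B\nC', 'D']
```

To read from a file instead of a string, `ET.parse("2.xml").getroot()` returns the same kind of root element. The parser handles line breaks inside elements without any special flags.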