I have the following XML file:
A
B
C
D
Picture number 3?
and I just want to get the text between and
.
So I've tried this code:
import re
# Read the whole file; the with-block closes it afterwards.
with open("2.xml", "r") as f:
    text = f.read()
lon = re.compile(r'\n(.+)\n', re.MULTILINE)
lon = lon.search(text).group(1)
print(lon)
but it doesn't seem to work.
Answer
1) Don't parse XML with regex. It breaks on nesting, attributes, comments, and CDATA sections. Use an XML parser.
2) If you do use regex for this, you don't want re.MULTILINE, which controls how ^ and $ work in a multiple-line string. You want re.DOTALL, which controls whether . matches \n or not.
3) You probably also want your pattern to return the shortest possible match, using the non-greedy +? operator.
lon = re.compile(r'\n(.+?)\n', re.DOTALL)
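The three points above can be sketched in one short script. The tag names from the question did not survive, so the sample markup here (`<root>`, `<item>`, `<description>`, `<d>`) is purely hypothetical. Treat this as an illustration under those assumptions, not as the asker's actual file.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: the tag names <root>, <item> and <description> are
# assumptions, not the markup from the question.
SAMPLE = """<root>
<item>A</item>
<item>B</item>
<description>
Picture
number 3?
</description>
</root>"""

# Point 1: an XML parser hands back the element text directly.
root = ET.fromstring(SAMPLE)
parsed = root.find("description").text.strip()

# Point 2: without re.DOTALL, "." stops at newlines, so a body spanning
# several lines cannot match. re.MULTILINE only affects ^ and $.
pattern = r"<description>\n(.+?)\n</description>"
multiline_match = re.search(pattern, SAMPLE, re.MULTILINE)
dotall_match = re.search(pattern, SAMPLE, re.DOTALL)

# Point 3: with re.DOTALL, greedy ".+" runs on to the last closing tag,
# while non-greedy ".+?" stops at the first one.
TWO = "<d>\none\n</d>\n<d>\ntwo\n</d>"
greedy = re.search(r"<d>\n(.+)\n</d>", TWO, re.DOTALL).group(1)
lazy = re.search(r"<d>\n(.+?)\n</d>", TWO, re.DOTALL).group(1)

print(repr(parsed))
print(multiline_match)
print(repr(dotall_match.group(1)))
print(repr(greedy))
print(repr(lazy))
```

The parser version needs no pattern at all and keeps working if attributes or whitespace around the tags change, which is the main reason to prefer it over the regex.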