I need to parse an XML file to extract some data.
I only need some elements with certain attributes. Here's an example document:
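<root>
  <articles>
    <article type="news"><content>some text</content></article>
    <article type="info"><content>some text</content></article>
    <article type="news"><content>some text</content></article>
  </articles>
</root>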
Here I would like to get only the articles with the type "news".
What's the most efficient and elegant way to do it with lxml?
I tried with the find method but it's not very nice:
from lxml import etree
f = etree.parse("myfile")
root = f.getroot()
articles = root.getchildren()[0]
article_list = articles.findall('article')
for article in article_list:
    if "type" in article.keys():
        if article.attrib['type'] == 'news':
            content = article.find('content')
            content = content.text
Answer
You can use XPath, e.g. root.xpath("//article[@type='news']")
This XPath expression will return a list of all article elements whose "type" attribute has the value "news".
To get just the text content, you can extend the xpath like so:
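root = etree.fromstring("""
<root>
  <articles>
    <article type="news"><content>some text</content></article>
    <article type="info"><content>some text</content></article>
    <article type="news"><content>some text</content></article>
  </articles>
</root>
""")
print(root.xpath("//article[@type='news']/content/text()"))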
This will output ['some text', 'some text'].
If you just want the content elements, the expression would be "//article[@type='news']/content", and so on.
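As a rough sketch of the element-based variant: the example below assumes a document shaped like the one in the question (the sample contents, the "info" type, and the helper name contents_by_type are made up for illustration). It passes the attribute value as an XPath variable through a keyword argument to xpath(), which lxml supports, instead of formatting it into the expression string.

```python
from lxml import etree

# Hypothetical sample document; element names follow the question's code.
SAMPLE = """
<root>
  <articles>
    <article type="news"><content>first</content></article>
    <article type="info"><content>second</content></article>
    <article type="news"><content>third</content></article>
  </articles>
</root>
"""


def contents_by_type(root, article_type):
    """Return the text of each <content> under articles of the given type."""
    # $t is an XPath variable; lxml binds it from the keyword argument t.
    elements = root.xpath("//article[@type=$t]/content", t=article_type)
    return [element.text for element in elements]


root = etree.fromstring(SAMPLE)
news = contents_by_type(root, "news")
print(news)
```

Working with the content elements rather than text() keeps the elements available if you later need their attributes or children, not just their text.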