Extract text inside XML tags in Python (while avoiding <p> tags): answers

【Question title】: Extract text inside XML tags in Python (while avoiding <p> tags)
【Posted】: 2015-03-23 02:00:59
【Question description】:

I am working with the NYT corpus in Python and am trying to extract only the content of the "full_text" class from each .xml article file. For example:

<body.content>
      <block class="lead_paragraph">
        <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
      </block>
      <block class="full_text">
        <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
      </block>

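Ideally, I would like to parse out just the string, yielding "LEAD: Two police officers responding to a reported robbery...", but I'm not sure what the best approach is. Is this something that can be easily parsed with a regular expression? If so, nothing I have tried seems to work.

Any advice would be greatly appreciated!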

【Question discussion】:

Tags: python regex xml

【Solution 1】:

You can also use the BeautifulSoup parser.

>>> from bs4 import BeautifulSoup
>>> s = '''<body.content>
      <block class="lead_paragraph">
        <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
      </block>
      <block class="full_text">
        <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
      </block>'''
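>>> soup = BeautifulSoup(s, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning
>>> for i in soup.find_all('block', class_="full_text"):  # find_all is the current name of findAll
...     print(i.text)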



LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.

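【Discussion】:

This is great! I was considering using Beautiful Soup, but couldn't quite figure out the syntax for that particular class. Thanks a lot!
But findAll doesn't work this way in jsoup. I just tried it and got the error "invalid character constant" for 'block'. Do you happen to know which method to use?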
@AvinashRaj voila

【Solution 2】:

Is this something that can be easily parsed with a regular expression?

Don't!

Use an XML parser such as lxml.

ex = """
<body.content>
      <block class="lead_paragraph">
        <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
      </block>
      <block class="full_text">
        <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
      </block>
</body.content>"""

from lxml import etree
ex = etree.fromstring(ex)
# findtext returns the text of the first element matching the path
print(ex.findtext('./block/p'))

Output:

LEAD: Two police officers responding to a reported robbery at a 
Brooklyn tavern early yesterday were themselves held up by the robbers, who
took their revolvers and herded them into a back room with patrons, the 
police said.
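
Note that the path './block/p' matches the first <p> under any <block>, which in this snippet is the one inside lead_paragraph; the output looks right only because both blocks carry the same text. Below is a minimal sketch of restricting the match to the full_text block. It swaps in the standard library's xml.etree.ElementTree for lxml (both accept this attribute predicate), and the sample document is a shortened, made-up stand-in for the snippet above so that the two blocks differ.

```python
import xml.etree.ElementTree as ET

# Shortened, illustrative stand-in for the article snippet; the blocks
# differ here so it is visible which one gets selected.
sample = """
<body.content>
  <block class="lead_paragraph">
    <p>LEAD: short lead text.</p>
  </block>
  <block class="full_text">
    <p>LEAD: short lead text.</p>
    <p>Second paragraph of the article.</p>
  </block>
</body.content>"""

root = ET.fromstring(sample)

# [@class='full_text'] keeps only <block> elements whose class attribute
# has exactly that value; findall returns every matching <p>, not just the first.
paragraphs = [p.text for p in root.findall("./block[@class='full_text']/p")]
full_text = "\n".join(paragraphs)
print(full_text)
```

Using findall rather than findtext matters once a full_text block holds more than one paragraph, since findtext stops at the first match.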

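【Discussion】:

Thanks. I couldn't quite get it to work in my code, but I'm sure the resource above is invaluable.
@rutrastone The assumption was made based on your XML snippet, but yes, the XPath may change depending on the whole document you are working with.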
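
One way to make the path less dependent on the surrounding document is to search at any depth with './/'. The sketch below again uses the standard library's xml.etree.ElementTree; the helper name extract_full_text and the wrapper elements around <body.content> are assumptions for illustration, and the real NYT files may be laid out differently.

```python
import xml.etree.ElementTree as ET


def extract_full_text(xml_string):
    """Return the paragraph texts of every full_text block, at any depth."""
    root = ET.fromstring(xml_string)
    # './/' searches all descendants, so the path does not depend on how
    # deeply <body.content> is nested in the complete article file.
    blocks = root.findall(".//block[@class='full_text']")
    return [
        # itertext() also collects text inside inline child tags of <p>.
        "".join(p.itertext()).strip()
        for block in blocks
        for p in block.findall("p")
    ]


# Hypothetical full-document layout, made up for this example.
article = """<nitf>
  <head><title>Example</title></head>
  <body>
    <body.content>
      <block class="lead_paragraph"><p>Lead only.</p></block>
      <block class="full_text"><p>First <em>full</em> paragraph.</p><p>Second.</p></block>
    </body.content>
  </body>
</nitf>"""

print(extract_full_text(article))
```

For files on disk, ET.parse(path).getroot() gives the same kind of root element, so the findall call can be applied unchanged.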