【Question title】: Extract text inside XML tags in Python (while avoiding <p> tags)
【Posted】: 2015-03-23 02:00:59
【Question description】:

I'm working with the NYT corpus in Python and trying to extract only the content of the "full_text" class from each .xml article file. For example:

<body.content>
      <block class="lead_paragraph">
        <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
      </block>
      <block class="full_text">
        <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
      </block>

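Ideally, I'd like to parse out just the string, yielding "LEAD: Two police officers responding to a reported robbery ...", but I'm not sure what the best approach is. Is this something that can be easily parsed with regex? If so, nothing I've tried seems to work.

Any suggestions would be greatly appreciated!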

【Question comments】:

    Tags: python regex xml


    【Solution 1】:

    You can also use the BeautifulSoup parser.

    >>> from bs4 import BeautifulSoup
    >>> s = '''<body.content>
          <block class="lead_paragraph">
            <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
          </block>
          <block class="full_text">
            <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
          </block>'''
    >>> soup = BeautifulSoup(s, 'html.parser')  # name the parser explicitly to avoid a warning
    >>> for i in soup.find_all('block', class_="full_text"):
    ...     print(i.text)
    
    
    
    LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.
    

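    【Comments】:

    • This is great! I was considering Beautiful Soup but couldn't quite figure out the syntax for that particular class. Thanks so much!
    • But findAll doesn't work this way in jsoup. I just tried it and got the error "invalid character constant" for 'block'. Do you happen to know which method to use?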
    • @AvinashRaj voila
    【Solution 2】:

    Is this something that can be easily parsed with regex?

    Don't!

    Use an XML parser such as lxml.

    ex = """
    <body.content>
          <block class="lead_paragraph">
            <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
          </block>
          <block class="full_text">
            <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
          </block>
    </body.content>"""
    
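    from lxml import etree
    root = etree.fromstring(ex)
    # Select the <p> inside the full_text block; a bare './block/p' would match the lead_paragraph block first.
    print(root.findtext('./block[@class="full_text"]/p'))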
    

    Output:

    LEAD: Two police officers responding to a reported robbery at a 
    Brooklyn tavern early yesterday were themselves held up by the robbers, who
    took their revolvers and herded them into a back room with patrons, the 
    police said.
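
For comparison, here is a minimal sketch of the same idea using only the standard library's xml.etree.ElementTree, assuming the full_text block may hold several <p> elements. The helper name `full_text_paragraphs` and the shortened sample text are made up for illustration and are not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Shortened, made-up sample with the same shape as the snippet in the question.
SAMPLE = """<body.content>
  <block class="lead_paragraph">
    <p>LEAD: first paragraph.</p>
  </block>
  <block class="full_text">
    <p>LEAD: first paragraph.</p>
    <p>Second paragraph.</p>
  </block>
</body.content>"""


def full_text_paragraphs(xml_string):
    """Return the text of every <p> inside <block class="full_text">."""
    root = ET.fromstring(xml_string)
    # The [@class='full_text'] predicate skips the lead_paragraph block.
    paragraphs = root.findall("./block[@class='full_text']/p")
    # itertext() also collects text nested in inline child tags, if any.
    return ["".join(p.itertext()).strip() for p in paragraphs]


print(full_text_paragraphs(SAMPLE))
```

Returning a list keeps paragraph boundaries intact; join the items with a newline or a space if a single string is needed.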
    

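    【Comments】:

    • Thanks. I can't quite get it to work in my code, but I'm sure the resource above is invaluable.
    • @rutrastone The assumption was made based on your XML snippet, but yes, the xpath may change depending on the whole document you are working with.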
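
One way to make the path less dependent on the surrounding document is a descendant search (`.//`), which finds the block at any depth. The sketch below assumes each article is a separate well-formed XML file; the wrapper elements (<nitf>, <body>), the file name, and the helper name `extract_full_text` are hypothetical and only stand in for whatever a complete article file contains.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical wrapper elements stand in for whatever surrounds
# <body.content> in a complete article file.
DOC = """<nitf><body><body.content>
  <block class="lead_paragraph"><p>LEAD: short lead.</p></block>
  <block class="full_text"><p>LEAD: short lead.</p><p>Rest of the article.</p></block>
</body.content></body></nitf>"""


def extract_full_text(path):
    """Join the <p> texts under <block class="full_text"> in one XML file."""
    root = ET.parse(path).getroot()
    # './/' searches all descendants, so the nesting depth does not matter.
    paragraphs = root.findall(".//block[@class='full_text']/p")
    return "\n".join("".join(p.itertext()).strip() for p in paragraphs)


# Demo: write the sample to a temporary file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "article.xml")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(DOC)
    result = extract_full_text(path)

print(result)
```

If the real files declare an XML namespace, the tag names in the path would need the namespace prefix as well; that case is not covered here.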