【Question title】: Extracting text from XML using Python
【Posted】: 2022-01-13 23:18:15
【Question description】:

I have this sample XML file:

<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
</page>

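I would like to extract the contents of the title tags and the content tags.

Which method is better for extracting the data: pattern matching or the xml module? Or is there a better way to extract the data?

【Question comments】:

    Tags: python xml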


    【Solution 1】:

    Python already ships with a built-in XML library, namely ElementTree. For example:

    >>> from xml.etree import ElementTree as ET
    >>> xmlstr = """
    ... <root>
    ... <page>
    ...   <title>Chapter 1</title>
    ...   <content>Welcome to Chapter 1</content>
    ... </page>
    ... <page>
    ...  <title>Chapter 2</title>
    ...  <content>Welcome to Chapter 2</content>
    ... </page>
    ... </root>
    ... """
    >>> root = ET.fromstring(xmlstr)
    >>> for page in list(root):
    ...     title = page.find('title').text
    ...     content = page.find('content').text
    ...     print('title: %s; content: %s' % (title, content))
    ...
    title: Chapter 1; content: Welcome to Chapter 1
    title: Chapter 2; content: Welcome to Chapter 2
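
One caveat: the sample in the question has no single root element, and ElementTree rejects such input with a ParseError. The sketch below wraps the fragment in a root element before parsing, which is what the session above does by hand. The wrapper tag name `root` and the variable names are assumptions chosen for illustration.

```python
from xml.etree import ElementTree as ET

# The fragment from the question: two top-level <page> elements, no root.
fragment = """
<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
  <title>Chapter 2</title>
  <content>Welcome to Chapter 2</content>
</page>
"""

# Parsing the fragment directly fails: XML allows only one root element.
try:
    ET.fromstring(fragment)
    parsed_directly = True
except ET.ParseError:
    parsed_directly = False

# Wrapping it in an arbitrary root tag makes it well-formed.
root = ET.fromstring("<root>" + fragment + "</root>")
pages = [(p.find("title").text, p.find("content").text)
         for p in root.findall("page")]
print(parsed_directly, pages)
```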
    

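    【Comments】:

    • @SudeepKodavati: If you find santa's answer satisfactory, please "accept" it.
    • I like this interface: you can index into child tags with root[0][1][0]..., and get an iterator over all child nodes from any node! list(root[0][1].itertext()) is super handy!
    • cElementTree is no longer needed on supported Python versions (3.3+); use ElementTree instead.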
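
The indexing and itertext() calls mentioned in the comments can be sketched as follows, assuming the same two-page document. It is written without whitespace between tags here so that itertext() yields only the chapter text; with the indented sample it would also yield the whitespace between elements.

```python
from xml.etree import ElementTree as ET

root = ET.fromstring(
    "<root>"
    "<page><title>Chapter 1</title><content>Welcome to Chapter 1</content></page>"
    "<page><title>Chapter 2</title><content>Welcome to Chapter 2</content></page>"
    "</root>"
)

# Elements behave like sequences of their child elements.
first_title = root[0][0].text       # first <page>, first child (<title>)
second_content = root[1][1].text    # second <page>, second child (<content>)

# itertext() yields every text fragment below a node, in document order.
first_page_text = list(root[0].itertext())

print(first_title, second_content, first_page_text)
```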
    【Solution 2】:

    Code:

    from xml.etree import ElementTree as ET
    
    tree = ET.parse("test.xml")
    root = tree.getroot()
    
    for page in root.findall('page'):
        print("Title: ", page.find('title').text)
        print("Content: ", page.find('content').text)
    

    Output:

    Title:  Chapter 1
    Content:  Welcome to Chapter 1
    Title:  Chapter 2
    Content:  Welcome to Chapter 2
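
Note that page.find('title').text raises AttributeError when a page has no such tag, because find() returns None. A minimal sketch of a more tolerant loop uses findtext() with a default value. The in-memory file standing in for test.xml and the page with a missing content tag are assumptions made for the example.

```python
import io
from xml.etree import ElementTree as ET

# Stand-in for test.xml; the second page deliberately lacks <content>.
xml_file = io.StringIO(
    "<root>"
    "<page><title>Chapter 1</title><content>Welcome to Chapter 1</content></page>"
    "<page><title>Chapter 2</title></page>"
    "</root>"
)

tree = ET.parse(xml_file)  # ET.parse accepts a filename or a file object
root = tree.getroot()

rows = []
for page in root.findall("page"):
    # findtext() returns the default instead of raising when the tag is absent.
    title = page.findtext("title", default="")
    content = page.findtext("content", default="")
    rows.append((title, content))
print(rows)
```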
    

    【Comments】:

    • cElementTree is no longer needed on supported Python versions (3.3+); use ElementTree instead.
    【Solution 3】:

    You can also try this code to extract the text:

    from bs4 import BeautifulSoup
    
    data ="""<page>
      <title>Chapter 1</title>
      <content>Welcome to Chapter 1</content>
    </page>
    <page>
     <title>Chapter 2</title>
     <content>Welcome to Chapter 2</content>
    </page>"""
    
    soup = BeautifulSoup(data, "html.parser")
    
    ########### Title #############
    required0 = soup.find_all("title")
    title = []
    for i in required0:
        title.append(i.get_text())
    
    ########### Content #############
    required0 = soup.find_all("content")
    content = []
    for i in required0:
        content.append(i.get_text())
    
    doc1 = list(zip(title, content))
    for i in doc1:
        print(i)
    

    Output:

    ('Chapter 1', 'Welcome to Chapter 1')
    ('Chapter 2', 'Welcome to Chapter 2')
    

    【Comments】:

      【Solution 4】:

      Personally, I prefer parsing with xml.dom.minidom, like this:

      In [18]: import xml.dom.minidom
      
      In [19]: x = """\
      <root><page>
        <title>Chapter 1</title>
        <content>Welcome to Chapter 1</content>
      </page>
      <page>
       <title>Chapter 2</title>
       <content>Welcome to Chapter 2</content>
      </page></root>"""
      
      In [28]: doc = xml.dom.minidom.parseString(x)
      In [29]: doc.getElementsByTagName("page")
      Out[29]: [<DOM Element: page at 0x94d5acc>, <DOM Element: page at 0x94d5c8c>]
      
      In [32]: [p.firstChild.wholeText for p in doc.getElementsByTagName("title") if p.firstChild.nodeType == p.TEXT_NODE]
      Out[32]: ['Chapter 1', 'Chapter 2']
      
      In [34]: [p.firstChild.wholeText for p in doc.getElementsByTagName("content") if p.firstChild.nodeType == p.TEXT_NODE]
      Out[34]: ['Welcome to Chapter 1', 'Welcome to Chapter 2']
      
      In [36]: for node in doc.childNodes:              # the <root> element
                   for cn in node.childNodes:           # <page> elements (plus whitespace text)
                       for cn2 in cn.childNodes:        # <title>/<content> elements (plus whitespace text)
                           for cn3 in cn2.childNodes:   # the text inside each element
                               if cn3.nodeType == cn3.TEXT_NODE:
                                   print(cn3.wholeText)
      Chapter 1
      Welcome to Chapter 1
      Chapter 2
      Welcome to Chapter 2
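
To keep each title paired with its content, the same minidom calls can be wrapped in a small helper. This is only a sketch: `child_text` is a hypothetical helper name, not part of minidom, and the document is the same two-page sample.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString(
    "<root>"
    "<page><title>Chapter 1</title><content>Welcome to Chapter 1</content></page>"
    "<page><title>Chapter 2</title><content>Welcome to Chapter 2</content></page>"
    "</root>"
)

def child_text(parent, tag):
    """Return the text of the first <tag> element under parent, or None."""
    matches = parent.getElementsByTagName(tag)
    if not matches:
        return None
    # Join the direct text nodes of the matched element.
    return "".join(
        node.data for node in matches[0].childNodes
        if node.nodeType == node.TEXT_NODE
    )

pages = [
    (child_text(page, "title"), child_text(page, "content"))
    for page in doc.getElementsByTagName("page")
]
print(pages)
```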
      

      【Comments】:

      • @qed root and doc are the same in this case. I have updated the code.
      【Solution 5】:

      I would recommend a simple library. Here is an example: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples

      from simplified_scrapy.simplified_doc import SimplifiedDoc
      html ='''
      <page>
        <title>Chapter 1</title>
        <content>Welcome to Chapter 1</content>
      </page>
      <page>
       <title>Chapter 2</title>
       <content>Welcome to Chapter 2</content>
      </page>'''
      doc = SimplifiedDoc(html)
      pages = doc.pages
      print ([(page.title.text,page.content.text) for page in pages])
      

      Result:

      [('Chapter 1', 'Welcome to Chapter 1'), ('Chapter 2', 'Welcome to Chapter 2')]
      

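      【Comments】:

        【Solution 6】:

        For processing (navigating, searching, and modifying) XML or HTML data, I find the BeautifulSoup library very useful. For installation issues or details, see the link.

        To look up tags and read one or more attribute values: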

        from bs4 import BeautifulSoup
        data = """<?xml version="1.0" encoding="UTF-8"?>
        <!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
        
        <pdf2xml producer="poppler" version="0.48.0">
        <page number="1" position="absolute" top="0" left="0" height="1188" width="918">
        <text top="246" left="135" width="178" height="16" font="1">PALS SOCIETY OF 
        CANADA</text>
        <text top="261" width="86" height="16" font="1">13479 77 AVE</text>
        </page>
        </pdf2xml>"""
        
        soup = BeautifulSoup(data, "lxml")
        page_tag = soup.find_all('page')
        details_tag = page_tag[0].find_all('text')
        details_tag_count = len(details_tag)
        for iter_text in range(details_tag_count):
            print("Text : ", details_tag[iter_text].text)
            print("Left tag : ", details_tag[iter_text].get("left"))
        

        Output:

        Text :  PALS SOCIETY OF CANADA
        Left tag :  135
        Text :  13479 77 AVE
        Left tag :  None
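
For comparison, the same attribute lookup can be sketched with the standard library's ElementTree instead of BeautifulSoup. This swaps in a different parser from the one this answer uses; the XML declaration and DOCTYPE are left out of the sample for brevity, and the whitespace normalization is an assumption about the desired output.

```python
from xml.etree import ElementTree as ET

data = """<pdf2xml producer="poppler" version="0.48.0">
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="246" left="135" width="178" height="16" font="1">PALS SOCIETY OF
CANADA</text>
<text top="261" width="86" height="16" font="1">13479 77 AVE</text>
</page>
</pdf2xml>"""

root = ET.fromstring(data)

results = []
for text_el in root.iter("text"):
    # Collapse the line break inside the first <text> element.
    text = " ".join(text_el.text.split())
    # .get() returns None when the attribute is absent.
    results.append((text, text_el.get("left")))
print(results)
```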
        

        【Comments】:
