Extracting text from XML using Python: answers

【Question title】: Extracting text from XML using Python
【Posted】: 2022-01-13 23:18:15
【Question】:

I have this sample XML file:

<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
</page>

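I would like to extract the contents of the title tags and the content tags.

Which approach is better for extracting the data: pattern matching or the xml module? Or is there a better way to extract it?

【Comments】:

Tags: python xml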

【Solution 1】:

Python already ships with a built-in XML library, ElementTree. For example:

>>> from xml.etree import ElementTree as ET
>>> xmlstr = """
... <root>
... <page>
...   <title>Chapter 1</title>
...   <content>Welcome to Chapter 1</content>
... </page>
... <page>
...  <title>Chapter 2</title>
...  <content>Welcome to Chapter 2</content>
... </page>
... </root>
... """
>>> root = ET.fromstring(xmlstr)
>>> for page in list(root):
...     title = page.find('title').text
...     content = page.find('content').text
...     print('title: %s; content: %s' % (title, content))
...
title: Chapter 1; content: Welcome to Chapter 1
title: Chapter 2; content: Welcome to Chapter 2

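【Comments】:

@SudeepKodavati: If you find Santa's answer satisfactory, please "accept" it.
I like this interface: you can index into child tags, root[0][1][0]..., and from any node get an iterator that walks all of its children! list(root[0][1].itertext()) is super handy!
cElementTree is no longer needed on supported Python versions (3.3+) and was removed in Python 3.9; use ElementTree instead.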
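
The comments above can be sketched together, assuming Python 3 and the same sample document used in this answer: the plain `ElementTree` import stands in for `cElementTree`, child elements are reached by index, and `itertext()` walks the text under a node. The variable names are illustrative only.

```python
from xml.etree import ElementTree as ET

xmlstr = """
<root>
<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
  <title>Chapter 2</title>
  <content>Welcome to Chapter 2</content>
</page>
</root>
"""

root = ET.fromstring(xmlstr)

# Index into children: root[0] is the first <page>, root[0][1] its <content>.
first_content = root[0][1].text

# itertext() yields every text fragment under a node, including the
# whitespace between tags, so strip the fragments and drop the empty ones.
second_page_texts = [t.strip() for t in root[1].itertext() if t.strip()]

print(first_content)
print(second_page_texts)
```

Positional indexing is convenient for quick exploration, but it depends on the element order; `find('title')` is the safer choice when the layout may vary.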

【Solution 2】:

Code:

from xml.etree import ElementTree as ET

tree = ET.parse("test.xml")
root = tree.getroot()

for page in root.findall('page'):
    print("Title: ", page.find('title').text)
    print("Content: ", page.find('content').text)

Output:

Title:  Chapter 1
Content:  Welcome to Chapter 1
Title:  Chapter 2
Content:  Welcome to Chapter 2

【Comments】:

cElementTree is no longer needed on supported Python versions (3.3+) and was removed in Python 3.9; use ElementTree instead.
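
One thing to watch for: the file exactly as posted in the question has two top-level `<page>` elements and no single root, so it is not well-formed XML and ElementTree raises a `ParseError` on it. Below is a minimal sketch of one workaround, assuming the fragments fit in memory and carry no XML declaration. The `parse_fragments` helper and the synthetic `<root>` wrapper are hypothetical, not part of ElementTree.

```python
from xml.etree import ElementTree as ET

fragments = """<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
</page>"""


def parse_fragments(text):
    """Hypothetical helper: wrap rootless XML fragments in a synthetic <root>."""
    return ET.fromstring("<root>" + text + "</root>")


# Parsing the fragments directly fails on the second top-level element.
try:
    ET.fromstring(fragments)
    parsed_directly = True
except ET.ParseError:
    parsed_directly = False

root = parse_fragments(fragments)

# findtext() returns the text of the first matching child element.
pairs = [(p.findtext("title"), p.findtext("content")) for p in root.findall("page")]

print(parsed_directly)
print(pairs)
```

If the file already has a single root element, as in Solution 1, `ET.parse("test.xml")` works as shown and no wrapping is needed.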

【Solution 3】:

You can also try this code to extract the text:

from bs4 import BeautifulSoup

data ="""<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
</page>"""

soup = BeautifulSoup(data, "html.parser")

# Collect the text of every <title> and <content> tag, in document order.
title = [tag.get_text() for tag in soup.find_all("title")]
content = [tag.get_text() for tag in soup.find_all("content")]

# Pair each title with the content at the same position.
for pair in zip(title, content):
    print(pair)

Output:

('Chapter 1', 'Welcome to Chapter 1')
('Chapter 2', 'Welcome to Chapter 2')

【Comments】:

【Solution 4】:

Personally, I prefer parsing with xml.dom.minidom, like this:

In [18]: import xml.dom.minidom

In [19]: x = """\
<root><page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
</page></root>"""

In [28]: doc = xml.dom.minidom.parseString(x)

In [29]: doc.getElementsByTagName("page")
Out[29]: [<DOM Element: page at 0x94d5acc>, <DOM Element: page at 0x94d5c8c>]

In [32]: [p.firstChild.wholeText for p in doc.getElementsByTagName("title") if p.firstChild.nodeType == p.TEXT_NODE]
Out[32]: ['Chapter 1', 'Chapter 2']

In [34]: [p.firstChild.wholeText for p in doc.getElementsByTagName("content") if p.firstChild.nodeType == p.TEXT_NODE]
Out[34]: ['Welcome to Chapter 1', 'Welcome to Chapter 2']

In [36]: for node in doc.childNodes:                # <root>
    ...:     for page in node.childNodes:           # <page> elements and whitespace
    ...:         for field in page.childNodes:      # <title>, <content>, whitespace
    ...:             for text in field.childNodes:  # text inside each field
    ...:                 if text.nodeType == text.TEXT_NODE:
    ...:                     print(text.wholeText)
    ...:
Chapter 1
Welcome to Chapter 1
Chapter 2
Welcome to Chapter 2

【Comments】:

@qed root and doc are the same in this case. I have updated the code.

【Solution 5】:

I would recommend a simple library. Here is an example: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
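
If the goal is one (title, content) pair per page rather than two separate lists, the minidom approach can be sketched as below. This assumes each `<page>` holds at most one `<title>` and one `<content>` with plain text inside; `child_text` is a hypothetical helper, not part of `xml.dom.minidom`.

```python
import xml.dom.minidom

x = """<root><page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
</page></root>"""


def child_text(parent, tag):
    """Hypothetical helper: text of the first <tag> under parent, or None."""
    nodes = parent.getElementsByTagName(tag)
    if not nodes:
        return None
    first = nodes[0].firstChild
    # Guard against empty elements and elements that start with a child tag.
    if first is None or first.nodeType != first.TEXT_NODE:
        return None
    return first.data


doc = xml.dom.minidom.parseString(x)
records = [
    (child_text(page, "title"), child_text(page, "content"))
    for page in doc.getElementsByTagName("page")
]
print(records)
```

Looking fields up per page keeps each title attached to its own content even when a page is missing one of the two tags.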

from simplified_scrapy.simplified_doc import SimplifiedDoc
html ='''
<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
</page>'''
doc = SimplifiedDoc(html)
pages = doc.pages
print ([(page.title.text,page.content.text) for page in pages])

Result:

[('Chapter 1', 'Welcome to Chapter 1'), ('Chapter 2', 'Welcome to Chapter 2')]

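【Comments】:

【Solution 6】:

For processing (navigating, searching, and modifying) XML or HTML data, I find the BeautifulSoup library very useful. For installation issues or details, see the link.

To look up a tag's attribute, or the values of several attributes: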

from bs4 import BeautifulSoup
data = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">

<pdf2xml producer="poppler" version="0.48.0">
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="246" left="135" width="178" height="16" font="1">PALS SOCIETY OF 
CANADA</text>
<text top="261" width="86" height="16" font="1">13479 77 AVE</text>
</page>
</pdf2xml>"""

soup = BeautifulSoup(data, "lxml")
page_tag = soup.find_all('page')
# Walk every <text> tag on the first page.
for text_tag in page_tag[0].find_all('text'):
    # Collapse the line break inside the first <text> tag into a single space.
    print("Text : ", " ".join(text_tag.text.split()))
    # .get() returns None when the attribute is missing.
    print("Left tag : ", text_tag.get("left"))

Output:

Text :  PALS SOCIETY OF CANADA
Left tag :  135
Text :  13479 77 AVE
Left tag :  None
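
For comparison, the same attribute lookup can be sketched with the standard library's ElementTree instead of BeautifulSoup. This is a swapped-in technique, not what the answer above uses, and it assumes a trimmed copy of the sample with the XML declaration and DOCTYPE left out. The `rows` list is illustrative only.

```python
from xml.etree import ElementTree as ET

data = """<pdf2xml producer="poppler" version="0.48.0">
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="246" left="135" width="178" height="16" font="1">PALS SOCIETY OF
CANADA</text>
<text top="261" width="86" height="16" font="1">13479 77 AVE</text>
</page>
</pdf2xml>"""

root = ET.fromstring(data)

rows = []
for text_el in root.iter("text"):
    # Collapse the line break inside the first <text> element.
    label = " ".join(text_el.text.split())
    # Element.get() returns None when the attribute is absent.
    rows.append((label, text_el.get("left")))

for label, left in rows:
    print("Text : ", label)
    print("Left tag : ", left)
```

As with BeautifulSoup's `get`, a missing attribute comes back as `None` rather than raising, so the second `<text>` element needs no special handling.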

【Comments】: