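【Posted】:2019-09-23 07:21:57
【Question】:
I am trying to structure the data from a text file into an XML file, marking up the parts of the text I want to tag with XML tags.
The problem: xml.etree.ElementTree fails to parse the string.
The code so far: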
import xml.etree.ElementTree as ET
# Read the whole input file into one string
with open('input/application_EN.txt', 'r') as f:
    application_text = f.read()
The first thing I want to do is tag the paragraphs. The text should look like this:
<description>
<paragraph id="1">
blabla
</paragraph>
<paragraph id="2">
blabla
</paragraph>
...
</description>
So far I have coded:
# splitting the text into paragraphs
list_of_paragraphs = application_text.splitlines()
# creating a new list where no_null paragraphs will be added
list_of_paragraphs_no_null = []
# counter of paragraphs of the XML file
j = 0
# Create the XML file with the paragraphs
for i, paragraph in enumerate(list_of_paragraphs):
    # Adding only the paragraphs different than ''
    if paragraph != '':
        j = j + 1
        # be careful with the space after and before the tag.
        # Adding the XML tags per paragraph
        xml_element = '<paragraph id="' + str(j) + '">' + paragraph.strip() + ' </paragraph>'
        list_of_paragraphs_no_null.append(xml_element)
# Assumed step (not shown in the original snippet): wrap the tagged
# paragraphs in the <description> root element
description_text = '<description>' + ''.join(list_of_paragraphs_no_null) + '</description>'
# Now I pass the whole string to the XML constructor
root = ET.fromstring(description_text)
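I get this error:
not well-formed (invalid token): line 1, column 6
After some investigation, I realized that the error is caused by the text containing the symbol "&". Adding and removing "&" in several places confirmed it.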
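The finding can be reproduced in isolation. The following is a minimal sketch, assuming a one-line sample string (`raw` and `escaped` are illustrative names, not part of the original script). In XML a bare `&` starts an entity reference, so the parser rejects it unless it is written as `&amp;`.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-line sample; the real input comes from application_EN.txt.
raw = '<paragraph id="1">duPont de Nemeurs & Co.</paragraph>'

try:
    ET.fromstring(raw)
    parsed_raw = True
except ET.ParseError as err:
    parsed_raw = False
    # err.position is a (line, column) tuple locating the offending token.
    print('ParseError:', err, 'at', err.position)

# The same markup with the ampersand written as the entity &amp; parses,
# and the parser hands back the original character in the element text.
escaped = raw.replace('&', '&amp;')
element = ET.fromstring(escaped)
print(element.text)
```

The text read back from the parsed element contains the plain "&" again, so the entity only exists in the serialized markup.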
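The question is: why? Why is "&" not treated as text, and what can I do about it?
I know I could replace every "&", but then I would lose information, because "& Co" is a fairly important string. I want the text to stay intact (content unchanged).
Suggestions?
Thanks.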
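One option that keeps the string-concatenation approach is to escape each paragraph before wrapping it in tags. This is only a sketch under assumptions: the sample text below is a made-up stand-in for the file contents, and `tagged` is an illustrative name. `xml.sax.saxutils.escape` is the standard-library helper that rewrites `&`, `<` and `>` as entities.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical stand-in for the contents of application_EN.txt.
application_text = 'First paragraph.\n\nSold by E.I. duPont de Nemeurs & Co.\n'

tagged = []
for paragraph in application_text.splitlines():
    if paragraph.strip():
        # escape() turns &, < and > into &amp;, &lt; and &gt;
        tagged.append('<paragraph id="{}">{}</paragraph>'.format(
            len(tagged) + 1, escape(paragraph.strip())))

description_text = '<description>' + ''.join(tagged) + '</description>'
root = ET.fromstring(description_text)

# The parsed text contains the original "&", so no information is lost.
texts = [p.text for p in root.findall('paragraph')]
print(texts)
```

The escaping changes only the serialized form; after parsing, "& Co." comes back unchanged.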
Edit: To make this easier, here is the beginning of the text I am working with (instead of opening a file, you can paste it in to check):
application_text = '''Language=English
Has all kind of kind of references. also measures.
Photovoltaic solar cells for directly converting radiant energy from the sun into electrical energy are well known. The manufacture of photovoltaic solar cells involves provision of semiconductor substrates in the form of sheets or wafers having a shallow p-n junction adjacent one surface thereof (commonly called the "front surface"). Such substrates may include an insulating anti-reflection ("AR") coating on their front surfaces, and are sometimes referred to as "solar cell wafers". The anti-reflection coating is transparent to solar radiation. In the case of silicon solar cells, the AR coating is often made of silicon nitride or an oxide of silicon or titanium. Such solar cells are manufactured and sold by E.I. duPont de Nemeurs & Co.'''
As you can see at the end, there is the string "& Co.". This is what causes the trouble.
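Another possibility is to avoid building markup by hand and let ElementTree create the elements, in which case the escaping happens during serialization. The sketch below assumes one paragraph per non-empty line; `build_description` and `sample` are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET


def build_description(text):
    """Return a <description> element with one <paragraph> per non-empty line."""
    root = ET.Element('description')
    count = 0
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        count += 1
        paragraph = ET.SubElement(root, 'paragraph', id=str(count))
        # The text is stored as-is; "&" is escaped only when serializing.
        paragraph.text = line
    return root


# Hypothetical shortened sample ending in the problematic string.
sample = 'Language=English\n\nSold by E.I. duPont de Nemeurs & Co.'
root = build_description(sample)
xml_string = ET.tostring(root, encoding='unicode')
print(xml_string)
```

Parsing `xml_string` again with `ET.fromstring` yields paragraph text with the plain "& Co.", so the content stays intact in both directions.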
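【Comments】:
-
Have you looked at BeautifulSoup?
-
No. I have read a little about it. But do you mean using BS to create the XML file? This
-
If you are creating, you can use BS and then dump it to a file. If you are reading, you can load a file and then query it accordingly. I don't see anything about & in your post, so you will need to give us sample data to help us understand the specifics.
-
@fallenreaper But do you mean you can use BS to dump the text into an XML-formatted file?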