Create multiple nodes having the same name with sub nodes: answers

【Question title】: Create multiple nodes having the same name with sub nodes
【Posted】: 2019-03-11 23:31:49
【Question】:

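I have a text file that I parse in Python with the xml.etree.cElementTree library. The input is a paragraph <p> made up of sentences <s>, and each sentence is made up of words <w>. This is what the text file looks like: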

This
is
my
first
sentence.
This
is
my
second
sentence.

As output I want the following XML file:

<p>
   <s>
      <w>this</w>
      <w>is</w>
      <w>my</w>
      <w>first</w>
      <w>sentence</w>
      <pc>.</pc>
   </s>
   <s>
      <w>this</w>
      <w>is</w>
      <w>my</w>
      <w>second</w>
      <w>sentence</w>
      <pc>.</pc>
   </s>
</p>

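I wrote the Python code below. It gives me the paragraph tag and the word tags, but I don't know how to handle the case of multiple <s> tags. A sentence starts with a capital letter and ends with a period. My Python code: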

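import re
import xml.etree.ElementTree as ET  # xml.etree.cElementTree was removed in Python 3.9

# setup not shown in the original post (assumed)
p = ET.Element("p")
tree = ET.ElementTree(p)

source_file = open("file.txt", "r")
for line in source_file:
    # catch punctuation: . and , and ! and ? and ()
    if re.match(r"(\(|\)|\.|\,|\!)", str(line)):
        ET.SubElement(p, "pc").text = line
    else:
        ET.SubElement(p, "w").text = line

tree.write("my_file.xml", encoding="UTF-8", xml_declaration=True)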

It produces the following XML output:

<?xml version="1.0" encoding="UTF-8"?>
<p>
   <w>this</w>
   <w>is</w>
   <w>my</w>
   <w>first</w>
   <w>sentence</w>
   <pc>.</pc>
   <w>this</w>
   <w>is</w>
   <w>my</w>
   <w>second</w>
   <w>sentence</w>
   <pc>.</pc>
</p>

The problem I'm facing is that I can't create a new <s> tag for each new sentence. Is there a way to do this in Python with the xml library?

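【Comments】:

How do you identify a new sentence? Is each line a new sentence, or does each period mark one? You could use something like this: s = ET.Element('s') and then w = ET.SubElement(s, 'w')
A new sentence is defined as starting with a word whose first letter is capitalized and ending with a <pc> tag that contains a period. I tried what you suggested, but while looping over the lines, how can I use the s = ET.Element('s') created earlier for the second sentence?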
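
The second comment asks how a loop can move on from an `s` created for an earlier sentence. A minimal sketch of the idea, assuming the standard `xml.etree.ElementTree` module: `ET.SubElement` returns the element it creates, so rebinding the variable `s` sends later words into the newest sentence while the earlier one stays attached to `<p>`. The sample words are made up for illustration.

```python
import xml.etree.ElementTree as ET

p = ET.Element("p")

# SubElement appends a new <s> to <p> and returns it.
s = ET.SubElement(p, "s")
ET.SubElement(s, "w").text = "first"

# Rebinding s starts a second <s>; the first one stays attached to <p>.
s = ET.SubElement(p, "s")
ET.SubElement(s, "w").text = "second"

xml_text = ET.tostring(p, encoding="unicode")
print(xml_text)
```

Inside a loop, the same rebinding would happen whenever the sentence-boundary condition is met.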

Tags: python xml celementtree

【Solution 1】:

Basically, you need some logic to identify a new sentence. Leaving out the obvious parts, it should look something like this:

import os
import re

# ET, p and source_file are defined as in the question's code
eos = False  # True once punctuation has closed the current sentence
s = ET.SubElement(p, 's')  # first sentence
for line in source_file:
    line = str(line).rstrip(os.linesep)  # remove the newline character at the end of each line
    # catch punctuation: . and , and ! and ? and ()
    if re.match(r"(\(|\)|\.|\,|\!)", line):  # don't think this matches 'sentence.', you will need to verify
        ET.SubElement(s, "pc").text = line
        eos = True
    else:
        # a capitalized word right after punctuation starts a new <s>
        if eos and line.strip() and line[0].isupper():
            s = ET.SubElement(p, 's')
        eos = False
        ET.SubElement(s, "w").text = line

Also, your regular expression may need fixing.
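
One possible way to address the regular expression, sketched under assumptions rather than taken from the answer: split each line into word and punctuation tokens with `re.findall`, so that a line such as `sentence.` yields `sentence` and `.`. The helper name `build_paragraph`, the token pattern, and the choice of `.`, `!` and `?` as sentence terminators are all hypothetical. Unlike the snippet above, this sketch opens a new `<s>` after any terminator regardless of capitalization, and it keeps the original casing of the words.

```python
import re
import xml.etree.ElementTree as ET

# Assumed token pattern: a run of word characters, or one non-space symbol.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")
# Assumed sentence terminators.
SENTENCE_END = {".", "!", "?"}


def build_paragraph(lines):
    """Group tokens from an iterable of lines into <p><s>...</s></p>."""
    p = ET.Element("p")
    s = None  # current sentence; created lazily on the first token
    for line in lines:
        for token in TOKEN_RE.findall(line):
            if s is None:
                s = ET.SubElement(p, "s")
            if re.match(r"\w", token):
                ET.SubElement(s, "w").text = token
            else:
                ET.SubElement(s, "pc").text = token
                if token in SENTENCE_END:
                    s = None  # the next token opens a new <s>
    return p


# Sample lines mirroring the text file from the question.
sample = ["This\n", "is\n", "my\n", "first\n", "sentence.\n",
          "This\n", "is\n", "my\n", "second\n", "sentence.\n"]
p = build_paragraph(sample)
print(ET.tostring(p, encoding="unicode"))
```

To write the result to disk, `p` can be wrapped as `ET.ElementTree(p)` and saved with `write(...)` as in the question. On Python 3.9 and later, calling `ET.indent(p)` first produces indented output. Note that this token pattern also splits words at apostrophes and hyphens, which may or may not be what the data needs.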

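【Comments】:

Thanks, that was the missing piece of the puzzle. I tried adding flags like you did to track the end of a sentence and ended up with spaghetti code, haha. After reworking my code and applying your implementation, it works like a charm! Thanks again.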