【Question Title】: Python - How to nest file read loops?
【Posted】: 2012-07-20 19:20:43
【Question】:

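I started with Python (and programming in general) two days ago, and today I am stuck. I have spent hours looking for the answer to a question that I suspect is so trivial that nobody else has ever been stuck on it :)

My boss wants me to manually clean up a HUGE .xml file to make it easier to read. I am trying to write a script to do it for me. Below is a sample of the .xml file and the output I want.

Input (File.xml):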

<IssueTracking>
  <Issue>
    <SequenceNum>123</SequenceNum>
    <Subject>Subject of Ticket 123</Subject>
    <Description>Line 1 in Description field of Ticket 123.
Line 2 in Description field of Ticket 123.
Line 3 in Description field of Ticket 123.</Description>
  </Issue>
  <Issue>
    <SequenceNum>124</SequenceNum>
    <Subject>Subject of Ticket 124</Subject>
    <Description>Line 1 in Description field of Ticket 124.
Line 2 in Description field of Ticket 124.
Line 3 in Description field of Ticket 124.</Description>
  </Issue>
</IssueTracking>

Desired output:

123    Subject of Ticket 123
Line 1 in Description field of Ticket 123.
Line 2 in Description field of Ticket 123.
Line 3 in Description field of Ticket 123.

124    Subject of Ticket 124
Line 1 in Description field of Ticket 124.
Line 2 in Description field of Ticket 124.
Line 3 in Description field of Ticket 124.

Here is what I have so far.

with open("File.xml", "r") as SourceFile: # Opens the file (the name must be a quoted string)
    while 1: # Keep going through the file to the end
        SourceFileLine = SourceFile.readline() # Saves lines of the source file
        if not SourceFileLine: # readline() returns '' only at end of file, so stop here
            break

        SourceFileLine = SourceFileLine.strip() # Strips the whitespace

        if "<SequenceNum>" in SourceFileLine:
            SequenceNum = SourceFileLine[13:-14]  # Trims the tags, saves the field.
            continue

        if "<Subject>" in SourceFileLine:
            Subject = SourceFileLine[9:-10]
            continue

        #if "<Description>" in SourceFileLine:
        #    last_pos = SourceFile.tell() 
        #    while "</Description>" not in SourceFileLine:
        #        SourceFile.seek(last_pos)
        #        ?????
        #    
        #    Description = Description[22:]
        #    continue

        if "</Issue>" in SourceFileLine:
            print(SequenceNum, end = "\t")
            print(Subject)
        #    print(Description)
            print("\n")

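I am stuck on identifying the three lines between the <Description> tags and saving them as a single string that I can print before moving on through the source file. Having scanned dozens of other file-reading loop examples, I suspect I need to mark the point where I reach the target field and nest another read loop at that point in the file. But I have not found another example that does this, so I assume I am either missing something basic or there is a better way. Thanks in advance for your help!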
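The nested read described above can be sketched as follows. A file object is its own iterator, so an inner loop can keep pulling lines from the same handle until the closing tag appears, and the outer loop then resumes where the inner one stopped; no `tell()` or `seek()` is needed. This is only a minimal sketch: the helper name `read_issues` and the use of `io.StringIO` as a stand-in for the real file are assumptions for illustration, and the parsing assumes each tag starts its own line exactly as in the sample.

```python
import io

SAMPLE = """<IssueTracking>
  <Issue>
    <SequenceNum>123</SequenceNum>
    <Subject>Subject of Ticket 123</Subject>
    <Description>Line 1 in Description field of Ticket 123.
Line 2 in Description field of Ticket 123.
Line 3 in Description field of Ticket 123.</Description>
  </Issue>
  <Issue>
    <SequenceNum>124</SequenceNum>
    <Subject>Subject of Ticket 124</Subject>
    <Description>Line 1 in Description field of Ticket 124.</Description>
  </Issue>
</IssueTracking>
"""


def read_issues(source):
    """Yield (sequence_num, subject, description) for each <Issue> block.

    `source` is any open file-like object (hypothetical helper, not from the post).
    """
    lines = iter(source)  # one shared iterator for the outer and inner loops
    sequence_num = subject = description = ""
    for line in lines:
        line = line.strip()
        if line.startswith("<SequenceNum>"):
            sequence_num = line[len("<SequenceNum>"):-len("</SequenceNum>")]
        elif line.startswith("<Subject>"):
            subject = line[len("<Subject>"):-len("</Subject>")]
        elif line.startswith("<Description>"):
            text = line[len("<Description>"):]
            if "</Description>" not in text:
                # Nested loop: continues from the current position in the file.
                for more in lines:
                    text += "\n" + more.strip()
                    if "</Description>" in more:
                        break
            # Keep everything before the closing tag.
            description = text.split("</Description>")[0]
        elif line.startswith("</Issue>"):
            yield sequence_num, subject, description


issues = list(read_issues(io.StringIO(SAMPLE)))
for sequence_num, subject, description in issues:
    print(sequence_num, subject, sep="\t")
    print(description)
    print()
```

With a real file, `read_issues(open_file)` would be called inside the existing `with open(...)` block in place of `io.StringIO(SAMPLE)`.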

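【Comments】:

  • Python has a built-in XML parser: docs.python.org/library/pyexpat.html
  • +1 for including the input, the desired output, and what you have tried.
  • Once you have extracted the data, you should probably write it out with a human-friendly serializer such as YAML. You never know when you will need to process this data again.

Tags: python loops readline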


【Solution 1】:

I strongly recommend lxml. Here is an example that processes your data. (Note: written for Py2.x, but easy to adapt to Py3.x.)

from lxml import etree
xml = """<IssueTracking>
  <Issue>
    <SequenceNum>123</SequenceNum>
    <Subject>Subject of Ticket 123</Subject>
    <Description>Line 1 in Description field of Ticket 123.
Line 2 in Description field of Ticket 123.
Line 3 in Description field of Ticket 123.</Description>
  </Issue>
  <Issue>
    <SequenceNum>124</SequenceNum>
    <Subject>Subject of Ticket 124</Subject>
    <Description>Line 1 in Description field of Ticket 124.
Line 2 in Description field of Ticket 124.
Line 3 in Description field of Ticket 124.</Description>
  </Issue>
</IssueTracking>
"""

root = etree.fromstring(xml)
for issue in root.findall('Issue'):
    as_list = [issue.find(n).text for n in ('SequenceNum', 'Subject', 'Description')]
    as_list[2] = as_list[2].split('\n')
    print(as_list)  # the parenthesized form runs on both Py2.x and Py3.x

Prints:

['123', 'Subject of Ticket 123', ['Line 1 in Description field of Ticket 123.', 'Line 2 in Description field of Ticket 123.', 'Line 3 in Description field of Ticket 123.']]
['124', 'Subject of Ticket 124', ['Line 1 in Description field of Ticket 124.', 'Line 2 in Description field of Ticket 124.', 'Line 3 in Description field of Ticket 124.']]
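To get from these lists to the layout the question asks for, a sketch along the following lines may help. It swaps in the standard library's `xml.etree.ElementTree`, which offers the same `fromstring`, `findall` and `findtext` calls used here, so that it runs without installing anything. The function name `format_issues` and the tab between the number and the subject are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

XML = """<IssueTracking>
  <Issue>
    <SequenceNum>123</SequenceNum>
    <Subject>Subject of Ticket 123</Subject>
    <Description>Line 1 in Description field of Ticket 123.
Line 2 in Description field of Ticket 123.
Line 3 in Description field of Ticket 123.</Description>
  </Issue>
  <Issue>
    <SequenceNum>124</SequenceNum>
    <Subject>Subject of Ticket 124</Subject>
    <Description>Line 1 in Description field of Ticket 124.
Line 2 in Description field of Ticket 124.
Line 3 in Description field of Ticket 124.</Description>
  </Issue>
</IssueTracking>
"""


def format_issues(xml_text):
    """Return one text block per <Issue>, separated by blank lines."""
    root = ET.fromstring(xml_text)
    blocks = []
    for issue in root.findall("Issue"):
        # findtext returns the element's text, or the default if the tag is missing.
        sequence_num = issue.findtext("SequenceNum", default="")
        subject = issue.findtext("Subject", default="")
        description = issue.findtext("Description", default="")
        blocks.append("{}\t{}\n{}\n".format(sequence_num, subject, description))
    return "\n".join(blocks)


report = format_issues(XML)
print(report)
```

The multi-line description needs no special handling because the parser returns the whole text of the element, line breaks included.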

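【Comments】:

【Solution 2】:

Please don't read XML files like this; Python has a variety of libraries that help with reading XML.

Take a look at the Python library lxml. It provides a very simple way to read and parse XML files and will greatly improve your code.

I would explain how to use the library myself, but their documentation is far better than anything I could squeeze into this text area: http://lxml.de/tutorial.html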
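Since the file in question is described as huge, an incremental parse may be worth considering. The sketch below uses `iterparse` from the standard library's `xml.etree.ElementTree` rather than lxml (lxml.etree provides a similarly named `iterparse`), and handles one `<Issue>` at a time instead of loading the whole tree. The helper name `iter_issues` and the temporary demo file are assumptions for illustration only.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def iter_issues(path):
    """Yield one dict per <Issue> element, parsing the file incrementally."""
    for _event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "Issue":
            # Map each child tag (SequenceNum, Subject, ...) to its text.
            yield {child.tag: (child.text or "") for child in elem}
            # Release the issue's children; the emptied element stays in the tree.
            elem.clear()


# Demo on a small temporary file standing in for the real File.xml.
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as tmp:
    tmp.write(
        "<IssueTracking><Issue><SequenceNum>123</SequenceNum>"
        "<Subject>Subject of Ticket 123</Subject>"
        "<Description>Line 1.\nLine 2.</Description></Issue></IssueTracking>"
    )
    demo_path = tmp.name

issues = list(iter_issues(demo_path))
os.remove(demo_path)

for issue in issues:
    print(issue["SequenceNum"], issue["Subject"], sep="\t")
    print(issue["Description"])
```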

【Comments】:

• Thanks, I will look into this and figure it out. I appreciate the help.