Resuming a nested for-loop: answers

[Question title]: Resuming a nested for-loop
[Posted]: 2013-11-28 16:50:05
[Question description]:

Two files. One has broken data, the other has the fixes. Broken:

ID 0
T5 rat cake
~EOR~
ID 1
T1 wrong segg
T2 wrong nacob
T4 rat tart
~EOR~
ID 3
T5 rat pudding
~EOR~
ID 4
T1 wrong sausag
T2 wrong mspa
T3 strawberry tart 
~EOR~
ID 6
T5 with some rat in it 
~EOR~

Fixes:

ID 1
T1 eggs
T2 bacon
~EOR~
ID 4
T1 sausage
T2 spam
T4 bereft of loif
~EOR~

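EOR means end of record. Note that the broken file contains more records than the fix file, and that the fix file has both tags to fix (T1, T2 etc. are tags) and tags to add. This code does exactly what it is supposed to do: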

# foobar.py

import codecs

source = 'foo.dat'
target = 'bar.dat' 
result = 'result.dat'  

with codecs.open(source, 'r', 'utf-8_sig') as s, \
     codecs.open(target, 'r', 'utf-8_sig') as t, \
     codecs.open(result, 'w', 'utf-8_sig') as u: 

    sID = sT1 = sT2 = sT4 = ''
    RecordFound = False

    # get source data, record by record
    for sline in s:
        if sline.startswith('ID '):
            sID = sline
        if sline.startswith('T1 '):
            sT1 = sline
        if sline.startswith('T2 '):
            sT2 = sline
        if sline.startswith('T4 '):
            sT4 = sline
        if sline.startswith('~EOR~'):
            for tline in t: 
                # copy target file lines, replacing when necessary
                if tline == sID:
                    RecordFound = True
                if tline.startswith('T1 ') and RecordFound:
                    tline = sT1
                if tline.startswith('T2 ') and RecordFound:
                    tline = sT2 
                if tline.startswith('~EOR~') and RecordFound:
                    if sT4:
                        tline = sT4 + tline
                    RecordFound = False
                    u.write(tline)
                    break

                u.write(tline)

    for tline in t:
        u.write(tline)

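I'm writing to a new file because I don't want to mess up the other two. The first outer for loop finishes at the last record in the fix file. At that point, there are still records left to write from the target file. That is what the last for clause does.

What nags me is that this last for loop implicitly picks up where the inner for loop was last broken out of. It's as if it should say 'for the rest of tline in t'. On the other hand, I don't see how I could do this with fewer (or not many more) lines of code, using dicts and what have you. Should I worry at all?

Please comment.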

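[Question comments]:

I would create a counter "tPosition" that is incremented each time through the relevant loop. Then, when you want to say "for the rest of tline in t", you can show that you want to loop over something like: for tline in t[tPosition:]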
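A file object cannot be sliced with t[tPosition:], so the counter idea only works if the lines are read into a list first. The sketch below assumes exactly that; io.StringIO and the sample lines are stand-ins for illustration, not part of the original code.

```python
import io

# io.StringIO stands in for the open target file (illustrative content).
t = io.StringIO("a\nb\nc\nd\n")
lines = t.readlines()  # a list can be sliced, a file object cannot

tPosition = 0
for tline in lines:
    tPosition += 1
    if tline == "b\n":
        break  # stop early, as the inner loop in the question does

# "for the rest of tline in t", spelled out with the counter
remaining = lines[tPosition:]
```

The cost is that the whole file is held in memory, which the original loop avoids.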

Tags: python for-loop coding-style

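[Solution 1]:

I wouldn't worry. In your example, t is a file handle and you are iterating over it. File handles in Python are their own iterators: they hold state about how far they have read into the file, and they keep their place as you iterate over them. See the Python documentation for file.next() for more information.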
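A minimal sketch of that behaviour, using io.StringIO as a stand-in for an open file (the sample lines are borrowed from the question's data purely for illustration):

```python
import io

# io.StringIO stands in for an open file; the content is illustrative only.
t = io.StringIO("ID 0\nT5 rat cake\n~EOR~\nID 1\nT1 wrong segg\n~EOR~\n")

first_record = []
for tline in t:
    first_record.append(tline)
    if tline.startswith('~EOR~'):
        break  # leave the loop; t remembers how far it has read

# A second loop over the same object continues after the break point.
rest = [tline for tline in t]
```

Here rest holds only the lines of the second record, which is the "for the rest of tline in t" behaviour the question asks about.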

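See also another SO answer about iterators: What does the "yield" keyword do in Python?. There is a lot of useful information there!

Edit: Here is another way to combine the files, using dictionaries. This approach may be preferable if you want to make other modifications to the records before output: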

import sys

def get_records(source_lines):
    records = {}
    current_id = None
    for line in source_lines:
        if line.startswith('~EOR~'):
            continue
        # Split the line up on the first space
        tag, val = [l.rstrip() for l in line.split(' ', 1)]
        if tag == 'ID':
            current_id = val
            records[current_id] = {}
        else:
            records[current_id][tag] = val
    return records

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        broken = get_records(f)
    with open(sys.argv[2]) as f:
        fixed = get_records(f)

    # Merge the broken and fixed records
    repaired = broken
    for id in fixed.keys():
        repaired[id] = dict(broken[id].items() + fixed[id].items())

    with open(sys.argv[3], 'w') as f:
        for id, tags in sorted(repaired.items()):
            f.write('ID {}\n'.format(id))
            for tag, val in sorted(tags.items()):
                f.write('{} {}\n'.format(tag, val))
            f.write('~EOR~\n')

The dict(broken[id].items() + fixed[id].items()) part (Python 2 syntax) makes use of this: How to merge two Python dictionaries in a single expression?
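The merge expression relies on Python 2, where items() returns lists. In Python 3, items() returns view objects that cannot be joined with +. A sketch of equivalent merges follows; the variable names are made up and the sample dicts are modelled on record ID 1 from the question.

```python
# Sample tag dictionaries modelled on record ID 1 (names are illustrative).
broken_tags = {'T1': 'wrong segg', 'T2': 'wrong nacob', 'T4': 'rat tart'}
fixed_tags = {'T1': 'eggs', 'T2': 'bacon'}

# Python 3.5+: later keys win, so fixed_tags overrides broken_tags.
merged = {**broken_tags, **fixed_tags}

# Two-step form that works on Python 2 and 3 alike.
merged_copy = dict(broken_tags)
merged_copy.update(fixed_tags)
```

Both forms build a new dictionary and leave broken_tags untouched.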

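[Discussion]:

Thanks! The link to file.next() has the confirmation I was looking for. I had already come across the yield explanation.
Simply skipping the ~EOR~ lines is not good. If there is no "ID" after an ~EOR~ line, you will corrupt the data. In that case you need to raise.
In the line repaired = broken, repaired is just an alias for broken (it is the same dict), so you mutate the original data. This style of code always leads to bugs in later development. You need a deep copy of broken, or you have to rename those variables.
Also, id is a built-in, and you are shadowing it. tag, val = [l.rstrip() for l in line.split(' ', 1)] will raise on lines with removed data: 'T1 ' -> raise
Thanks for the comments @akaRem. This is a toy example to show some Python syntax and semantics the OP might not know. I assumed all input was well-formed and left out error checking.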
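The aliasing point raised in the comments above can be checked with a short sketch. The sample data is made up, and copy.deepcopy is just one way to get an independent copy.

```python
import copy

broken = {'1': {'T1': 'wrong segg'}}

alias = broken                       # a second name for the same dict
independent = copy.deepcopy(broken)  # separate copy, nested dicts included

# Changing the record through the alias also changes broken.
alias['1']['T1'] = 'eggs'
```

After the assignment, broken['1']['T1'] is 'eggs' as well, while independent still holds the original value.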

[Solution 2]:

# building initial storage

content = {}
record = {}
order = []
current = None

with open('broken.file', 'r') as f:
    for line in f:
        # strip the newline so keys and values compare cleanly
        items = line.rstrip('\n').split(' ', 1)
        try:
            key, value = items
        except ValueError:
            # a line without a space, e.g. '~EOR~'
            key, = items
            value = None

        if key == 'ID':
            current = value
            order.append(current)
            content[current] = record = {}
        elif key == '~EOR~':
            current = None
            record = {}
        else:
            record[key] = value

# patching

with open('patches.file', 'r') as f:
    for line in f:
        items = line.rstrip('\n').split(' ', 1)
        try:
            key, value = items
        except ValueError:
            key, = items
            value = None

        if key == 'ID':
            current = value
            record = content[current]  # updates existing records only!
            # if there is no such id -> raises

            # alternatively you may check and add them to the end of list
            # if current in content: 
            #     record = content[current]
            # else:
            #     order.append(current)
            #     content[current] = record = {}

        elif key == '~EOR~':
            current = None
            record = {}
        else:
            record[key] = value

# patched!
# write-out

with open('output.file', 'w') as f:
    for current in order:
        f.write('ID ' + current + '\n')
        record = content[current]
        for key in sorted(record.keys()):
            f.write(key + ' ' + (record[key] or '') + '\n')
        # close each record with its end-of-record marker
        f.write('~EOR~\n')

# job's done

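Questions?

[Discussion]:

Thanks. I like your approach to processing the records. I guess it is indeed more Pythonic than mine. (I find Pythonicness a rather difficult subject. There is a lot you can use, but you can't find it by yourself.) Your code crashes on the EOR lines with "need more than 1 value to unpack", and I guess curent should be current, but that doesn't matter. No need for discussion.
@RolfBly I wrote this on a PC without Python installed, so... I didn't test it. Sorry for the errors. I'll fix them.
@RolfBly I've added the fixes
It took me a while before I got around to trying your code. The first record[key] = value results in TypeError: 'NoneType' object does not support item assignment. And of course, the output file stays empty.
Also, the second for loop doesn't do any patching.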

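[Solution 3]:

For the sake of completeness, and just to share my enthusiasm and what I learned, below is the code I now work with. It answers my OP, and more.

It is partly based on akaRem's approach above. A single function fills a dict. It is called twice, once for the fix file and once for the file to be fixed.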

import codecs, collections
from GetInfiles import *

sourcefile, targetfile = GetInfiles('dat')
    # GetInfiles reads two input parameters from the command line,
    # verifies they exist as files with the right extension, 
    # and then returns their names. Code not included here. 

resultfile = targetfile[:-4] + '_result.dat'  

def recordlist(infile):
    record = collections.OrderedDict()
    reclist = []

    with codecs.open(infile, 'r', 'utf-8_sig') as f:
        for line in f:
            try:
                key, value = line.split(' ', 1)

            except ValueError:
                key = line 
                # so this line must be '~EOR~\n'. 
                # All other lines must have the shape 'tag content\n'
                # so if this errors, there's something wrong with an input file

            if not key.startswith('~EOR~'):
                try: 
                    record[key].append(value)
                except KeyError:
                    record[key] = [value]

            else:
                reclist.append(record)
                record = collections.OrderedDict()

    return reclist

# put files into ordered dicts            
source = recordlist(sourcefile)
target = recordlist(targetfile)

# patching         
for fix in source:
    for record in target:
        if fix['ID'] == record['ID']:
            record.update(fix)

# write-out            
with codecs.open(resultfile, 'w', 'utf-8_sig') as f:
    for record in target:
        for tag, field in record.items():  # items() works on Python 2 and 3
            for occ in field: 
                line = u'{} {}'.format(tag, occ)
                f.write(line)

        f.write('~EOR~\n')

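It is an ordered dict now. This wasn't in my OP, but the files need to be cross-checked by humans, so keeping the order makes that easier. (Using OrderedDict is really easy. My first attempts to find this functionality turned up something whose documentation worried me: no examples, intimidating jargon...)

Also, it now supports multiple occurrences of any given tag within a record. This wasn't in my OP either, but I needed it. (The format is called 'Adlib tagged'; it belongs to cataloguing software.)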
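Collecting repeated tags into lists can also be written with setdefault instead of try/except KeyError. This is only a sketch of an alternative, not the code above, and the sample lines are made up.

```python
import collections

# Made-up sample lines in the 'tag content\n' shape, with T1 occurring twice.
lines = ['ID 1\n', 'T1 eggs\n', 'T1 more eggs\n', 'T2 bacon\n']

record = collections.OrderedDict()
for line in lines:
    key, value = line.split(' ', 1)
    # setdefault returns the existing list, or stores and returns a new one
    record.setdefault(key, []).append(value)
```

Each tag maps to a list of its occurrences, in the order the tags first appeared.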

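What differs from akaRem's approach is the patching, which uses update on the target dict. I find this, like so much in Python, really elegant. The same goes for startswith. Those are two more reasons I couldn't resist sharing it.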
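A small sketch of what update does to one record, using data modelled on record ID 4 from the question (the literal values are illustrative):

```python
import collections

record = collections.OrderedDict([
    ('ID', ['4\n']),
    ('T1', ['wrong sausag\n']),
    ('T3', ['strawberry tart\n']),
])
fix = collections.OrderedDict([
    ('ID', ['4\n']),
    ('T1', ['sausage\n']),
    ('T4', ['bereft of loif\n']),
])

# Existing tags keep their position, new tags are appended at the end.
record.update(fix)
```

Note that update replaces the whole list stored under a tag, so a fix for a repeated tag has to supply every occurrence it wants to keep.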

I hope it is useful.

[Discussion]: