python - Read file from and to specific lines of text

【Question title】: python - Read file from and to specific lines of text
【Posted】: 2011-09-26 18:17:41
【Question】:

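I am not talking about specific line numbers, because I am reading multiple files that share the same format but vary in length.
Say I have this text file: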

Something here...  
... ... ...   
Start                      #I want this block of text 
a b c d e f g  
h i j k l m n  
End                        #until this line of the file
something here...  
... ... ...

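I hope you see what I mean. I was thinking of iterating through the file, searching with a regular expression to find the line numbers of "Start" and "End", and then using linecache to read from the Start line to the End line. But how do I get the line numbers? What function can I use?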

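【Comments】:

This question is very similar to stackoverflow.com/questions/7098530/…
It is also similar to stackoverflow.com/a/9222120/2641825, which has a nice answer using a regular expression. The call is re.findall(r'Start(.*?)End',data,re.DOTALL), similar to @pyInTheSky's answer below.

Tags: python file linecache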

【Solution 1】:

If you simply want the block of text between Start and End, you can do something as simple as:

with open('test.txt') as input_data:
    # Skips text before the beginning of the interesting block:
    for line in input_data:
        if line.strip() == 'Start':  # Or whatever test is needed
            break
    # Reads text until the end of the block:
    for line in input_data:  # This keeps reading the file
        if line.strip() == 'End':
            break
        print(line, end='')  # Line is extracted (or block_of_lines.append(line), etc.)

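In fact, you do not need to manipulate line numbers in order to read the data between the Start and End markers.

The logic ("read until…") is repeated in two blocks, but it is clear and efficient (other methods generally involve checking some state [before the block/inside the block/end of block reached], which incurs a time penalty).

【Discussion】:

So after the break statement, the next for loop reads lines from where the first for loop left off.
What about multiple occurrences of the block, with the same opening and closing text?
That's a good question. It is not as simple as adding a loop inside the with statement: the difficulty is to stop iterating when the file has been read completely, while combining this with the marker-detection logic. This deserves a separate question.
The first 'break' shouldn't be there, should it?
What do you suggest instead? Without the break, the first loop would read the whole file and nothing would be printed.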
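For the case raised in the discussion above, where the same file contains several Start/End blocks, one possible approach is a generator that keeps the two-loop idea but shares a single iterator between the loops. This is only a sketch: the name iter_blocks and its parameters are made up for illustration, and it assumes that a block left unterminated at the end of the file should still be returned.

```python
def iter_blocks(lines, start="Start", end="End"):
    """Yield each start...end block as a list of lines (markers excluded)."""
    lines = iter(lines)  # one shared iterator: the inner loop resumes where the outer one stopped
    for line in lines:
        if line.strip() != start:
            continue  # still outside a block
        block = []
        for line in lines:
            if line.strip() == end:
                break
            block.append(line.rstrip("\n"))
        yield block  # assumption: an unterminated final block is yielded too


sample = ["intro\n", "Start\n", "a b c\n", "End\n", "noise\n", "Start\n", "h i j\n", "End\n"]
blocks = list(iter_blocks(sample))
```

With a real file, the same function could be fed the file object directly, e.g. inside `with open('test.txt') as f:` by looping over `iter_blocks(f)`. Both loops simply end when the iterator is exhausted, so no extra end-of-file handling is needed.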

【Solution 2】:

Here is something that will work:

block = ""
found = False

with open("test.txt") as data_file:  # the file is closed automatically
    for line in data_file:
        if found:
            block += line
            if line.strip() == "End":
                break
        elif line.strip() == "Start":
            found = True
            block = line  # keep the marker's newline so the next line is not glued to it

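【Discussion】:

@BPm: This is an example of a "finite-state machine" (en.wikipedia.org/wiki/Finite_state_machine): the machine starts in the state "block not found yet" (found == False), then moves to the state "inside the block" (found == True) and, in this case, stops when "End" is found. State machines can be slightly inefficient (here, found must be checked for every line of the block), but they generally let one express the logic of more complicated algorithms cleanly.
+1, because this is a nice example of the perfectly valid state-machine approach.
Thanks for the "finite-state machine" reference!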

【Solution 3】:

You can use a regular expression for this quite easily. You can make it more robust as needed; below is a simple example.

>>> import re
>>> START = "some"
>>> END = "Hello"
>>> test = "this is some\nsample text\nthat has the\nwords Hello World\n"
>>> m = re.compile(r'%s.*?%s' % (START,END), re.S)
>>> m.search(test).group(0)
'some\nsample text\nthat has the\nwords Hello'

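【Discussion】:

+1: Very good idea: this is compact and probably very efficient, since the re module is fast. However, your regular expression should force the START and END tags to be alone on a line (^…$).
Thanks :) .. I don't think you can use ^ || $ when you use the re.S specification, because it includes newlines; I think you would need to explicitly say '%s\n.*?%s\n'
You certainly can use ^...$ in this case: just add the re.MULTILINE flag (docs.python.org/dev/library/re.html#module-contents).
You're right. For some reason I thought .S conflicted with .M when using ^/$, but it does not, so thanks for the comment.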
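To illustrate what the discussion above converges on, here is a minimal sketch that combines re.S with re.M so that the markers must sit alone on their own lines. The marker strings and sample text are made up for the example, and re.escape is added as a precaution in case a marker contains regex metacharacters; allowing trailing spaces or tabs after a marker is also an assumption.

```python
import re

START = "Start"
END = "End"
text = "intro\nStart\na b c\nh i j\nEnd\noutro\n"

# re.M lets ^ and $ match at line boundaries; re.S lets . match newlines.
pattern = re.compile(
    r"^%s[ \t]*\n(.*?)^%s[ \t]*$" % (re.escape(START), re.escape(END)),
    re.S | re.M,
)

# findall returns the captured text of every block, markers excluded.
blocks = pattern.findall(text)
```

Because the markers are anchored, lines such as "Started" or "Endless" are not mistaken for markers, which the unanchored pattern would do.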

【Solution 4】:

This should get you started:

started = False
collected_lines = []
with open(path, "r") as fp:  # path: name of the file to read
    for i, line in enumerate(fp):
        if line.rstrip() == "Start":
            started = True
            print("started at line", i)  # counts from zero!
            continue
        if started and line.rstrip() == "End":
            print("end at line", i)
            break
        if started:
            # process line
            collected_lines.append(line.rstrip())

enumerate takes any iterable and yields (index, item) pairs. For example:

  print(list(enumerate("a b c".split())))

prints

   [(0, 'a'), (1, 'b'), (2, 'c')]
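The question also mentions feeding the line numbers to linecache. A rough sketch of how that could look follows; the helper name find_marker_lines and the temporary demo file are invented for illustration. Note that linecache.getline counts lines from 1, so enumerate is started at 1 here.

```python
import linecache
import os
import tempfile


def find_marker_lines(path, start_marker="Start", end_marker="End"):
    """Return 1-based line numbers of the first start marker and the end marker after it."""
    start_no = None
    with open(path) as fp:
        for number, line in enumerate(fp, start=1):
            if start_no is None:
                if line.strip() == start_marker:
                    start_no = number
            elif line.strip() == end_marker:
                return start_no, number
    return None  # no complete block found


# Demo on a throwaway file (hypothetical content)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("header\nStart\na b c\nh i j\nEnd\nfooter\n")
    demo_path = tmp.name

start_no, end_no = find_marker_lines(demo_path)
# Lines strictly between the two markers, each with its trailing newline
block = [linecache.getline(demo_path, n) for n in range(start_no + 1, end_no)]
os.remove(demo_path)
```

This reads the file twice (once to find the markers, once through linecache), so collecting the lines during the first pass, as in the code above, is usually the simpler choice.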

Update:

The poster asked for a regular expression that matches lines such as "===" and "======":

import re
print(re.match("^=+$", "===")    is not None)  # True
print(re.match("^=+$", "======") is not None)  # True
print(re.match("^=+$", "=")      is not None)  # True
print(re.match("^=+$", "=abc")   is not None)  # False
print(re.match("^=+$", "abc=")   is not None)  # False

【Discussion】: