Extract values between two strings in a text file using Python: answers

【Question title】: Extract Values between two strings in a text file using python
【Posted】: 2013-09-18 06:12:44
【Question description】:

Suppose I have a text file with the following contents:

fdsjhgjhg
fdshkjhk
Start
Good Morning
Hello World
End
dashjkhjk
dsfjkhk

Now I need to write Python code that reads the text file and copies the content between Start and End to another file.

I wrote the following code.

inFile = open("data.txt")
outFile = open("result.txt", "w")
buffer = []
keepCurrentSet = True
for line in inFile:
    buffer.append(line)
    if line.startswith("Start"):
        #---- starts a new data set
        if keepCurrentSet:
            outFile.write("".join(buffer))
        #now reset our state
        keepCurrentSet = False
        buffer = []
    elif line.startswith("End"):
        keepCurrentSet = True
inFile.close()
outFile.close()

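I am not getting the expected output; I am just getting Start. What I want is all the lines between Start and End, excluding Start and End themselves.

【Question discussion】:

Are these text files large?

Tags: python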

【Solution 1】:

In case your text file contains multiple "Start"s and "End"s, this copies all of the data together, excluding every "Start" and "End" line.

with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    copy = False
    for line in infile:
        if line.strip() == "Start":
            copy = True
            continue
        elif line.strip() == "End":
            copy = False
            continue
        elif copy:
            outfile.write(line)

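【Discussion】:

Thanks for the reply. I applied the same thing in my real scenario and got the following error: D:\Python>Python.exe First.py Traceback (most recent call last): File "First.py", line 3, in <module> for line in infile: File "D:\Python\lib\encodings\cp1252.py", line 23, in decode return codecs.charmap_decode(input,self.errors,decoding_table)[0] UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 4591: character maps to <undefined> Can you help me resolve this?
@user2790219: That is not an error in this code. If you can post the text file you are using, someone may be able to help (I think you should ask a new question).
This code won't include the strings "Start" and "End", only what is between them. How would you include the surrounding strings?
@johnnydrama: just add outfile.write lines in the first two if blocks.
That is a good observation. However, the code provided is meant to take all of the data from multiple instances of "Start" and "End". I have updated my answer to state that assumption explicitly.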
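The suggestion about keeping the marker lines can be sketched as a small generator. This is only an illustration, not the answerer's code: the name `copy_between` and the `include_markers` parameter are hypothetical. Unlike a bare `outfile.write` in the `End` branch, this version writes an `End` line only if a block is actually open.

```python
def copy_between(lines, start="Start", end="End", include_markers=False):
    """Yield the lines found between start and end marker lines."""
    copying = False
    for line in lines:
        stripped = line.strip()
        if stripped == start:
            copying = True
            if include_markers:
                yield line          # keep the "Start" line itself
        elif stripped == end:
            if copying and include_markers:
                yield line          # keep the "End" line that closes a block
            copying = False
        elif copying:
            yield line


sample = ["junk\n", "Start\n", "Good Morning\n", "Hello World\n", "End\n", "junk\n"]
without_markers = list(copy_between(sample))
with_markers = list(copy_between(sample, include_markers=True))
```

With real files, something like `outfile.writelines(copy_between(infile, include_markers=True))` inside the `with` block should work the same way, since a file object is also an iterable of lines.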

【Solution 2】:

If the text file is not necessarily large, you can read the whole content of the file and then use a regular expression:

import re
with open('data.txt') as myfile:
    content = myfile.read()

text = re.search(r'Start\n.*?End', content, re.DOTALL).group()
with open("result.txt", "w") as myfile2:
    myfile2.write(text)

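【Discussion】:

Regex is overkill for this problem. Also, you don't handle the case where one of the lines is Ender's Game (the End in the regex needs to be followed by a newline). Moreover, the use of \n is not cross-platform, since Windows uses \r\n as the line ending.
@inspectorG4dget In my experience, regular expressions are never overkill. If you are proficient with the dialect, they behave predictably. Using them helps keep your skills sharp, which is good because they are powerful enough to handle nearly any text manipulation. That said, your answer is elegant and it rocks, +1.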
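The objections above (a line such as Ender's Game, and \r\n line endings) could be addressed by anchoring both markers to whole lines. The following is a rough sketch, not the answerer's code; `BLOCK` and `extract_blocks` are hypothetical names. Note that a file opened in text mode with Python's default newline handling already has \r\n translated to \n, so the optional `\r` only matters for strings obtained some other way.

```python
import re

# ^ and $ match at line boundaries because of re.MULTILINE;
# re.DOTALL lets the lazy group span several lines.
BLOCK = re.compile(r"^Start[ \t]*\r?\n(.*?)^End[ \t]*\r?$",
                   re.DOTALL | re.MULTILINE)


def extract_blocks(content):
    """Return the text of every block, excluding the marker lines."""
    return BLOCK.findall(content)


sample = "junk\nStart\nGood Morning\nEnder's Game\nEnd\njunk\n"
blocks = extract_blocks(sample)
```

Because the marker must fill its whole line, neither "Starting awesome sentence" nor "Ender's Game" is mistaken for a marker.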

【Solution 3】:

I'm not a Python expert, but this code should do the job.

inFile = open("data.txt")
outFile = open("result.txt", "w")
keepCurrentSet = False
for line in inFile:
    if line.startswith("End"):
        keepCurrentSet = False

    if keepCurrentSet:
        outFile.write(line)

    if line.startswith("Start"):
        keepCurrentSet = True
inFile.close()
outFile.close()

【Discussion】:

【Solution 4】:

Using itertools.dropwhile, itertools.takewhile and itertools.islice:

import itertools

with open('data.txt') as f, open('result.txt', 'w') as fout:
    it = itertools.dropwhile(lambda line: line.strip() != 'Start', f)
    it = itertools.islice(it, 1, None)
    it = itertools.takewhile(lambda line: line.strip() != 'End', it)
    fout.writelines(it)

Update: as inspectorG4dget commented, the code above copies only the first block. To copy multiple blocks, use the following:

import itertools

with open('data.txt', 'r') as f, open('result.txt', 'w') as fout:
    while True:
        it = itertools.dropwhile(lambda line: line.strip() != 'Start', f)
        if next(it, None) is None: break
        fout.writelines(itertools.takewhile(lambda line: line.strip() != 'End', it))

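【Discussion】:

Two problems: (1) \n is not cross-platform; Windows uses \r\n. (2) This does not handle multiple blocks at all; it copies only the first block.
@inspectorG4dget, thank you for the comment. I updated the answer.

【Solution 5】:

Move the outFile.write call into the second if: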

inFile = open("data.txt")
outFile = open("result.txt", "w")
buffer = []
for line in inFile:
    if line.startswith("Start"):
        buffer = ['']
    elif line.startswith("End"):
        outFile.write("".join(buffer))
        buffer = []
    elif buffer:
        buffer.append(line)
inFile.close()
outFile.close()

【Discussion】:

【Solution 6】:

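import re

inFile = open("data.txt")
outFile = open("result.txt", "w")
# read the whole file into one string
buffer1 = inFile.read()

# re.DOTALL lets "." match newlines, so a block can span several lines;
# the lookarounds keep "Start" and "End" out of the captured text
buffer1 = re.findall(r"(?<=Start\n)(.*?)(?=End)", buffer1, re.DOTALL)
outFile.write("".join(buffer1))
inFile.close()
outFile.close()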

【Discussion】:

This will fail if the file contains the lines Starting awesome sentence and Ender's Game.

【Solution 7】:

I would handle it like this:

inFile = open("data.txt")
outFile = open("result.txt", "w")

data = inFile.readlines()

outFile.write("".join(data[data.index('Start\n')+1:data.index('End\n')]))
inFile.close()
outFile.close()

【Discussion】:

Very memory-inefficient in the worst case, and it cannot handle multiple blocks.
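The multiple-block limitation can be worked around by passing a start offset to `list.index`. This is a minimal sketch under the assumption that every marker line is exactly `'Start\n'` or `'End\n'`; `blocks_by_index` is a hypothetical name. It still loads the whole file, so the memory remark above continues to apply.

```python
def blocks_by_index(data, start="Start\n", end="End\n"):
    """Return a list of blocks, each a list of lines between the markers."""
    blocks = []
    pos = 0
    while True:
        try:
            begin = data.index(start, pos)          # next "Start" at or after pos
            finish = data.index(end, begin + 1)     # first "End" after that "Start"
        except ValueError:
            break                                   # no further complete block
        blocks.append(data[begin + 1:finish])
        pos = finish + 1
    return blocks


data = ["x\n", "Start\n", "a\n", "End\n", "y\n", "Start\n", "b\n", "c\n", "End\n"]
result = blocks_by_index(data)
```

Catching `ValueError` also covers a file that has no markers at all, a case in which the one-line slice would raise.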

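【Solution 8】:

Use this if you want to keep the start and end lines/keywords when extracting the lines between two strings.

Below is the code snippet I used to extract SQL statements from a shell script.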

def process_lines(in_filename, out_filename, start_kw, end_kw):
    try:
        inp = open(in_filename, 'r', encoding='utf-8', errors='ignore')
        out = open(out_filename, 'w+', encoding='utf-8', errors='ignore')
    except FileNotFoundError as err:
        print(f"File {in_filename} not found", err)
        raise
    except OSError as err:
        print(f"OS error occurred trying to open {in_filename}", err)
        raise
    except Exception as err:
        print(f"Unexpected error opening {in_filename} is",  repr(err))
        raise
    else:
        with inp, out:
            copy = False
            for line in inp:
                # first IF block to handle if the start and end on same line
                if line.lstrip().lower().startswith(start_kw) and line.rstrip().endswith(end_kw):
                    copy = True
                    if copy:  # keep the starts with keyword
                        out.write(line)
                    copy = False
                    continue
                elif line.lstrip().lower().startswith(start_kw):
                    copy = True
                    if copy:  # keep the starts with keyword
                        out.write(line)
                    continue
                elif line.rstrip().endswith(end_kw):
                    if copy:  # keep the ends with keyword
                        out.write(line)
                    copy = False
                    continue
                elif copy:
                    # write
                    out.write(line)


if __name__ == '__main__':
    infile = "/Users/testuser/Downloads/testdir/BTEQ_TEST.sh"
    outfile = f"{infile}.sql"
    statement_start_list = ['database', 'create', 'insert', 'delete', 'update', 'merge', 'delete']
    statement_end = ";"
    process_lines(infile, outfile, tuple(statement_start_list), statement_end)

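【Discussion】:

【Solution 9】:

Files are iterators in Python, which means you don't need to keep a "flag" variable to tell you which lines to write. You can simply start another loop when you reach the start line, and break out of it when you reach the end line: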

with open("data.txt") as in_file, open("result.txt", 'w') as out_file:
    for line in in_file:
        if line.strip() == "Start":
            for line in in_file:
                if line.strip() == "End":
                    break
                out_file.write(line)

【Discussion】:
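The nested-loop idea relies on both loops sharing one iterator. Here is a small sketch that makes this explicit so it can also be tried on in-memory data; `between_markers` is a hypothetical name and `io.StringIO` merely stands in for a real file.

```python
import io


def between_markers(lines, start="Start", end="End"):
    """Yield the lines between marker lines, without a flag variable."""
    it = iter(lines)  # one shared iterator, as a file object already is
    for line in it:
        if line.strip() == start:
            # The inner loop continues from where the outer loop stopped.
            for line in it:
                if line.strip() == end:
                    break
                yield line


text = io.StringIO("junk\nStart\nA\nEnd\nmid\nStart\nB\nEnd\n")
result = list(between_markers(text))
```

The `iter()` call matters for plain lists: iterating a list directly in both loops would restart the inner loop from the first element, whereas a file object is its own iterator.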