Creating multiple txt files from a text file - Answers

【Question title】: Creating multiple txt files from a text file
【Posted】: 2021-09-21 05:00:33
【Question】:

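I'm trying to take the Federalist Papers from Project Gutenberg and convert them into text documents. The problem with Project Gutenberg is that the papers aren't separated: the whole thing reads in as one big text file, so I have to tell Python to create a new text file for each Federalist Paper (each one is contained between the phrases "FEDERALIST No. _" and "PUBLIUS").

My code mostly works, but the problem I'm running into is with the first text file it creates (named 1.txt by my code). When I open this file, it contains the entire raw text scraped from Project Gutenberg, not just the text of Federalist 1. The file 2.txt then contains only Federalist 1. The text is cut correctly; it is just offset by one from the file it should be in.

I suspect my problem is somewhere in the for-loop, possibly in how I initialize my variables, but I can't see what is causing this error.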

# Importing the doc and creating individual txt files for each federalist paper
from urllib import request

url = "https://www.gutenberg.org/files/1404/1404.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')

# finding the start and end of the portion of the doc we care about and subsetting
raw.find("FEDERALIST No. 1")
raw.rfind("PUBLIUS")
raw = raw[821:1167459]
# fixing the doc again... yeah this ain't clean but it's right
raw = raw[0:1166638]
# save as txt to work with below
print(raw, file=open("all.txt", "a"))

# looping over the whole text to break it into individual text docs by each
# federalist paper
with open("all.txt") as fo:
    op = ''
    start = 0
    cntr = 1
    paper = 1
    for x in fo.read().split("\n"):  # looping over the text by each line split
        if x == 'FEDERALIST No. ' + str(paper):  # creating new txt if we
                                                 # encounter a new fed paper
            if start == 1:
                with open(str(cntr) + '.txt', 'w') as opf:
                    opf.write(op)
                    opf.close()
                    op = ''
                    cntr += 1
                    paper += 1
            else:
                start = 1
        else:
            if op == '':
                op = x
            else:
                op = op + '\n' + x
    fo.close()

【Question comments】:

If you open a file using with, you don't need .close(), because the file is closed automatically when the statement exits.
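The comment above can be checked with a minimal sketch. The scratch file name `demo.txt` and the variable names are assumptions made up for this illustration; they are not part of the question's code.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w") as f:
    f.write("hello")
    closed_inside = f.closed  # the file is still open inside the block

closed_after = f.closed  # the file is closed once the block has exited

print(closed_inside, closed_after)  # prints: False True
```

So the explicit opf.close() and fo.close() calls in the question are redundant, though harmless.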

Tags: python loops web-scraping text txt

【Solution 1】:

You can use the re module to split the text:

import re
import requests


url = "https://www.gutenberg.org/files/1404/1404.txt"
text = requests.get(url).text

r = re.compile(
    r"^(FEDERALIST No\..*?)(?=^PUBLIUS|^FEDERALIST)", flags=re.M | re.S
)
for i, section in enumerate(r.findall(text), 1):
    with open("{}.txt".format(i), "w") as f_out:
        f_out.write(section)

This creates 85 .txt files, each containing the section for one paper.
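To see what the pattern captures without downloading anything, here is a small sketch that runs the same regular expression on an in-memory sample. The `sample` string is invented for illustration and only mimics the "FEDERALIST No. _" ... "PUBLIUS" layout described in the question; the real Gutenberg text may differ in details such as line endings.

```python
import re

# Invented two-paper sample that mimics the layout described in the question.
sample = (
    "FEDERALIST No. 1\n"
    "General Introduction\n"
    "Text of the first paper.\n"
    "PUBLIUS\n"
    "\n"
    "FEDERALIST No. 2\n"
    "Text of the second paper.\n"
    "PUBLIUS\n"
)

# Same pattern as in the answer: start at a line beginning with
# "FEDERALIST No." and lazily consume text until the next line that
# begins with "PUBLIUS" or "FEDERALIST" (the lookahead does not consume it).
pattern = re.compile(
    r"^(FEDERALIST No\..*?)(?=^PUBLIUS|^FEDERALIST)", flags=re.M | re.S
)

# With a single capture group, findall returns a list of the captured strings.
sections = pattern.findall(sample)

for i, section in enumerate(sections, 1):
    print(i, repr(section))
```

re.M makes ^ match at the start of every line, and re.S lets . match newlines, so one match can span a whole paper. Because the lookahead stops before "PUBLIUS", the signature line itself is not included in each section.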

【Comments】: