Extract text from .txt files and save it to a .csv file with columns and a header

【Question title】: Extract text from .txt file and save into .csv files with columns and header
【Posted】: 2019-06-17 21:48:14
【Question】:

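I have roughly 100 text files, each containing 1-2 paragraphs of clinical notes. The files are named doc_1.txt through doc_179.txt. I want to save the text of every file into a single .csv file with 2 columns and a header (id, text). The id column holds the name of each file.

For example, doc_1 is the name of the note file and becomes the id. The text inside doc_1 is stored in the text column. The desired result looks like this: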


|   id  | text |
|:-----:|:----:|
| doc_1 | abcf |
| doc_2 | efrf |
| doc_3 | gvni |

So far I have only looked through the texts and have not settled on the most practical way to get this result.

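【Comments】:

Have you looked at the csv library in Python 3? It lets you write each row of data to a csv file, and you can specify the delimiter.
@jhelphenstine No, I have not tried the csv library. Looking at similar code, I think I have to append the file name and the text.

Tags: python-3.x pandas csv dataframe nlp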
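
The csv-library route suggested in the comment could look roughly like the sketch below. The function name `write_notes_csv`, the UTF-8 encoding, and the alphabetical file order are assumptions made for illustration; none of them come from the thread.

```python
import csv
from pathlib import Path


def write_notes_csv(folder, out_path):
    """Write one (id, text) row per .txt file in `folder` to `out_path`."""
    # newline="" lets the csv module control line endings itself
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["id", "text"])  # header row
        for path in sorted(Path(folder).glob("*.txt")):
            # path.stem is the file name without its extension, e.g. "doc_1"
            writer.writerow([path.stem, path.read_text(encoding="utf-8")])
```

Calling `write_notes_csv(".", "result.csv")` would process the .txt files in the current directory. The writer quotes any text that contains commas or line breaks, so multi-paragraph notes stay in a single cell.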

【Solution 1】:

Assume you have a list of files.

import pandas as pd # remove if already imported

# ...

files_list = ["doc_1.txt", "doc_2.txt", ..., "doc_179.txt"]

Create a DataFrame with the required columns:

df = pd.DataFrame(columns=["id", "text"])

Loop over the files to read their text, then save the result to a csv file:

for file in files_list:
    with open(file) as f:
        txt = f.read()  # retrieve the text in the file
    file_name = file.split(".")[0]  # remove the file extension
    # DataFrame.append was removed in pandas 2.0, so add the row with .loc instead
    df.loc[len(df)] = [file_name, txt]  # values follow the column order: id, text


df.to_csv("result.csv", sep="|", index=False) # export DataFrame into csv file

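Feel free to change the name of the output csv file (result.csv) and the character used for sep.

It is strongly recommended not to use a character that already appears in the text of any file. (For example, if any of the text files already contains a comma, do not use , as the value of sep.)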
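
As a side note on the separator advice: `to_csv` uses minimal quoting by default, so a field that contains the separator or a line break is wrapped in quotes and still reads back intact. The sketch below checks this on two made-up rows (the sample texts are invented for illustration, not taken from the clinical notes). Choosing a rare separator remains useful for keeping the raw file easy to read.

```python
import io

import pandas as pd

# Hypothetical sample rows; the texts deliberately contain the separator and a newline.
df = pd.DataFrame(
    {
        "id": ["doc_1", "doc_2"],
        "text": ["pain, fever | cough", "first line\nsecond line"],
    }
)

buffer = io.StringIO()
df.to_csv(buffer, sep="|", index=False)  # default quoting is csv.QUOTE_MINIMAL

buffer.seek(0)
restored = pd.read_csv(buffer, sep="|")  # read it back with the same separator

print(buffer.getvalue())
print(restored["text"].tolist() == df["text"].tolist())
```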

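【Comments】:

Thank you so much! If I could upvote more than once, I would.

【Solution 2】:

I want to post an updated version of the solution I was given, adapted to solve my problem.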

import glob

import pandas as pd

# collect every .txt file in the current directory
files_list = glob.glob("*.txt")

df = pd.DataFrame(columns=["id", "text"])

for file in files_list:
    with open(file) as f:
        txt = f.read()  # retrieve the text in the file
    file_name = file.split(".")[0]  # remove the file extension
    # DataFrame.append was removed in pandas 2.0, so add the row with .loc instead
    df.loc[len(df)] = [file_name, txt]  # values follow the column order: id, text
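
One thing to be aware of with the glob approach: `glob.glob` returns files in arbitrary order, and a plain alphabetical sort puts doc_10 before doc_2. If the rows should follow the document numbers, something like the sketch below may help. The helper names `numeric_suffix` and `read_notes` and the UTF-8 encoding are assumptions for illustration.

```python
import re
from pathlib import Path


def numeric_suffix(path):
    """Return the trailing number in a name like doc_12.txt, or -1 if there is none."""
    match = re.search(r"(\d+)$", path.stem)
    return int(match.group(1)) if match else -1


def read_notes(folder):
    """Return a list of {"id", "text"} dicts ordered doc_1, doc_2, ..., doc_10."""
    paths = sorted(Path(folder).glob("*.txt"), key=numeric_suffix)
    return [
        # Path.stem drops only the final extension, so dots elsewhere in the name survive
        {"id": p.stem, "text": p.read_text(encoding="utf-8")}
        for p in paths
    ]
```

The returned list can be passed straight to `pd.DataFrame(read_notes("."), columns=["id", "text"])`, which builds the table in one step instead of adding rows inside the loop.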

【Comments】: