Extracting wanted words from multiple text files (Python 3.6): answers

【Question title】: Extracting wanted words from multiple text files (Python 3.6)
【Posted】: 2020-12-15 14:12:29
【Question description】:

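I have a folder containing about 100,000 txt files. I am trying to read all of the files and create a DataFrame with two columns, id and text. For the id, I take the identifier from the file name: for example, for the file BL2334_uyhjghbvbvhf I extract everything before the underscore, so in this example the id is BL2334. Before creating the DataFrame, I would like to extract only the words that follow Detected Text: ..., so for the file below the words BUCK, NIP, and Preerfal Deet Attracter.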

My file:

Id: 02398123-a642-4e3f-88a7
Type: LINE
Detected Text: BUCK
Confidence: 77.965172
Id: c85bbbe
Type: LINE
Detected Text: NIP
Confidence: 97.186539
Id: 28926a7a-78024c80-b9c5
Type: LINE
Detected Text: Preerfal Deet Attracter
Confidence: 47.749722

My code:

import os
import pandas as pd

path = r'C:\Users\example\Documents\MyFolder'

file_list = []

for (root, dirs, files) in os.walk(path, topdown=True):
        file_list.append([root + "\\" + file for file in files])
def flatten(file_list):
    result_list_files = []
    for element in file_list:
        if isinstance(element, str):
            result_list_files.append(element)
        else:
            for element_1 in flatten(element):
                result_list_files.append(element_1)
    return result_list_files 
result_flatten = flatten(file_list)

final_df = pd.DataFrame()

for file in result_flatten:
    temp_df = pd.DataFrame()
    id = file.split('\\')[-1].split('_')[0]
    temp_df['id'] = [id]
    temp_df['text'] = [open(file,encoding="utf8").read()]
    final_df = pd.concat([final_df, temp_df], ignore_index = True)

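【Question discussion】:

What is wrong with your code?
I get an output with two columns, id and text. It takes a long time, and the text column contains everything from each file; I only need the words from the Detected Text: lines.

Tags: python python-3.x pandas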

【Solution 1】:

Just expanding on @Luca Angioioni's solution, you could use something like this:

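import os
import re

import pandas as pd

# `path` is the folder defined in the question.
data = {'id': [], 'text': []}

for (root, dirs, files) in os.walk(path):
    for file in files:
        # The id is everything before the first underscore in the file name.
        data['id'].append(file.split('_')[0])
        with open(os.path.join(root, file), encoding="utf8") as f:
            # Capture whatever follows "Detected Text: " on each such line.
            data['text'].append(re.findall(r'Detected Text: (.*)\n', f.read()))

df = pd.DataFrame(data)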

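This returns one row per id, with the list of matches in the text column. You can always use df.explode('text') to split the matches into their own rows, at the cost of repeating the id on each of them.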
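To make the two shapes concrete, here is a minimal sketch with made-up data. The id BL2335 and the per-file matches are hypothetical, chosen only for illustration. Note that DataFrame.explode needs pandas 0.25 or later.

```python
import pandas as pd

# Hypothetical data in the same shape the loop above produces:
# one row per file, with a list of matches in the 'text' column.
data = {
    'id': ['BL2334', 'BL2335'],
    'text': [['BUCK', 'NIP', 'Preerfal Deet Attracter'], ['NIP']],
}
df = pd.DataFrame(data)

# One row per match; the id is repeated for every match from the same file.
exploded = df.explode('text').reset_index(drop=True)
print(exploded)
```

Whether the list-per-row or the exploded shape is more convenient probably depends on what you do with the words afterwards.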

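If for some reason you do not want to use re, you can replace the data['text'].append(...) line inside the with block with:

data['text'].append([line.split(':', 1)[1].strip() for line in f if line.startswith('Detected Text')])

That should work as well. Splitting with a limit of 1 keeps any colons that appear inside the detected text itself.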

【Discussion】:

【Solution 2】:

To get only the Detected Text part, I would use a regular expression. Example:

import re

text = """
Id: 02398123-a642-4e3f-88a7
Type: LINE
Detected Text: BUCK
Confidence: 77.965172
Id: c85bbbe
Type: LINE
Detected Text: NIP
Confidence: 97.186539
Id: 28926a7a-78024c80-b9c5
Type: LINE
Detected Text: Preerfal Deet Attracter
Confidence: 47.749722
"""

pattern = re.compile(r"Detected Text: (.*)\n")
match = pattern.findall(text)  # match becomes ['BUCK', 'NIP', 'Preerfal Deet Attracter']
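One caveat with the pattern above: it needs a newline after the captured text, so a Detected Text line at the very end of a file with no trailing newline would be skipped. In the sample file every Detected Text line is followed by a Confidence line, so this may never matter in practice. If it does, a possible variant is to anchor on line boundaries with re.MULTILINE. The sample string below is hypothetical:

```python
import re

# Hypothetical sample whose last line has no trailing newline.
sample = "Detected Text: BUCK\nConfidence: 77.965172\nDetected Text: NIP"

# Original pattern: requires a "\n" after the captured text.
newline_pattern = re.compile(r"Detected Text: (.*)\n")
# Variant: anchor on line start/end instead of a literal newline.
line_pattern = re.compile(r"^Detected Text: (.*)$", re.MULTILINE)

missed = newline_pattern.findall(sample)  # only sees the first line
found = line_pattern.findall(sample)      # sees both lines
print(missed, found)
```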

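One thing that makes your code slow is that you keep allocating new DataFrames and then concatenating them. One way to fix this is to first build a dictionary with key = id, value = text, and then convert it to a DataFrame with the from_dict method (see the pandas documentation). Alternatively, you can use a list of tuples like (id, text), and then: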
```
import pandas as pd

# Each tuple is one row: (id, text).
tuples = [
    ("id1", "some text"),
    ("id2", "some other text"),
    # ... one tuple per file
]
final_df = pd.DataFrame(tuples, columns=['id', 'text'])
```
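Putting the pieces together, the whole job could look roughly like the sketch below. It assumes the file layout from the question. The function name build_frame and the constant PATTERN are made up for this example, and the text column holds a list of matches per file.

```python
import os
import re

import pandas as pd

# Hypothetical helper names; the pattern matches whole "Detected Text: ..." lines.
PATTERN = re.compile(r"^Detected Text: (.*)$", re.MULTILINE)


def build_frame(folder):
    """Collect one (id, words) row per file under `folder`, then build one DataFrame."""
    rows = []
    for root, dirs, files in os.walk(folder):
        for name in files:
            file_id = name.split('_')[0]  # everything before the first underscore
            with open(os.path.join(root, name), encoding="utf8") as f:
                words = PATTERN.findall(f.read())
            rows.append((file_id, words))
    # The DataFrame is created once, after the loop, instead of once per file.
    return pd.DataFrame(rows, columns=['id', 'text'])
```

With the folder from the question this would be called as final_df = build_frame(path). Appending to a plain list is cheap, whereas pd.concat inside the loop copies the growing DataFrame on every iteration, which is likely where most of the time goes with 100,000 files.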

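【Discussion】:

Sorry, I am not sure I understand your solution. Do I need to create the dictionary before final_df = pd.DataFrame()?
@DinkoJantoš No. Inside your for file in result_flatten: loop, do not create a DataFrame; append to a dictionary or to a list of tuples instead, whichever you prefer. Then convert it to a DataFrame only once, after the for loop ends. I have updated the answer with an example.
Thanks for your help.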