Answers: reading a corpus of text files in spaCy

【Question title】: read corpus of text files in spacy
【Posted】: 2019-02-27 07:01:21
【Question description】:

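Every example of using spaCy that I have seen just reads in a single text file (and a small one at that). How do you load a corpus of text files into spaCy?

I can do this with textacy by pickling all the texts in the corpus: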

docs =  textacy.io.spacy.read_spacy_docs('E:/spacy/DICKENS/dick.pkl', lang='en')

for doc in docs:
    print(doc)

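But it is not clear to me how to use this generator object (docs) for further analysis.

Also, I would rather use spaCy than textacy.

spaCy also fails to read in a single large file (about 2,000,000 characters).

Any help is appreciated...

Ravi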

【Comments】:

I'm as surprised as you are not to find a single example of this...

Tags: nlp multiprocessing generator pipeline spacy

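【Solution 1】:

If you can convert your corpus into a data frame in which each row corresponds to one document, you can essentially write a function that does what you want and then apply it to every row: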

df['new_column'] = df['document'].apply(lambda x: your_function(x))
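
One way to get to such a data frame is sketched below. This is only an illustration under assumptions: the helper name corpus_records, the "filename" and "document" keys, and UTF-8 encoding are not part of the original answer.

```python
from pathlib import Path
from typing import Dict, List


def corpus_records(folder: str, pattern: str = "*.txt") -> List[Dict[str, str]]:
    """Return one {"filename", "document"} record per matching file in `folder`.

    Hypothetical helper: the key names and the UTF-8 encoding are assumptions.
    """
    records = []
    for path in sorted(Path(folder).glob(pattern)):
        records.append(
            {"filename": path.name, "document": path.read_text(encoding="utf-8")}
        )
    return records
```

Passing the result to pandas.DataFrame(...) should then give a 'document' column with one row per file, which is the shape the apply call above expects.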

Alternatively, and I am not sure this is what you are after, you can try the following:

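import os
import spacy

nlp = spacy.load('en_core_web_lg')
corpus_dir = 'path\\to\\the\\corpus_folder'

def get_filename(path):
    # Collect the paths of all regular files directly inside the folder
    return [i.path for i in os.scandir(path) if i.is_file()]

files = get_filename(corpus_dir)
for filepath in files:
    # Assumes the files are UTF-8 encoded; adjust the encoding if they are not
    with open(filepath, 'r', encoding='utf-8') as file_to_read:
        some_text = file_to_read.read()
    print(os.path.basename(filepath))
    # Run the full pipeline on the text and print the resulting Doc
    print(nlp(some_text))
    # Tokenize only and drop stop words
    print([tok.text for tok in nlp.tokenizer(some_text) if not tok.is_stop])
    print('-'*40)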

Here is the output:

text1.txt
Read multiple files.
['Read', 'multiple', 'files', '.']
----------------------------------------
text2.txt
Read it, man.
['Read', ',', 'man', '.']
----------------------------------------

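Note, however, that the files themselves are read with plain Python, not by spaCy.

【Discussion】:

【Solution 2】:

So I finally got this working, and it shall be preserved here for posterity.

Start with a generator, here named iterator because I am currently too afraid to change anything for fear of it breaking again: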

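def path_iterator(paths):
    for p in paths:
        print("yielding")
        # Only the first 25 characters are read here (enough for a demo);
        # the with block makes sure each file is closed again
        with p.open("r") as f:
            text = f.read(25)
        yield text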

Get an iterator, generator, or list of paths:

from pathlib import Path

my_paths = Path("/data/train").glob("*.txt")

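This gets wrapped in our ... function from above and passed to nlp.pipe. In goes a generator, out comes a generator. The batch_size=5 is needed here; otherwise it falls back into the bad habit of reading all the files first, because the default batch size is much larger than a handful of files: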

doc = nlp.pipe(path_iterator(my_paths), batch_size=5)

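The important part, and the reason we are doing this, is that nothing has happened so far. We are not waiting for a thousand files to be processed or anything. That only happens on demand, when you start reading from doc: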

for d in doc:
    print("A document!")

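You will see alternating blocks of five (our batch_size, above) "yielding" and "A document!" lines. It is an actual pipeline now, and data starts coming out very soon after you start it.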
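
The alternating pattern can be reproduced without spaCy. The sketch below is a toy stand-in, not spaCy's actual implementation: toy_pipe and text_source are hypothetical names, and the "processing" is a placeholder. It only illustrates how pulling one batch at a time from a generator interleaves reading and output.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def text_source(n: int, log: List[str]) -> Iterator[str]:
    """Stand-in for path_iterator: logs "yielding" each time a text is pulled."""
    for i in range(n):
        log.append("yielding")
        yield f"text {i}"


def toy_pipe(texts: Iterable[str], batch_size: int) -> Iterator[str]:
    """Toy stand-in for nlp.pipe (NOT spaCy's real code): pull a batch, then emit it."""
    texts = iter(texts)
    while True:
        batch = list(islice(texts, batch_size))  # pulls at most batch_size texts
        if not batch:
            return
        for text in batch:
            yield text.upper()  # placeholder for the real processing step


log: List[str] = []
for doc in toy_pipe(text_source(4, log), batch_size=2):
    log.append("A document!")
print(log)
# ['yielding', 'yielding', 'A document!', 'A document!',
#  'yielding', 'yielding', 'A document!', 'A document!']
```

With batch_size=2 the log shows two reads followed by two documents, then the same again, which is the small-scale version of the blocks of five described above.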

And although I am currently running a version a bit too old for this, the coup de grâce is multiprocessing:

# For those with these new AMD CPUs with hundreds of cores
doc = nlp.pipe(path_iterator(my_paths), batch_size=5, n_process=64)
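
Putting the pieces of this answer together, a minimal sketch might look like the following. The names iter_texts and stream_docs are hypothetical, the files are assumed to be UTF-8, and nlp is assumed to be an already loaded spaCy pipeline whose version accepts the n_process argument.

```python
from pathlib import Path
from typing import Iterable, Iterator


def iter_texts(paths: Iterable[Path], encoding: str = "utf-8") -> Iterator[str]:
    """Lazily yield the full text of each file, closing every file after reading."""
    for path in paths:
        with path.open("r", encoding=encoding) as handle:
            text = handle.read()
        yield text


def stream_docs(nlp, paths: Iterable[Path], batch_size: int = 5, n_process: int = 1):
    """Feed the files to nlp.pipe lazily.

    Assumption: `nlp` is a loaded spaCy pipeline whose pipe() accepts n_process;
    older versions may reject that keyword.
    """
    return nlp.pipe(iter_texts(paths), batch_size=batch_size, n_process=n_process)
```

With a loaded pipeline this would be used as for doc in stream_docs(nlp, Path("/data/train").glob("*.txt")): ..., and nothing is read from disk until that loop starts. Passing nlp in as an argument keeps the helpers independent of any particular model.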

【Discussion】:

【Solution 3】:

You can only read one file at a time. This is what I usually do with my corpus files:

import glob
import spacy
nlp = spacy.load("en_core_web_sm")
path = 'your path here\\*.txt'

for file in glob.glob(path):
    with open(file, encoding='utf-8', errors='ignore') as file_in:
        text = file_in.read()
        lines = text.split('\n')
        for line in lines:
            line = nlp(line)
            for token in line:
                print(token)
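
Splitting a file into smaller pieces, as above, is also one way around the large-file problem from the question, since spaCy refuses texts longer than nlp.max_length. The sketch below is an illustration only: iter_chunks and its max_chars default are hypothetical, and it groups whole lines into chunks so that sentences spanning several short lines stay together more often than with strict line-by-line processing.

```python
from typing import Iterator, List


def iter_chunks(text: str, max_chars: int = 100_000) -> Iterator[str]:
    """Yield pieces of `text` of at most `max_chars` characters.

    Chunks break at line boundaries where possible; a single line longer
    than `max_chars` is cut into fixed-size pieces.
    """
    buffer: List[str] = []
    size = 0
    for line in text.splitlines(keepends=True):
        pieces = [line[i:i + max_chars] for i in range(0, len(line), max_chars)]
        for piece in pieces:
            # Flush the current chunk before it would grow past the limit
            if buffer and size + len(piece) > max_chars:
                yield "".join(buffer)
                buffer, size = [], 0
            buffer.append(piece)
            size += len(piece)
    if buffer:
        yield "".join(buffer)
```

The chunks can then be streamed through the pipeline with for doc in nlp.pipe(iter_chunks(text)): .... Raising nlp.max_length is an alternative when there is enough memory, at the cost of processing the whole file as one very large document.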

【Discussion】: