Load a column from a CSV file into spaCy

【Question title】: Load column in csv file into spaCy
【Posted】: 2017-09-13 02:18:48
【Question】:

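I'm new to spaCy and NLTK as a whole, so I apologize in advance if this seems like a dumb question.

Based on the spaCy tutorial, I have to use the following command to load text into a doc.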

doc = nlp(u'Hello, world. Natural Language Processing in 10 lines of code.')

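However, I have a lot of text stored in tabular format in SQL Server or Excel. It basically has two columns: the first holds a unique identifier and the second holds a short piece of text.

How do I load these into spaCy? Do I need to convert them to a NumPy array or a pandas DataFrame and then load that into the doc?

Thanks in advance for your help!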

【Comments】:

Tags: python pandas numpy nltk spacy

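【Solution 1】:

I think Alexis's comment about using pandas .apply() is the best answer. It worked well for me:

import pandas as pd
import spacy

nlp = spacy.load('en')  # on spaCy 3+, load an installed model package such as 'en_core_web_sm'

df = pd.read_csv('doc filename.txt')
df['text_as_spacy_objects'] = df['text column name'].apply(nlp)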

【Comments】:

【Solution 2】:

Given a tab-separated file like this:

$ cat test.tsv
DocID   Text    WhateverAnnotations
1   Foo bar bar dot dot dot
2   bar bar black sheep dot dot dot dot

$ cut -f2 test.tsv
Text
Foo bar bar
bar bar black sheep

In code:

$ python
>>> import pandas as pd
>>> pd.read_csv('test.tsv', delimiter='\t')
   DocID                 Text WhateverAnnotations
0      1          Foo bar bar         dot dot dot
1      2  bar bar black sheep     dot dot dot dot
>>> df = pd.read_csv('test.tsv', delimiter='\t')
>>> df['Text']
0            Foo bar bar
1    bar bar black sheep
Name: Text, dtype: object

Using pipe in spaCy:

>>> import spacy
>>> nlp = spacy.load('en')
>>> for parsed_doc in nlp.pipe(iter(df['Text']), batch_size=1, n_threads=4):
...     print(parsed_doc[0].text, parsed_doc[0].tag_)
... 
Foo NNP
bar NN

Using pandas apply() on the column:

>>> df['Parsed'] = df['Text'].apply(nlp)

>>> df['Parsed'].iloc[0]
Foo bar bar
>>> type(df['Parsed'].iloc[0])
<class 'spacy.tokens.doc.Doc'>
>>> df['Parsed'].iloc[0][0].tag_
'NNP'
>>> df['Parsed'].iloc[0][0].text
'Foo'

Benchmarking the two approaches.

First, duplicate each of the two rows 1,000,000 times to get 2 million rows:

$ cat test.tsv 
DocID   Text    WhateverAnnotations
1   Foo bar bar dot dot dot
2   bar bar black sheep dot dot dot dot

$ tail -n 2 test.tsv > rows2

$ perl -ne 'print "$_" x1000000' rows2 > rows2000000

$ cat test.tsv rows2000000 > test-2M.tsv

$ wc -l test-2M.tsv 
 2000003 test-2M.tsv

$ head test-2M.tsv 
DocID   Text    WhateverAnnotations
1   Foo bar bar dot dot dot
2   bar bar black sheep dot dot dot dot
1   Foo bar bar dot dot dot
1   Foo bar bar dot dot dot
1   Foo bar bar dot dot dot
1   Foo bar bar dot dot dot
1   Foo bar bar dot dot dot
1   Foo bar bar dot dot dot
1   Foo bar bar dot dot dot

[nlppipe.py]:

import time

import pandas as pd
import spacy


df = pd.read_csv('test-2M.tsv', delimiter='\t')
nlp = spacy.load('en')

start = time.time()
for parsed_doc in nlp.pipe(iter(df['Text']), batch_size=1000, n_threads=4):
    x = parsed_doc[0].tag_
print(time.time() - start)

[dfapply.py]:

import time

import pandas as pd
import spacy


df = pd.read_csv('test-2M.tsv', delimiter='\t')
nlp = spacy.load('en')

start = time.time()
df['Parsed'] = df['Text'].apply(nlp)

for doc in df['Parsed']:
    x = doc[0].tag_
print(time.time() - start)
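
To compare the two scripts on equal footing, it can help to time both loops in one process with the same timer. The following is only a sketch of that harness: `fake_nlp`, `fake_pipe`, and `time_call` are hypothetical names, and the two placeholders stand in for a loaded spaCy pipeline and its `pipe()` method so the example runs without spaCy installed. The numbers it prints say nothing about spaCy itself.

```python
import time


def fake_nlp(text):
    """Placeholder for a loaded spaCy pipeline; it only splits on whitespace."""
    return text.split()


def fake_pipe(texts):
    """Placeholder for nlp.pipe(): lazily process a stream of texts."""
    for text in texts:
        yield fake_nlp(text)


def time_call(func):
    """Return (elapsed_seconds, result) for a zero-argument callable."""
    start = time.perf_counter()
    result = func()
    return time.perf_counter() - start, result


# Small stand-in for the 2M-row 'Text' column.
texts = ["Foo bar bar", "bar bar black sheep"] * 1000

# Same work in both cases: take the first token of every processed text.
pipe_seconds, pipe_first = time_call(lambda: [doc[0] for doc in fake_pipe(texts)])
apply_seconds, apply_first = time_call(lambda: [fake_nlp(t)[0] for t in texts])

print("pipe:  %.4f s" % pipe_seconds)
print("apply: %.4f s" % apply_seconds)
```

To time the real thing, replace the two lambdas with the loop bodies from nlppipe.py and dfapply.py. Absolute times will still vary by machine.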

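【Comments】:

It would be useful if you posted the timings for each so we can see how they compare, even if the actual numbers differ from machine to machine.

【Solution 3】: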

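This should be straightforward: you can read the text from your database using whatever method you like (a pandas DataFrame, a CSV reader, etc.) and then iterate over it.

It ultimately depends on what you want to do and how you want to process your text. If you want to process each text individually, just iterate over your data row by row: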

# rows: any iterable of (identifier, text) pairs, e.g. from a CSV reader
for doc_id, line in rows:
    doc = nlp(line)
    # do something with each text
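
As a rough illustration of the row-by-row approach with a plain CSV reader, here is a minimal sketch. The sample data, the column names `DocID` and `Text`, and the helper `process_rows` are assumptions made for the example. `fake_nlp` is a placeholder that only splits on whitespace, so the snippet runs without spaCy; in real use you would pass the loaded `nlp` object instead.

```python
import csv
import io

# Hypothetical sample data standing in for an exported two-column table.
SAMPLE_TSV = "DocID\tText\n1\tFoo bar bar\n2\tbar bar black sheep\n"


def fake_nlp(text):
    """Placeholder for a loaded spaCy pipeline; it only splits on whitespace."""
    return text.split()


def process_rows(handle, nlp):
    """Yield (doc_id, processed_text) for every data row in a TSV stream."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        yield row["DocID"], nlp(row["Text"])


# io.StringIO stands in for open('test.tsv', newline='').
results = list(process_rows(io.StringIO(SAMPLE_TSV), fake_nlp))
for doc_id, tokens in results:
    print(doc_id, tokens)
```

Yielding the identifier together with the processed text keeps each result tied to its row, which is usually what you want when the first column is a unique key.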

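Alternatively, you can concatenate the texts into a single string and process them as one document:

with open('some_large_text_file.txt') as f:
    text = f.read()
doc = nlp(text)

For a more advanced usage example, see this code snippet of streaming input and output using pipe().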

【Comments】:

But if the data is read into a DataFrame, you can use df.apply() or an equivalent to feed the rows into nlp instead of iterating.