【Question title】: How to generate unigrams, bigrams and trigrams from a large csv file and count their frequencies using nltk or pure python
【Posted】: 2018-11-02 04:53:29
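【Question】:
I am using this code, and it generates unigrams, bigrams and trigrams from a given text. However, I want to extract unigrams, bigrams and trigrams from a specific column of a large csv file. How should I proceed?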
【Comments】:
Tags:
python-2.7
nltk
n-gram
【Solution 1】:
First, some fancy code to generate the DataFrame.
from io import StringIO
import pandas as pd
sio = StringIO("""I am just going to type up something because you inserted an image instead ctr+c and ctr+v the code to Stackoverflow.
Actually, it's unclear what you want to do with the ngram counts.
Perhaps, it might be better to use the `nltk.everygrams()` if you want a global count.
And if you're going to build some sort of ngram language model, then it might not be efficient to do it as you have done too.""")
# Each line of the sample text becomes one row of the DataFrame.
with sio as fin:
    texts = [line for line in fin]
df = pd.DataFrame({'text': texts})
Then you can easily use DataFrame.apply to extract the ngrams, e.g.
from collections import Counter
from functools import partial
from nltk import ngrams, word_tokenize
# Add one column per n-gram order (1, 2, 3); each cell holds a Counter for that row.
for i in range(1, 4):
    _ngrams = partial(ngrams, n=i)
    df['{}-grams'.format(i)] = df['text'].apply(lambda x: Counter(_ngrams(word_tokenize(x))))
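The snippet above works on an in-memory DataFrame and keeps one Counter per row. Since the question mentions a large csv file and pure Python, here is a minimal sketch of a streaming alternative that uses only the standard library and keeps a single global count per n-gram order. It assumes Python 3, a csv file with a header row, and a naive tokenizer (lowercase, split on whitespace) in place of `word_tokenize`. The function names and the `column` parameter are hypothetical, chosen only for illustration.

```python
import csv
from collections import Counter


def ngrams_from_tokens(tokens, n):
    """Return an iterator over each run of n consecutive tokens, as tuples."""
    return zip(*(tokens[i:] for i in range(n)))


def count_ngrams_in_csv_column(csv_file, column, max_n=3):
    """Return {n: Counter} with global n-gram counts for one csv column.

    csv_file is any open text-mode file object. Rows are read one at a
    time, so the whole file never has to fit in memory.
    """
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for row in csv.DictReader(csv_file):
        # Assumption: naive tokenizer (lowercase, split on whitespace).
        # A missing or empty cell is treated as an empty text.
        tokens = (row.get(column) or "").lower().split()
        for n in counts:
            counts[n].update(ngrams_from_tokens(tokens, n))
    return counts
```

To use it, open the file with something like `open(path, newline="")`, pass the file object and the column name, and then call e.g. `counts[2].most_common(10)` to see the most frequent bigrams. If nltk tokenization is preferred, the `split()` call could be swapped for `word_tokenize`.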