How to generate unigrams, bigrams and trigrams from a large csv file and count their frequencies using nltk or pure python: answers

【Question title】: How to generate unigram, bigram and trigram from a large csv file and count their frequencies using nltk or pure python
【Posted】: 2018-11-02 04:53:29
【Question description】:

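I am using this code, and it generates unigrams, bigrams and trigrams from a given text. But I want to extract unigrams, bigrams and trigrams from a specific column of a large csv file. Please help me with how to proceed.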

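【Question discussion】:

Please don't use images to convey textual information. Edit your question to replace them with the corresponding code. Also, questions seeking debugging help ("why isn't this code working?") must include the desired behavior, a specific problem or error, and the shortest code necessary to reproduce it in the question itself. Without this, your question is off-topic and likely to be closed. Please construct a Minimal, Complete, and Verifiable example and include it.

Tags: python-2.7 nltk n-gram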

【Solution 1】:

First, some fancy code to generate the DataFrame.

from io import StringIO

import pandas as pd

sio = StringIO("""I am just going to type up something because you inserted an image instead ctr+c and ctr+v the code to Stackoverflow.
Actually, it's unclear what you want to do with the ngram counts.
Perhaps, it might be better to use the `nltk.everygrams()` if you want a global count.
And if you're going to build some sort of ngram language model, then it might not be efficient to do it as you have done too.""")

with sio as fin:
    texts = [line for line in fin]

df = pd.DataFrame({'text': texts})

Then you can easily use DataFrame.apply to extract the ngrams, e.g.

from collections import Counter
from functools import partial

from nltk import ngrams, word_tokenize

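# Add one column per n (1-grams, 2-grams, 3-grams); each cell holds a
# Counter of the n-grams found in that row's text.
# Note: word_tokenize requires NLTK's Punkt tokenizer data to be downloaded.
for i in range(1, 4):
    _ngrams = partial(ngrams, n=i)
    df['{}-grams'.format(i)] = df['text'].apply(lambda x: Counter(_ngrams(word_tokenize(x))))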
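
The loop above leaves one Counter per row in each new column. If a single frequency table for the whole file is wanted instead, the per-row counters can be merged. The following is only a minimal sketch: `merge_counts` is a hypothetical helper name, and a plain list of Counters stands in for a column such as `df['2-grams']`.

```python
from collections import Counter


def merge_counts(counters):
    """Sum an iterable of Counter objects into one global Counter."""
    total = Counter()
    for c in counters:
        total.update(c)  # update() adds counts rather than replacing them
    return total


# Stand-in for a column such as df['2-grams']: one Counter per row.
row_counts = [
    Counter({('a', 'b'): 2, ('b', 'c'): 1}),
    Counter({('a', 'b'): 1, ('c', 'd'): 3}),
]

global_counts = merge_counts(row_counts)
print(global_counts.most_common(2))
```

With the DataFrame built above, something like `merge_counts(df['2-grams'])` should give bigram counts over all rows, since iterating a pandas Series yields its values.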

【Discussion】:

Note: DataFrame.apply() is slow, because it calls the Python function once per row.
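
Since the question also allows pure Python, one way around the row-wise apply is to stream the csv file and update global counters directly, so the whole file never has to sit in memory. This is a rough sketch under several assumptions: Python 3, a column named `text` (hypothetical), lowercasing plus whitespace splitting as a naive tokenizer in place of `word_tokenize`, and the hypothetical helper names `ngrams` and `count_ngrams`.

```python
import csv
import io
from collections import Counter


def ngrams(tokens, n):
    """Return an iterator of successive n-token tuples from a token list."""
    return zip(*(tokens[i:] for i in range(n)))


def count_ngrams(rows, column, max_n=3):
    """Count 1..max_n-grams over one column of an iterable of dict rows."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for row in rows:
        tokens = row[column].lower().split()  # naive whitespace tokenizer (assumption)
        for n in counts:
            counts[n].update(ngrams(tokens, n))
    return counts


# Small in-memory stand-in for a large file; for a real file, pass
# csv.DictReader(open(path, newline='')) instead of the StringIO object.
sample = io.StringIO("id,text\n1,the cat sat\n2,the cat ran\n")
counts = count_ngrams(csv.DictReader(sample), column="text")
print(counts[2].most_common(1))
```

`csv.DictReader` reads one row at a time, so memory use is driven by the number of distinct n-grams rather than by the size of the file.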