TfidfVectorizer Returning 0 for Ngrams in Pandas DF with Duplicate IDs

【Question title】: TfidfVectorizer Returning 0 for Ngrams in Pandas DF with Duplicate IDs
【Posted】: 2018-02-03 14:33:37
【Question】:

I have a grouped df:

id    text
100   he loves ice cream
100   she loves ice
100   i hate avocado

I am using this function to extract bigrams, frequencies, and tfidf scores:

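import pandas as pd
from tqdm import tqdm

# cv and tv are assumed to be a CountVectorizer and a TfidfVectorizer
# set up for bigrams elsewhere; their definitions are not shown in the post.
def extractFeatures(groupedDF, textCol):
    features = pd.DataFrame()
    for id, group in tqdm(groupedDF):
        freq = cv.fit_transform(group[textCol])
        tfidf = tv.fit_transform(group[textCol])
        freq = sum(freq).toarray()[0]   # add the rows together: total count per bigram
        tfidf.todense()                 # return value is discarded
        tfidf = tfidf.toarray()[0]      # first row of the tfidf matrix
        freq = pd.DataFrame(freq, columns=['frequency'])
        tfidf = pd.DataFrame(tfidf, columns=['tfidf'])
        dfinner = pd.DataFrame(cv.get_feature_names(), columns=['ngram'])
        dfinner['id'] = id
        dfinner = dfinner.join(freq)
        results = dfinner.join(tfidf)
        features = features.append(results)
    return features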

This produces the following df:

id    ngram         frequency    tfidf
100   hate avocado  1            0
100   he loves      1            .3
100   i hate        1            0
100   ice cream     1            .3
100   loves ice     2            .6 
100   she loves     1            0

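The tfidf scores shown here are made up. The function finds the frequencies correctly. It then finds tfidf scores for the first row of the grouped df (including bigrams that appear in more than one row). However, it finds no tfidf scores for bigrams that are unique to the second and third rows.

Also, although the tfidf scores above are made up, the real ones are identical for any bigrams that have the same frequency within a given document. So any bigram with a frequency of 1 in the first row will have a tfidf score of 0.3, while any bigram with a frequency of 1 in another row might have a tfidf score of 0.24. This seems odd, since the term frequency surely differs from bigram to bigram.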

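Two questions:

Why are no tfidf scores found for the second and third rows?
Why do bigrams that occur with the same frequency in a given document get the same tfidf score?

Thanks for any insight you can offer!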

【Comments】:

Tags: python python-3.x pandas scikit-learn tf-idf

【Solution 1】:

print(df)

    id  text
0   100 he loves ice cream
1   100 she loves ice
2   100 i hate avocado

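TF-IDF measures how important a term is to a document: it weighs how often the term occurs in that document against how common the term is across the rest of the documents. To compute TF-IDF, I recommend using scikit-learn's TfidfVectorizer():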

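import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Bigrams only; this token pattern also keeps one-letter words such as "i".
vectorizer = TfidfVectorizer(smooth_idf=True,
                             ngram_range=(2, 2),
                             token_pattern=r'(?u)\b\w\w*\b')

# One row per document, one column per bigram.
words = vectorizer.fit_transform(df.text)

# vocabulary_ maps each bigram to its column index; invert it to label the columns.
df2 = pd.DataFrame(words.toarray()).rename(
    columns={index: term for term, index in vectorizer.vocabulary_.items()})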

print(df2)

   hate avocado  he loves    i hate  ice cream  loves ice  she loves
0      0.000000  0.622766  0.000000   0.622766   0.473630   0.000000
1      0.000000  0.000000  0.000000   0.000000   0.605349   0.795961
2      0.707107  0.000000  0.707107   0.000000   0.000000   0.000000

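The matrix above gives the relative importance of each bigram in each document (one row per document). If a bigram does not occur in a document, its value there is zero.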
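To make the numbers less mysterious, here is a minimal sketch that reproduces them by hand in plain Python. It assumes scikit-learn's documented defaults for this setup: raw counts as term frequency, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper names `bigrams` and `tfidf_row` are made up for illustration.

```python
import math

docs = ["he loves ice cream", "she loves ice", "i hate avocado"]

def bigrams(text):
    """Split on whitespace and join each pair of adjacent words."""
    words = text.split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

doc_bigrams = [bigrams(d) for d in docs]
vocab = sorted({b for doc in doc_bigrams for b in doc})
n_docs = len(docs)

# Document frequency: in how many documents each bigram occurs.
df = {b: sum(b in doc for doc in doc_bigrams) for b in vocab}

# Smoothed IDF (assumed to match smooth_idf=True): ln((1 + n) / (1 + df)) + 1
idf = {b: math.log((1 + n_docs) / (1 + df[b])) + 1 for b in vocab}

def tfidf_row(doc):
    """Raw count times IDF for every vocabulary bigram, then L2-normalize."""
    raw = [doc.count(b) * idf[b] for b in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

matrix = [tfidf_row(doc) for doc in doc_bigrams]
for row in matrix:
    print([round(v, 6) for v in row])
```

This also speaks to both questions. Indexing the transformed matrix with `[0]`, as in `tfidf.toarray()[0]`, returns only the first document's row, so bigrams that occur only in later rows show up as 0. And two bigrams in the same document get the same score whenever they share the same count and the same document frequency, because the row normalization is common to both.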

You can compute the frequencies the same way with scikit-learn's CountVectorizer().
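A minimal sketch of that, assuming the same three texts and the same bigram settings as the TfidfVectorizer above. The variable names `totals` and `frequency` are just for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["he loves ice cream", "she loves ice", "i hate avocado"]

# Same bigram settings as the TfidfVectorizer, so both matrices share columns.
cv = CountVectorizer(ngram_range=(2, 2), token_pattern=r"(?u)\b\w\w*\b")
counts = cv.fit_transform(texts)  # sparse matrix, one row per document

# Sum down the rows to get the total occurrences of each bigram.
totals = counts.toarray().sum(axis=0)

# vocabulary_ maps each bigram to its column index.
frequency = {term: int(totals[col]) for term, col in cv.vocabulary_.items()}

for term in sorted(frequency):
    print(term, frequency[term])
```

Keeping `counts` as a full matrix, rather than summing it, gives the per-document counts in the same row and column layout as the TF-IDF matrix.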

【Comments】: