【Question title】: How to get tfidf with pandas dataframe?
【Posted】: 2016-10-02 06:36:42
【Question】:

I want to calculate tf-idf from the documents below. I'm using python and pandas.

import pandas as pd
df = pd.DataFrame({'docId': [1,2,3], 
               'sent': ['This is the first sentence','This is the second sentence', 'This is the third sentence']})

First, I thought I would need to get a word_count for each row, so I wrote a simple function:

def word_count(sent):
    # Count how often each whitespace-separated word occurs in the sentence.
    word2cnt = dict()
    for word in sent.split():
        if word in word2cnt: word2cnt[word] += 1
        else: word2cnt[word] = 1
    return word2cnt

Then I applied it to each row.

df['word_count'] = df['sent'].apply(word_count)

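But now I'm lost. I know there's an easy way to calculate tf-idf if I use Graphlab, but I want to stick with an open source option. Both Sklearn and gensim look overwhelming. What's the simplest way to get tf-idf?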

【Question comments】:

    Tags: python pandas scikit-learn tf-idf gensim


    【Solution 1】:

    The Scikit-learn implementation is really easy:

    from sklearn.feature_extraction.text import TfidfVectorizer
    v = TfidfVectorizer()
    x = v.fit_transform(df['sent'])
    

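    There are plenty of parameters you can specify; see the scikit-learn documentation for TfidfVectorizer.

    The output of fit_transform is a sparse matrix. If you want to inspect it, you can call x.toarray():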

    In [44]: x.toarray()
    Out[44]: 
    array([[ 0.64612892,  0.38161415,  0.        ,  0.38161415,  0.38161415,
             0.        ,  0.38161415],
           [ 0.        ,  0.38161415,  0.64612892,  0.38161415,  0.38161415,
             0.        ,  0.38161415],
           [ 0.        ,  0.38161415,  0.        ,  0.38161415,  0.38161415,
             0.64612892,  0.38161415]])
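
    To see where 0.646 and 0.382 come from, here is a minimal pure-Python sketch that reproduces the numbers above by hand. It assumes the vectorizer's default settings (smoothed idf, L2 row normalization) and replaces its tokenizer with a plain lowercase-and-split, which happens to give the same tokens for these three sentences. The names tokenized, doc_freq, tfidf_row and manual are illustrative only.

```python
import math
from collections import Counter

docs = [
    "This is the first sentence",
    "This is the second sentence",
    "This is the third sentence",
]

# Lowercase and split on whitespace: a simplification of the vectorizer's tokenizer.
tokenized = [doc.lower().split() for doc in docs]
vocab = sorted({word for words in tokenized for word in words})
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(word for words in tokenized for word in set(words))

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocab}

def tfidf_row(words):
    """Return the L2-normalized tf-idf vector of one tokenized document."""
    counts = Counter(words)
    raw = [counts[word] * idf[word] for word in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

manual = [tfidf_row(words) for words in tokenized]
print(vocab)
print([round(value, 8) for value in manual[0]])
```

    The four words shared by every sentence get idf 1, while the word unique to each sentence gets ln(2) + 1, which is why it ends up with the larger weight after normalization.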
    

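    【Comments】:

    • Suppose I pass 100 to the max_features parameter and the corpus's original vocabulary has 1000 terms. How do I get the names of the selected features and map them to the resulting matrix?
    • v.get_feature_names() will give you the list of feature names. v.vocabulary_ will give you a dict with the feature names as keys and their index in the resulting matrix as values.
    • Yes, but be careful printing feature_names(). If the number of features grows, you will run into memory issues.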
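
    The mapping discussed in these comments can be sketched as follows. This is only an illustration on the question's three-sentence frame; names and tfidf_df are made-up variable names. It builds the column labels from v.vocabulary_ so that it does not depend on which feature-name method your scikit-learn version provides (recent releases expose get_feature_names_out() instead of get_feature_names()).

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({'docId': [1, 2, 3],
                   'sent': ['This is the first sentence',
                            'This is the second sentence',
                            'This is the third sentence']})

v = TfidfVectorizer()
x = v.fit_transform(df['sent'])

# vocabulary_ maps each term to its column index, so sorting the terms
# by that index yields the column labels in matrix order.
names = sorted(v.vocabulary_, key=v.vocabulary_.get)

# toarray() densifies the matrix: fine for a toy corpus, costly for a large one.
tfidf_df = pd.DataFrame(x.toarray(), columns=names, index=df['docId'])
print(tfidf_df.round(3))
```

    With max_features set, the same code labels only the columns that were kept, because vocabulary_ contains just the selected terms.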
    【Solution 2】:

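    I found a slightly different approach using CountVectorizer from sklearn.
    -- count vectorizer: Ultraviolet Analysis, word frequency
    -- preprocessing/cleaning text: Usman Malik, scraping tweets preprocessing

    I won't cover preprocessing in this answer. Basically, what you want to do is import CountVectorizer and fit your data to a CountVectorizer object. That gives you access to .vocabulary_.items(), which returns the vocabulary of your dataset: the unique terms present and the column index assigned to each, subject to whatever limiting parameters you pass to CountVectorizer, such as the maximum number of features.

    Then you use TfidfTransformer in a similar way to generate the tf-idf weights for the terms.

    I'm coding in a jupyter notebook file, using pandas and the PyCharm IDE.

    Here's a code snippet: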

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    import numpy as np
    import pandas as pd
    #https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
    countVec = CountVectorizer(max_features= 5000, stop_words='english', min_df=.01, max_df=.90)
    
    #%%
    #use CountVectorizer.fit(raw_documents[, y]) to learn the vocabulary dictionary of all tokens in the raw documents
    #the raw documents in this case are tweetsFrameWords["Text"] (processed text)
    countVec.fit(tweetsFrameWords["Text"])
    #useful debug: inspect the (term, column index) pairs of the learned vocabulary
    list(countVec.vocabulary_.items())
    
    #%%
    #convert to bag of words
    #transform() returns a sparse matrix of raw counts with shape (n_documents, n_terms)
    countVec_count = countVec.transform(tweetsFrameWords["Text"])
    
    #%%
    #make array from number of occurrences
    occ = np.asarray(countVec_count.sum(axis=0)).ravel().tolist()
    
    #make a new data frame with columns term and occurrences, meaning word and number of occurrences
    #note: get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
    bowListFrame = pd.DataFrame({'term': countVec.get_feature_names_out(), 'occurrences': occ})
    print(bowListFrame)
    
    #sort by number of occurrences, most->least; sort_values defaults to ascending=True, so pass ascending=False
    bowListFrame.sort_values(by='occurrences', ascending=False).head(60)
    
    #%%
    #now, convert to a more useful ranking system, tf-idf weights
    #TfidfTransformer: rescale raw word counts to tf-idf weights; see
    #https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
    tweetTransformer = TfidfTransformer()
    
    #initial fit representation using transformer object
    tweetWeights = tweetTransformer.fit_transform(countVec_count)
    
    #same process as the occurrences data frame, but with each term's mean tf-idf weight across all documents
    tweetWeightsFin = np.asarray(tweetWeights.mean(axis=0)).ravel().tolist()
    
    #now that we've computed tf-idf, make a dataframe with weights and names
    #note: get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
    tweetWeightFrame = pd.DataFrame({'term': countVec.get_feature_names_out(), 'weight': tweetWeightsFin})
    print(tweetWeightFrame)
    tweetWeightFrame.sort_values(by='weight', ascending=False).head(20)
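
    As a sanity check on the two-step approach, the sketch below compares it with the single TfidfVectorizer call on the three sentences from the question (the tweet data used in this answer is not available here). It assumes default parameters on all three classes; two_step, one_step and same are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

df = pd.DataFrame({'docId': [1, 2, 3],
                   'sent': ['This is the first sentence',
                            'This is the second sentence',
                            'This is the third sentence']})

# Two steps: raw term counts first, then rescale the counts to tf-idf weights.
counts = CountVectorizer().fit_transform(df['sent'])
two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer does the counting and the weighting together.
one_step = TfidfVectorizer().fit_transform(df['sent'])

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

    The two-step route is mainly useful when you also want the raw counts, as in the occurrences table above; otherwise the single vectorizer is less code.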
    

    【Comments】:

      【Solution 3】:

      A simple solution is to use texthero:

      import texthero as hero
      df['tfidf'] = hero.tfidf(df['sent'])
      
      In [5]: df.head()
      Out[5]:
         docId                         sent                                              tfidf
      0      1   This is the first sentence  [0.3816141458138271, 0.6461289150464732, 0.381...
      1      2  This is the second sentence  [0.3816141458138271, 0.0, 0.3816141458138271, ...
      2      3   This is the third sentence  [0.3816141458138271, 0.0, 0.3816141458138271, ...
      

      【Comments】:

      • This is probably the best and simplest approach.