【Question title】: Data frame of tf-idf with Python
【Posted】: 2017-06-13 17:43:33
【Question description】:

I have to classify some sentiments, and my data frame looks like this:

Phrase                      Sentiment    
is it  good movie          positive    
wooow is it very goode      positive    
bad movie                  negative

I did some preprocessing (tokenization, stop-word removal, stemming, etc.) and got:

Phrase                      Sentiment    
[ good , movie  ]        positive    
[wooow ,is , it ,very, good  ]   positive 
[bad , movie ]            negative

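In the end I need a data frame where the rows are the texts, the columns are the words, and the values are the tf-idf scores, like this: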

good     movie    wooow    very     bad      Sentiment
tf_idf   tf_idf   tf_idf   tf_idf   tf_idf   positive
(same thing for the 2 remaining lines)

【Question comments】:

    Tags: python pandas dataframe text-mining tf-idf


    【Solution 1】:

    I would use sklearn.feature_extraction.text.TfidfVectorizer, which is designed for exactly this kind of task.

    Demo:

    In [63]: df
    Out[63]:
                       Phrase Sentiment
    0       is it  good movie  positive
    1  wooow is it very goode  positive
    2               bad movie  negative
    

    Solution:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word', stop_words='english')
    
    # fit on the raw phrases; pop() also removes the column from df
    X = vect.fit_transform(df.pop('Phrase')).toarray()
    
    r = df[['Sentiment']].copy()
    
    del df
    
    # get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
    df = pd.DataFrame(X, columns=vect.get_feature_names_out())
    
    del X
    del vect
    
    r.join(df)
    

    Result:

    In [31]: r.join(df)
    Out[31]:
      Sentiment  bad  good     goode     wooow
    0  positive  0.0   1.0  0.000000  0.000000
    1  positive  0.0   0.0  0.707107  0.707107
    2  negative  1.0   0.0  0.000000  0.000000
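
    The result has no `movie`, `is`, `it` or `very` columns, although the question asks for some of them. A minimal sketch of why, assuming the three phrases from the demo: `stop_words='english'` drops the stop words, and `max_df=0.5` drops every term that occurs in more than half of the documents. The `relaxed` vectorizer is a hypothetical variant for comparison, not part of the original answer.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

phrases = pd.Series([
    "is it  good movie",
    "wooow is it very goode",
    "bad movie",
])

# Settings from the answer: English stop words are removed and terms that
# occur in more than half of the documents (max_df=0.5) are dropped.
strict = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                         analyzer='word', stop_words='english')
strict.fit(phrases)
strict_terms = sorted(strict.vocabulary_)

# Hypothetical relaxed variant: no document-frequency cap, no stop-word list.
relaxed = TfidfVectorizer(sublinear_tf=True, analyzer='word')
relaxed.fit(phrases)
relaxed_terms = sorted(relaxed.vocabulary_)

print(strict_terms)   # terms kept by the answer's settings
print(relaxed_terms)  # terms kept without the two filters
```

    Whether the filters should stay depends on the classifier; for a corpus this small they remove most of the vocabulary.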
    

    Update: a memory-saving solution:

    from sklearn.feature_extraction.text import TfidfVectorizer
    
    vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word', stop_words='english')
    
    X = vect.fit_transform(df.pop('Phrase')).toarray()
    
    # add the columns one by one instead of building a second DataFrame
    # (get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2)
    for i, col in enumerate(vect.get_feature_names_out()):
        df[col] = X[:, i]
    

    UPDATE2: related question where the memory issue was finally solved
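
    Another option, which is not taken from the linked question and is only a sketch under the assumption that downstream code can work with sparse columns, is to skip `.toarray()` entirely and let pandas wrap the SciPy sparse matrix with `pd.DataFrame.sparse.from_spmatrix`. The sample frame below is the one from the demo.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({
    "Phrase": ["is it  good movie", "wooow is it very goode", "bad movie"],
    "Sentiment": ["positive", "positive", "negative"],
})

vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                       analyzer='word', stop_words='english')
X = vect.fit_transform(df["Phrase"])  # SciPy sparse matrix, never densified

# Column names ordered by column index, taken from the fitted vocabulary.
terms = sorted(vect.vocabulary_, key=vect.vocabulary_.get)

# Sparse-backed frame: only the non-zero values are stored.
tfidf = pd.DataFrame.sparse.from_spmatrix(X, index=df.index, columns=terms)

result = df[["Sentiment"]].join(tfidf)
print(result)
```

    Many scikit-learn estimators accept the sparse matrix `X` directly, so for classification the data frame may not be needed at all.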

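    【Comments】:

    • I get a MemoryError on this line: r = pd.DataFrame(X.toarray(), columns=vect.get_feature_names())
    • @AmalKostaliTarghi, I have updated my answer. Please check whether it helps.
    • I think the problem is in df = pd.DataFrame(X.toarray(), columns=vect.get_feature_names()), because my corpus contains more than 10000 rows and, after applying tf-idf, the matrix becomes 156060x11780, so I think that is where the problem is.
    • @AmalKostaliTarghi, yes, 156060*11780*8/1024**3 is approx. 14 GiB, and it gets copied for a short time when the DataFrame() constructor is called.
    • @AmalKostaliTarghi, I can imagine an ugly solution: adding the columns in a loop. It will be slower, but it should save memory.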
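
    The size estimate from the comments can be reproduced with a few lines. This assumes the dense array holds float64 values (8 bytes each), which is what the factor 8 in the comment implies.

```python
n_rows, n_cols = 156060, 11780   # matrix shape reported in the comments
bytes_per_value = 8              # float64

# Size of the dense array in GiB (1 GiB = 1024**3 bytes).
dense_gib = n_rows * n_cols * bytes_per_value / 1024**3
print(f"dense float64 matrix: {dense_gib:.1f} GiB")
```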
    【Solution 2】:

    Setup

    import numpy as np
    import pandas as pd
    
    df = pd.DataFrame([
            [['good', 'movie'], 'positive'],
            [['wooow', 'is', 'it', 'very', 'good'], 'positive'],
            [['bad', 'movie'], 'negative']
        ], columns=['Phrase', 'Sentiment'])
    
    df
    
                            Phrase Sentiment
    0                [good, movie]  positive
    1  [wooow, is, it, very, good]  positive
    2                 [bad, movie]  negative
    

    Compute the term frequency tf

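    # use `value_counts` to get counts of items in each list
    # (the top-level pd.value_counts is deprecated in recent pandas, so go through a Series)
    tf = df.Phrase.apply(lambda words: pd.Series(words).value_counts()).fillna(0).sort_index(axis=1)
    print(tf)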
    
       bad  good   is   it  movie  very  wooow
    0  0.0   1.0  0.0  0.0    1.0   0.0    0.0
    1  0.0   1.0  1.0  1.0    0.0   1.0    1.0
    2  1.0   0.0  0.0  0.0    1.0   0.0    0.0
    

    Compute the inverse document frequency idf

    # add one to numerator and denominator just in case a term isn't in any document
    # maximum value is log(N + 1) (term in no document) and minimum value is zero (term in every document)
    idf = np.log((len(df) + 1 ) / (tf.gt(0).sum() + 1))
    idf
    
    bad      0.693147
    good     0.287682
    is       0.693147
    it       0.693147
    movie    0.287682
    very     0.693147
    wooow    0.693147
    dtype: float64
    

    tfidf

    tf * idf
    
            bad      good        is        it     movie      very     wooow
    0  0.000000  0.287682  0.000000  0.000000  0.287682  0.000000  0.000000
    1  0.000000  0.287682  0.693147  0.693147  0.000000  0.693147  0.693147
    2  0.693147  0.000000  0.000000  0.000000  0.287682  0.000000  0.000000
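
    The question also asks for the `Sentiment` column next to the scores, which the steps above leave out. A minimal end-to-end sketch, assuming the same token lists and the same smoothed idf as in this answer, could look like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [['good', 'movie'], 'positive'],
    [['wooow', 'is', 'it', 'very', 'good'], 'positive'],
    [['bad', 'movie'], 'negative'],
], columns=['Phrase', 'Sentiment'])

# Term frequency: one row per document, one column per word.
tf = (df['Phrase']
      .apply(lambda words: pd.Series(words).value_counts())
      .fillna(0)
      .sort_index(axis=1))

# Smoothed inverse document frequency, as in the answer.
idf = np.log((len(df) + 1) / (tf.gt(0).sum() + 1))

# Weight the counts and put the label back next to them.
result = (tf * idf).join(df['Sentiment'])
print(result)
```

    Note that these values differ from what TfidfVectorizer produces by default, since scikit-learn adds 1 to the idf and L2-normalizes each row.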
    

    【Comments】:
