【Question Title】: Sentences that appear the most using tfidf in my dataframe with Python
【Posted】: 2020-04-10 01:31:41
【Question】:

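I want to find the sentences that appear the most in my dataframe using tfidf. I have already done some preprocessing (tokenization and stopword removal), and now I have 2 columns (text and Stopword):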

text                                                                   Stopword
bts jimin declared himself the worst player after his self sabotage    ['bts', 'jimin', 'declared','worst', 'player', 'self', 'sabotage']
bts ultra practical suga turned their game into an economy lesson      ['bts', 'ultra', 'practical', 'suga', 'turned', 'game', 'economy', 'lesson']
the mystery of bts sunflowers has finally been solved                  ['mystery', 'bts', 'sunflowers', 'finally', 'solved']
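
For anyone who wants to reproduce this frame, the following is a minimal sketch, assuming whitespace tokenization and a small hand-written stopword set. The `STOPWORDS` set is hypothetical: it was chosen only so that the result matches the rows above, and is not the list actually used in the original preprocessing.

```python
import pandas as pd

# Hypothetical stopword set, picked so the output matches the sample rows.
STOPWORDS = {"himself", "the", "after", "his", "their",
             "into", "an", "of", "has", "been"}

df = pd.DataFrame({
    "text": [
        "bts jimin declared himself the worst player after his self sabotage",
        "bts ultra practical suga turned their game into an economy lesson",
        "the mystery of bts sunflowers has finally been solved",
    ]
})

# Tokenize on whitespace, then drop every token found in the stopword set.
df["Stopword"] = df["text"].apply(
    lambda sentence: [tok for tok in sentence.split() if tok not in STOPWORDS]
)

print(df["Stopword"].tolist())
```

If the frame is loaded from a CSV instead, the Stopword column will probably contain strings rather than lists and would need to be parsed first (for example with `ast.literal_eval`).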

From the Stopword column, I want to get a dataframe that lists each word together with its tf_idf value, like this:

bts           tf_idf
mystery       tf_idf
suga          tf_idf
jimin         tf_idf
declared      tf_idf
worst         tf_idf
player        tf_idf
self          tf_idf
sabotage      tf_idf
practical     tf_idf
turned        tf_idf
game          tf_idf
economy       tf_idf
lesson        tf_idf
sunflowers    tf_idf
finally       tf_idf
solved        tf_idf

Maybe someone here knows the code and can help me?

【Comments】:

    Tags: python csv dataframe tokenize tf-idf


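    【Solution 1】:

    So it looks like there are many formulas for tf-idf. I'm not sure which one you want to use, but once you decide, I would do something like this: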

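    import math
    import pandas as pd

    def tf_idf(word, doc, docs):
      # One common variant, used here as an assumption; swap in the formula you pick.
      tf = doc.count(word) / len(doc)              # term frequency within this document
      n_containing = sum(word in d for d in docs)  # number of documents containing the word
      return tf * math.log(len(docs) / n_containing)

    docs = df["Stopword"].tolist()  # assumes each cell holds a list of tokens
    output = []
    for index, row in df.iterrows():
      # iterate over the token list, not over every column of the row
      for word in row["Stopword"]:
        output.append([word, tf_idf(word, row["Stopword"], docs)])

    output = pd.DataFrame(data=output, columns=["Word", "TF_IDF"])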
    

    【Discussion】:
