【Posted】: 2020-03-12 02:09:42
【Question】:
I am trying to use TF-IDF to compute word frequencies for a DataFrame of messages. This is what I have so far:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
new_group['tokenized_sents'] = new_group.apply(lambda row: nltk.word_tokenize(row['message']), axis=1).astype(str).str.lower()
vectoriser=TfidfVectorizer()
new_group['tokenized_vector'] = list(vectoriser.fit_transform(new_group['tokenized_sents']).toarray())
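However, the code above gives me a bunch of zeros instead of word frequencies. How can I fix this to get the correct numeric frequency for each message? Here is my DataFrame: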
user_id    date        message     tokenized_sents      tokenized_vector
X35WQ0U8S  2019-02-17  Need help   ['need','help']      [0.0,0.0]
X36WDMT2J  2019-03-22  Thank you!  ['thank','you','!']  [0.0,0.0,0.0]
【Discussion】:
Tags: python-3.x pandas word-frequency tfidfvectorizer