Use of TfidfVectorizer on dataframe

【Question title】: Use of TfidfVectorizer on dataframe
【Posted】: 2023-03-03 12:49:01
【Question】:

My dataframe has 3 columns (positive reviews, negative reviews, and score):

  negative                                        Positive               Label  
0 [there, were, issues, with, the, wifi, c]     [no, positive]             1  
1 [rooms, could, do, with, a, bit, of, a]   [the, well, meaning, staff]   2.5

I want to apply TfidfVectorizer to the dataframe. I wrote the following code.

from sklearn.feature_extraction.text import TfidfVectorizer  
df_x=train_df["Positive"]  
df_y=train_df["Score"]  
cv = TfidfVectorizer()   
df_xcv = cv.fit_transform(df_x)  
a=df_xcv.toarray()  
cv.get_feature_names()

It raises this error:

AttributeError: 'list' object has no attribute 'lower'

Why is this error raised?

【Comments】:

Please avoid cross-posting (asking the same question on multiple sites).

Tags: text-classification tfidfvectorizer

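【Solution 1】:

You are passing cv.fit_transform() a pd.Series whose elements are lists of tokens, but it expects an iterable of strings. TfidfVectorizer lowercases each document by calling .lower() on it, and a list has no such method, which is why you get the AttributeError. What you can do is join the list in each row into a single string, and then pass the resulting column to the vectorizer: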

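from sklearn.feature_extraction.text import TfidfVectorizer

# Join each row's token list into one space-separated string
df['joined_positive'] = df['Positive'].apply(' '.join)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['joined_positive'])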

【Comments】: