How do I apply CountVectorizer to each row in a dataframe? Answers

【Question title】: How do I apply CountVectorizer to each row in a dataframe?
【Posted】: 2020-02-02 20:29:07
【Question description】:

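I have a dataframe df with 3 columns. Columns A and B contain strings, and column C is a numeric variable. [Image: Dataframe]

I want to convert it into a feature matrix by passing it to CountVectorizer.

I define my CountVectorizer as: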

cv = CountVectorizer(input='content', encoding='iso-8859-1', 
                     decode_error='ignore', analyzer='word',
                    ngram_range=(1), tokenizer=my_tokenizer, stop_words='english',
                    binary=True)

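Next, I pass the entire dataframe to cv.fit_transform(df), but it doesn't work. I get this error: TypeError: cannot unpack non-iterable int object

Next, I converted each row of the dataframe to a string: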

sample = pdt_items["A"] + "," + pdt_items["C"].astype(str) + "," + pdt_items["B"]

Then I apply:

cv_m = sample.apply(lambda row: cv.fit_transform(row))

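I still get an error: ValueError: Iterable over raw text documents expected, string object received.

Please let me know where I'm going wrong. Or do I need to take a different approach?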
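
Both errors seem to trace back to two details in the code above. ngram_range=(1) is the plain integer 1 rather than a tuple, while CountVectorizer expects a (min_n, max_n) pair such as (1, 1), which would explain the unpacking error. And Series.apply hands fit_transform one string at a time, whereas it expects an iterable of documents. A minimal sketch, assuming a small made-up frame in place of pdt_items and leaving out the custom my_tokenizer and the encoding options:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-in for the asker's pdt_items frame (values are illustrative only).
pdt_items = pd.DataFrame({
    "A": ["red apple pie", "green apple"],
    "B": ["fresh bakery", "local farm"],
    "C": [3, 5],
})

# Note: (1) is just the int 1; ngram_range needs a (min_n, max_n) tuple.
cv = CountVectorizer(ngram_range=(1, 1), stop_words="english", binary=True)

# One string per row: this Series is the iterable of documents.
sample = pdt_items["A"] + "," + pdt_items["C"].astype(str) + "," + pdt_items["B"]

# Fit once on the whole Series instead of calling fit_transform per row.
X = cv.fit_transform(sample)
print(X.shape)  # one matrix row per dataframe row

# Passing a single string reproduces the second error.
try:
    cv.fit_transform(sample.iloc[0])
except ValueError as exc:
    print(exc)
```

Fitting once on the whole Series also keeps a single shared vocabulary, so every row ends up with the same feature columns.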

【Question comments】:

You need to share some data to make this a minimum reproducible example. We don't know much about what is in your df.
@mayosten I added an image with a snippet of my dataset. Thanks!
datasetname and id are the indexes. JFYI.
@Shreya From my experience with NLP, I'm guessing at what you want. Try cv_m = sample.apply(lambda row: cv.fit_transform(row.to_string()))
@QuantStats I get the following error: AttributeError: 'str' object has no attribute 'to_string'

Tags: python pandas dataframe scikit-learn countvectorizer

【Solution 1】:

With the help of @QuantStats's comment, I applied cv to each row of the dataframe as follows:

row_input = df['column_name'].tolist()

kwds = []
for i in range(len(row_input)):
  cell_input = [row_input[i]]
  full_set = row_keywords(cell_input, 1,1)
  candidates = [x for x in full_set if x[1]> 1] # to extract frequencies more than 1
  kwds.append(candidates)

kwds_col = pd.Series(kwds)
df['Keywords'] = kwds_col

("row_keywords" is a function that uses CountVectorizer.)
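
row_keywords is not shown in the answer, so the following is only a guess at its shape: a hypothetical helper that fits a CountVectorizer on the given documents and returns (term, count) pairs, with the two integer arguments assumed to be the n-gram range. The column name and sample texts are made up.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def row_keywords(docs, min_n, max_n):
    """Hypothetical helper: return (term, count) pairs for a list of documents."""
    cv = CountVectorizer(ngram_range=(min_n, max_n))
    try:
        counts = cv.fit_transform(docs)
    except ValueError:
        # Raised when the documents yield no usable tokens (empty vocabulary).
        return []
    totals = np.asarray(counts.sum(axis=0)).ravel()  # total count per term
    terms = sorted(cv.vocabulary_, key=cv.vocabulary_.get)  # terms in column order
    return [(term, int(n)) for term, n in zip(terms, totals)]

df = pd.DataFrame({"column_name": ["spam spam eggs", "tea and toast and tea", "hello"]})

# Same idea as the loop above, written as a per-row apply:
# vectorize each cell on its own and keep terms that occur more than once.
df["Keywords"] = df["column_name"].apply(
    lambda text: [kw for kw in row_keywords([text], 1, 1) if kw[1] > 1]
)
print(df)
```

Because each cell gets its own vectorizer, the vocabularies are not shared across rows. That suits per-row keyword lists but does not produce a single feature matrix.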

【Comments】:

【Solution 2】:

Try this:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

A = ['very good day', 'a random thought', 'maybe like this']
B = ['so fast and slow', 'the meaning of this', 'here you go']
C = [1, 2, 3]

pdt_items = pd.DataFrame({'A':A,'B':B,'C':C})

cv = CountVectorizer()

# use pd.DataFrame here to avoid your error and add your column name    
sample = pd.DataFrame(pdt_items['A']+','+pdt_items['B']+','+pdt_items['C'].astype('str'), columns=['Output'])

vectorized = cv.fit_transform(sample['Output'])
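
To check what the vectorizer produced, the sparse result can be turned back into a dataframe with one column per vocabulary term. This is a sketch on the same toy data; the names text, columns and features are introduced here for illustration only.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

pdt_items = pd.DataFrame({
    "A": ["very good day", "a random thought", "maybe like this"],
    "B": ["so fast and slow", "the meaning of this", "here you go"],
    "C": [1, 2, 3],
})

# Join the three columns into one text document per row.
text = pdt_items["A"] + "," + pdt_items["B"] + "," + pdt_items["C"].astype(str)

cv = CountVectorizer()
vectorized = cv.fit_transform(text)

# vocabulary_ maps each term to its column index; sort the terms by that index.
columns = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
features = pd.DataFrame(vectorized.toarray(), index=pdt_items.index, columns=columns)

print(features.shape)
print(features.loc[0, ["very", "good", "day"]].tolist())
```

With the default token pattern, only tokens of two or more alphanumeric characters are kept, so single-character values such as the digits from column C do not appear as features. If they matter, a different token_pattern or a custom tokenizer would be needed.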

【Comments】:

I tried both approaches. First, I replaced A, B and C with the column names from my dataframe and used the code sample = pd.DataFrame(pdt_items['info']+', '+pdt_items['manufacturer']+','+pdt_items['price'].astype('str'), columns=['Output']). I got the following error: TypeError: cannot unpack non-iterable int object. Next, I tried it the way you implemented it: A = [pdt_items['info']], B = [pdt_items['manufacturer']], C = [pdt_items['price']], test = pd.DataFrame({'A':A,'B':B,'C':C}), testsample = pd.DataFrame(test['A']+','+test['B']+','+test['C'].astype('str'), columns=['Output'])
and finally vectorized = cv.fit_transform(testsample['Output']). I got the following error: AttributeError: 'Series' object has no attribute 'lower'
Please excuse the formatting! The comment section doesn't let me put code on a new line.
@Shreya Could you print(pdt_items['price'])? If I know what it looks like, I can help you. The first approach is more likely to work.
@Shreya Actually, try this for your second approach: use A=pdt_items['info'].to_list(), B=pdt_items['manufacturer'].to_list(), C=pdt_items['manufacturer'].to_list() instead, and then continue.