Bad input shape on sentiment analysis logistic regression: answers

【Question title】: Bad input shape on sentiment analysis logistic regression
【Posted】: 2020-12-28 05:41:21
【Question】:

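I want to measure the accuracy of a sentiment analysis model built with logistic regression, but I get the error: bad input shape (edited to include the inputs).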

DataFrame:

df
sentence                | polarity_label
new release!            | positive
buy                     | neutral
least good-looking      | negative

Code:

from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS
# Define the set of stop words
my_stop_words = ENGLISH_STOP_WORDS
vect = CountVectorizer(max_features=5000,stop_words=my_stop_words)
vect.fit(df.sentence)
X = vect.transform(df.sentence)
y = df.polarity_label
encoder = OneHotEncoder()
encoder.fit_transform(y)

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=123)
LogisticRegression(penalty='l2',C=1.0)

log_reg = LogisticRegression().fit(X_train, y_train)

Error message:

ValueError: Expected 2D array, got 1D array instead:
array=['Neutral' 'Positive' 'Positive' ... 'Neutral' 'Neutral' 'Neutral'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

How can I fix this?

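【Comments】:

I'm not 100% sure about this, but try log_reg = Logistic_Regression() and then log_reg.fit(X_train, y_train)
Sorry, but the problem persists, even with that version.
Can you add the shapes of x_train and y_train?
Your y is being converted to a vector; you probably want to keep it as categorical values, e.g. 0 or 1
y_train.shape --> (14578,) X_train.shape --> (14578,385) @AniketBote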

Tags: python python-3.x scikit-learn logistic-regression

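【Solution 1】:

I think you need to convert your y labels to one-hot encoding. Right now your label vector probably looks like [0,1,0,0,1,0], but for logistic regression you need to convert it to the form [[0,1],[1,0],[0,1],[0,1]], because in logistic regression we tend to compute the probability/likelihood of every class.

You can do this with sklearn's OneHotEncoder:

from sklearn.preprocessing import OneHotEncoder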
encoder = OneHotEncoder()
encoder.fit_transform(y)

【Comments】:

ValueError: Expected 2D array, got 1D array instead: array=['Neutral' 'Positive' 'Positive' ... 'Neutral' 'Neutral' 'Neutral']. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample. The y variable has three values: 'Positive', 'Neutral', 'Negative'.
For three categories, one-hot encoding can take two forms. (1) [0,0], [1,0], [0,1]: here you have chosen the "drop first" option, and [0,0] stands for the dropped first category. (2) If you leave "drop first" as False, you get something like [0,0,1], [0,1,0], [1,0,0]. With "drop first" enabled, applying one-hot encoding to the label vector "y" gives a 2D matrix of the form [[0,0],[0,1],[0,0],...] with shape (len(y), 2).
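To make the shapes in this comment concrete, here is a minimal sketch of both encodings. The four labels are hypothetical stand-ins for the asker's y, and the reshape to a single column is what the ValueError above asks for. Note that scikit-learn's LogisticRegression.fit itself takes a 1-D array of class labels, so this only illustrates the encoder's behavior, not a required preprocessing step.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical labels standing in for df.polarity_label.
y = np.array(["Neutral", "Positive", "Positive", "Negative"])

# OneHotEncoder encodes columns of a 2-D input, so a 1-D label array
# has to be reshaped into a single column before it is passed in.
y_2d = y.reshape(-1, 1)

# One column per category (categories are sorted: Negative, Neutral, Positive).
full = OneHotEncoder().fit_transform(y_2d).toarray()

# drop="first" removes the first category's column; that category
# is then represented by an all-zero row.
dropped = OneHotEncoder(drop="first").fit_transform(y_2d).toarray()

print(full.shape)     # (4, 3)
print(dropped.shape)  # (4, 2)
```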

【Solution 2】:

For example, adjust your code like this:

y = df.polarity_label
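As a rough sketch of what that adjustment looks like end to end: the sentences and labels below are made-up toy data (not the asker's dataset), and plain lists stand in for df.sentence and df.polarity_label. The point is only that y is passed to fit as a 1-D sequence of string labels, with no encoding step.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for df.sentence and df.polarity_label.
sentences = [
    "new release!", "great new release", "buy", "buy now",
    "least good-looking", "least useful release",
]
labels = ["positive", "positive", "neutral", "neutral", "negative", "negative"]

# Only the text column is vectorized: one row per sentence, one column per word.
vect = CountVectorizer(max_features=5000, stop_words="english")
X = vect.fit_transform(sentences)

# y stays a 1-D sequence of class labels; no one-hot encoding is applied.
log_reg = LogisticRegression(C=1.0).fit(X, labels)

print(log_reg.classes_)           # the distinct labels, in sorted order
print(log_reg.predict(X).shape)   # one predicted label per sentence
```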

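Currently, you are trying to transform your y into a vector with your CountVectorizer, which was fitted on the sentence data.

So the CountVectorizer has this vocabulary (you can get it with vect.get_feature_names(), or vect.get_feature_names_out() in scikit-learn 1.0 and later):

['buy', 'good', 'looking', 'new', 'release']

It turns any text containing these words into a vector.

But when you apply it to your y, which contains only the words positive, neutral, negative, it finds no "known" words, so your y is empty.

If you inspect your y after the transformation, you can also see that it is empty: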

<3x5 sparse matrix of type '<class 'numpy.int64'>'
    with 0 stored elements in Compressed Sparse Row format>
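The empty result can be reproduced with the three rows from the question. This is a small sketch that uses plain lists in place of the DataFrame columns and reads the vocabulary from the fitted vectorizer's vocabulary_ attribute:

```python
from sklearn.feature_extraction.text import CountVectorizer

# The three example rows from the question, as plain lists.
sentences = ["new release!", "buy", "least good-looking"]
labels = ["positive", "neutral", "negative"]

vect = CountVectorizer(max_features=5000, stop_words="english")
vect.fit(sentences)

# "least" is an English stop word, so five words remain.
print(sorted(vect.vocabulary_))  # ['buy', 'good', 'looking', 'new', 'release']

# None of the label words are in the vocabulary, so every row is all zeros.
y_vec = vect.transform(labels)
print(y_vec.shape, y_vec.nnz)    # (3, 5) 0
```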

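【Comments】:

After y = df.polarity_label, do I need to turn the labels into dummy variables again? Using just y = df.polarity_label works, but the accuracy is as high as 99%.
Your original question has been answered. Rather than editing it to show your new problem, please ask a new question. Also provide your data, because without it your problem is hard to debug.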