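[Posted]: 2021-05-21 14:03:30
[Question]:
I'm confused about how my test and training sparse matrices can end up with different numbers of features after going through the same preprocessing.
This prevents me from making predictions on my test data.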
def vectorizer(X):
    vectorizer = CountVectorizer(stop_words = 'english')
    vectorizer.fit(X)
    X = vectorizer.fit_transform(X)
    return X
other_features = ["n_steps", "n_ingredients"]
features = df_train[other_features]
test_features = df_test[other_features]
name = vectorizer(df_train.name)
steps = vectorizer(df_train.steps)
ingr = vectorizer(df_train.ingredients)
test_name = vectorizer(df_test.name)
test_steps = vectorizer(df_test.steps)
test_ingr = vectorizer(df_test.ingredients)
X = hstack([steps,ingr, name, np.array(features)])
X_test = hstack([test_steps, test_ingr, test_name, np.array(test_features)])
clf = LogisticRegression(C = 0.01, max_iter = 1000000, penalty = 'l2')
clf.fit(X, y)
predictions = clf.predict(X_test)
Error raised when predicting:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-40-1534a274c605> in <module>
----> 1 predictions = clf.predict(X_test)
/opt/anaconda3/lib/python3.8/site-packages/sklearn/linear_model/_base.py in predict(self, X)
305 Predicted class label per sample.
306 """
--> 307 scores = self.decision_function(X)
308 if len(scores.shape) == 1:
309 indices = (scores > 0).astype(np.int)
/opt/anaconda3/lib/python3.8/site-packages/sklearn/linear_model/_base.py in decision_function(self, X)
284 n_features = self.coef_.shape[1]
285 if X.shape[1] != n_features:
--> 286 raise ValueError("X has %d features per sample; expecting %d"
287 % (X.shape[1], n_features))
288
ValueError: X has 16417 features per sample; expecting 31765
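For reference, the mismatch can be reproduced on a much smaller scale. This is a minimal sketch, assuming made-up strings in place of `df_train.name` and `df_test.name` (the lists `train_texts` and `test_texts` are hypothetical). It shows that a `CountVectorizer` fitted separately on each split learns its own vocabulary, so the column counts have no reason to agree:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for df_train.name and df_test.name (made up for illustration).
train_texts = ["chocolate cake", "lemon chicken soup", "garlic bread"]
test_texts = ["chicken salad", "banana bread"]

# Fitting a fresh vectorizer on each split learns a separate vocabulary,
# so the number of columns depends on the words present in that split.
train_matrix = CountVectorizer(stop_words="english").fit_transform(train_texts)
test_matrix = CountVectorizer(stop_words="english").fit_transform(test_texts)

# The second entry of each shape is the vocabulary size for that split.
print(train_matrix.shape)
print(test_matrix.shape)
```

The same effect, scaled up to three text columns, would explain a model that was trained on 31765 columns being handed a test matrix with 16417.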
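[Comments]:
-
You are creating a new vectorizer separately for your training data and your test data. I don't know why you would expect the results not to differ. Vectorize first, then split.
-
The training and test data are in different files, and I have to train the model on the training data before predicting on the test data.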
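When the two splits live in separate files, one possible approach is to fit each vectorizer on the training column only and reuse that fitted object to `transform` the matching test column. The following is a rough sketch under stated assumptions: the toy lists and arrays are hypothetical stand-ins for the real DataFrame columns, only two text columns are shown, and `max_iter` is lowered for brevity.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for the separate train and test files.
train_names = ["chocolate cake", "lemon chicken soup", "garlic bread", "beef stew"]
train_steps = ["mix and bake", "boil chicken then simmer", "toast bread", "simmer beef slowly"]
train_n_steps = np.array([[3], [4], [1], [5]])
y = [0, 1, 0, 1]

test_names = ["banana bread", "chicken stew"]
test_steps = ["mash banana and bake", "simmer chicken"]
test_n_steps = np.array([[2], [3]])

# One vectorizer per text column, fitted on the training data only.
name_vec = CountVectorizer(stop_words="english")
steps_vec = CountVectorizer(stop_words="english")

X = hstack([steps_vec.fit_transform(train_steps),
            name_vec.fit_transform(train_names),
            train_n_steps]).tocsr()

# Reuse the fitted vectorizers: transform() keeps the training vocabulary,
# so unseen test words are dropped and the column count matches X.
X_test = hstack([steps_vec.transform(test_steps),
                 name_vec.transform(test_names),
                 test_n_steps]).tocsr()

clf = LogisticRegression(C=0.01, max_iter=1000)
clf.fit(X, y)
predictions = clf.predict(X_test)

# Both matrices now have the same number of columns.
print(X.shape[1], X_test.shape[1])
```

The key difference from the code in the question is that `fit_transform` is called only on training data, while the test data goes through `transform` on the same fitted objects, so the helper function would need to return or store the fitted vectorizer rather than only the matrix.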
Tags: python scikit-learn