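【Posted】: 2019-07-11 04:00:57
【Question】:
I need to know what changes are required so that the test data has the same encoded columns as the training data at prediction time. It currently fails with a dimension error.
I have looked at similar queries in the forum..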
import pandas as pd
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# initialize list of lists
data = [[1001, 10,'Male',38], [2001, 15,'Male',50], [2004, 12,'FeMale',40]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['StudentId', 'Age','Gender','Weight'])
#Define y , X, test and train
y=df['Weight']
X=df[['StudentId','Age','Gender']]
# One-hot encode the data using pandas get_dummies
X = pd.get_dummies(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=66)
X_test.head()
----
StudentId Age Gender_FeMale Gender_Male
1 2001 15 0 1
---
# linear regression model creation
lm_model = LinearRegression()
lm_model.fit(X_train,y_train)
# predictions
lm_model.predict(X_test)
---works fine till now..--
When we now create a single test record and predict on it, it fails because of a dimension mismatch. Does one have to manually add the missing encoded column, or is there a cleaner approach? Please advise.
sample_testdata=[[4001, 10,'FeMale']]
# Create the pandas DataFrame
sample_testDF= pd.DataFrame(sample_testdata, columns = ['StudentId', 'Age','Gender'])
sample_testDF_encoded=pd.get_dummies(sample_testDF)
-----
StudentId Age Gender_FeMale
0 4001 10 1
---
lm_model.predict(sample_testDF_encoded)
--Error----
ValueError: shapes (1,3) and (4,) not aligned: 3 (dim 1) != 4 (dim 0)
Prediction on the single test record fails because get_dummies produces only one Gender column...
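The mismatch arises because get_dummies only creates columns for the levels present in the frame it is given. A minimal sketch of one possible fix, assuming the training column layout is kept around after fitting: reindex the encoded test record to the training columns and fill the missing dummy columns with 0. The names train_columns and sample_aligned are illustrative and not from the original post, and the model is fitted on all three rows for brevity.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Training data from the question
df = pd.DataFrame(
    [[1001, 10, 'Male', 38], [2001, 15, 'Male', 50], [2004, 12, 'FeMale', 40]],
    columns=['StudentId', 'Age', 'Gender', 'Weight'],
)
X_train = pd.get_dummies(df[['StudentId', 'Age', 'Gender']], dtype=int)
y_train = df['Weight']

# Remember the encoded column layout seen during training (illustrative name)
train_columns = X_train.columns

lm_model = LinearRegression().fit(X_train, y_train)

# Single record supplied by the user; only one Gender level is present
sample_testDF = pd.DataFrame([[4001, 10, 'FeMale']],
                             columns=['StudentId', 'Age', 'Gender'])
sample_encoded = pd.get_dummies(sample_testDF, dtype=int)

# Add the missing dummy columns as 0 and restore the training column order
sample_aligned = sample_encoded.reindex(columns=train_columns, fill_value=0)

prediction = lm_model.predict(sample_aligned)
```

Note that reindex also drops any dummy column for a level that never appeared in training, so an unseen category is silently encoded as all zeros.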
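【Comments】:
- You need to pass a sample_testdata that contains multiple records, with at least one record for each category (Male/FeMale), for get_dummies to work.
- Thanks for your time. That is a hack. sample_testdata is provided by the user. I don't want to force the user to write out every level of the categorical value that was used in training. Any other ideas..
- Added an alternative answer below.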
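One way to avoid requiring every category level in the user's input, sketched here under assumptions and not necessarily the approach of the answer mentioned above, is to let scikit-learn's OneHotEncoder learn the levels at fit time inside a Pipeline. The step names 'gender', 'preprocess' and 'regress' are illustrative, and the model is fitted on all three rows for brevity.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Training data from the question
df = pd.DataFrame(
    [[1001, 10, 'Male', 38], [2001, 15, 'Male', 50], [2004, 12, 'FeMale', 40]],
    columns=['StudentId', 'Age', 'Gender', 'Weight'],
)
X = df[['StudentId', 'Age', 'Gender']]
y = df['Weight']

# Encode 'Gender' inside the pipeline; numeric columns pass through unchanged.
# handle_unknown='ignore' encodes levels unseen during fit as all zeros.
preprocess = ColumnTransformer(
    transformers=[('gender', OneHotEncoder(handle_unknown='ignore'), ['Gender'])],
    remainder='passthrough',
)
model = Pipeline(steps=[('preprocess', preprocess),
                        ('regress', LinearRegression())])
model.fit(X, y)

# The raw, unencoded user record can be passed directly
sample_testDF = pd.DataFrame([[4001, 10, 'FeMale']],
                             columns=['StudentId', 'Age', 'Gender'])
encoded_sample = model.named_steps['preprocess'].transform(sample_testDF)
prediction = model.predict(sample_testDF)
```

Because the encoder stores the levels it saw during fit, the single record is always expanded to the same number of columns as the training data, and the user never has to supply the other levels.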
Tags: python scikit-learn linear-regression predict one-hot-encoding