How to make predictions using scikit with missing values and categorical variables

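【Question title】: How to make predictions using scikit with missing values and categorical variables
【Posted】: 2021-08-19 03:27:22
【Question】:

I can't figure out how to make predictions, because my training data and test data differ and I don't know how to handle those differences and the missing values. Here is my code: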

import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
X = pd.read_csv('../input/train.csv', index_col='Id') 
X_test = pd.read_csv('../input/test.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors
X.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X.SalePrice
X.drop(['SalePrice'], axis=1, inplace=True)

# To keep things simple, we'll drop columns with missing values
cols_with_missing = [col for col in X.columns if X[col].isnull().any()] 
X.drop(cols_with_missing, axis=1, inplace=True)
X_test.drop(cols_with_missing, axis=1, inplace=True)

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y,
                                                      train_size=0.8, test_size=0.2,
                                                      random_state=0)

### One-Hot Encoding

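# Categorical columns in the training data
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]

# Columns that will be one-hot encoded
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]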

# Columns that will be dropped from the dataset
high_cardinality_cols = list(set(object_cols)-set(low_cardinality_cols))



from sklearn.preprocessing import OneHotEncoder


# Make copy to avoid changing original data (when imputing)
X_train_new = X_train.copy()
X_valid_new = X_valid.copy()

# Apply one-hot encoder to low cardinality cols
# (scikit-learn >= 1.2 names this parameter sparse_output instead of sparse)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train_new[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid_new[low_cardinality_cols]))


# put back the index lost during One-hot encoding
OH_cols_train.index = X_train_new.index
OH_cols_valid.index = X_valid_new.index

# Remove categorical columns, which we will replace with the one-hot encoded ones
# (object_cols, because we also want to remove the high cardinality cols)
num_X_train = X_train_new.drop(object_cols, axis=1)
num_X_valid = X_valid_new.drop(object_cols, axis=1)

# Add one-hot encoded cols to numerical features

OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1) 
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

Up to this point everything works. But when I try to make predictions, it fails. Here is my code:

# Make copy to avoid changing original data (when imputing)
X_train_new = X_train.copy()
X_valid_new = X_valid.copy()

# ????
X_test_new = X_test.copy()

# To keep things simple, we'll drop columns with missing values
cols_with_missing = [col for col in X_test_new.columns if X_test_new[col].isnull().any()] 
X_test_new.drop(cols_with_missing, axis=1, inplace=True)

# Categorical columns in the test data
new_object_cols = [col for col in X_test_new.columns if X_test_new[col].dtype == "object"]

# Columns that will be one-hot encoded
new_low_cardinality_cols = [col for col in new_object_cols if X_test_new[col].nunique() < 10]

# Columns that will be dropped from the dataset
new_high_cardinality_cols = list(set(new_object_cols)-set(new_low_cardinality_cols))

OH_cols_test = pd.DataFrame(OH_encoder.transform(X_test_new[new_low_cardinality_cols]))

This is the error I get:

ValueError: The number of features in X is different to the number of features of the fitted data. The fitted data had 24 features and the X has 19 features.

【Discussion】:

Tags: python machine-learning scikit-learn

【Solution 1】:

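One reason for a shape mismatch between the train/test data is that you create the new categorical variables after doing the train/test split. Most likely, some categories end up appearing in only the train split or only the test split, so the shapes no longer match.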
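As a rough illustration of this effect, the sketch below encodes two toy frames independently and then with a single shared encoder. The data and the `color` column are made up for the example, and `pd.get_dummies` is used only to show independent encoding; it is not taken from the question's code.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): "c" occurs only in train, "d" only in test.
train = pd.DataFrame({"color": ["a", "b", "c", "a"]})
test = pd.DataFrame({"color": ["a", "b", "d"]})

# Encoding each part on its own produces different column sets.
dummies_train = pd.get_dummies(train)
dummies_test = pd.get_dummies(test)
print(list(dummies_train.columns))
print(list(dummies_test.columns))

# Fitting one encoder on the training part and reusing it keeps the
# output width fixed; unseen categories become all-zero rows.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded_train = encoder.fit_transform(train).toarray()
encoded_test = encoder.transform(test).toarray()
print(encoded_train.shape, encoded_test.shape)
```

With a shared encoder, both parts must also be passed the same input columns, otherwise `transform` raises the kind of feature-count `ValueError` shown in the question.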

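I would move this train/test split line:

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

to the end of your preprocessing steps to avoid this problem, so that the train/test data are preprocessed in exactly the same way.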
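A different way to get identical preprocessing, not the one suggested above, is to put imputation and one-hot encoding into a scikit-learn `Pipeline` and apply that single fitted object to the training, validation and test data. The sketch below is only a minimal example under assumptions: the toy frames, the column names `LotArea` and `Street`, the imputation strategies and the choice of `RandomForestRegressor` are all illustrative. Because the column lists are derived once from the training frame, the encoder always receives the same input columns, and missing values in the test data are imputed rather than causing columns to be dropped.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the housing data; names and values are hypothetical.
X = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382],
    "Street": ["Pave", "Pave", "Grvl", "Pave", "Pave", "Grvl", "Pave", "Pave"],
})
y = pd.Series([208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000])
# The test data has missing values and a category never seen in training.
X_test = pd.DataFrame({
    "LotArea": [11622, np.nan, 13830],
    "Street": ["Pave", np.nan, "Alley"],
})

# Column lists are decided once, from the training frame only.
numeric_cols = list(X.select_dtypes(include="number").columns)
categorical_cols = [c for c in X.columns if c not in numeric_cols]

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", RandomForestRegressor(n_estimators=10, random_state=0)),
])

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.75, random_state=0)

# Fit on the training split, then reuse the same fitted steps everywhere.
model.fit(X_train, y_train)
valid_preds = model.predict(X_valid)
test_preds = model.predict(X_test)
print(valid_preds.shape, test_preds.shape)
```

In the question's own code, the equivalent change would be to transform the test data using the original `low_cardinality_cols` list instead of recomputing the column lists from `X_test`.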

【Comments】: