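【Posted】: 2021-08-19 03:27:22
【Question】:
I don't know how to make predictions, because my training data and test data differ, and I don't know how to handle those differences and the missing values. Here is my code: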
import pandas as pd
from sklearn.model_selection import train_test_split
# Read the data
X = pd.read_csv('../input/train.csv', index_col='Id')
X_test = pd.read_csv('../input/test.csv', index_col='Id')
# Remove rows with missing target, separate target from predictors
X.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X.SalePrice
X.drop(['SalePrice'], axis=1, inplace=True)
# To keep things simple, we'll drop columns with missing values
cols_with_missing = [col for col in X.columns if X[col].isnull().any()]
X.drop(cols_with_missing, axis=1, inplace=True)
X_test.drop(cols_with_missing, axis=1, inplace=True)
# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=0)
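### One-Hot Encoding
# Categorical columns in the training data
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]
# Columns that will be one-hot encoded
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]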
# Columns that will be dropped from the dataset
high_cardinality_cols = list(set(object_cols)-set(low_cardinality_cols))
from sklearn.preprocessing import OneHotEncoder
# Make copy to avoid changing original data (when imputing)
X_train_new = X_train.copy()
X_valid_new = X_valid.copy()
# Apply one-hot encoder to low cardinality cols
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train_new[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid_new[low_cardinality_cols]))
# put back the index lost during One-hot encoding
OH_cols_train.index = X_train_new.index
OH_cols_valid.index = X_valid_new.index
# Remove all categorical columns (object_cols, so the high cardinality cols are dropped too);
# the low cardinality ones are replaced by their one-hot encoded versions
num_X_train = X_train_new.drop(object_cols, axis=1)
num_X_valid = X_valid_new.drop(object_cols, axis=1)
# Add one-hot encoded cols to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
Up to this point everything works. But when I try to make predictions, it fails. Here is my code:
# Make copy to avoid changing original data (when imputing)
X_train_new = X_train.copy()
X_valid_new = X_valid.copy()
# ????
X_test_new = X_test.copy()
# To keep things simple, we'll drop columns with missing values
cols_with_missing = [col for col in X_test_new.columns if X_test_new[col].isnull().any()]
X_test_new.drop(cols_with_missing, axis=1, inplace=True)
# Categorical columns in the test data
new_object_cols = [col for col in X_test_new.columns if X_test_new[col].dtype == "object"]
# Columns that will be one-hot encoded
new_low_cardinality_cols = [col for col in new_object_cols if X_test_new[col].nunique() < 10]
# Columns that will be dropped from the dataset
new_high_cardinality_cols = list(set(new_object_cols)-set(new_low_cardinality_cols))
OH_cols_test = pd.DataFrame(OH_encoder.transform(X_test_new[new_low_cardinality_cols]))
Here is the error:
ValueError: The number of features in X is different to the number of features of the fitted data. The fitted data had 24 features and the X has 19 features.
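The message suggests that the encoder was fitted on 24 columns (`low_cardinality_cols`, chosen from the training data) but is asked to transform 19 (`new_low_cardinality_cols`, recomputed from the test data after dropping the test-only columns with missing values). A minimal sketch of one way to keep the two in sync, on made-up data: the column names and values below are hypothetical, and filling missing test values with the training mode is an assumption rather than the only reasonable choice.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-ins for the training and test frames (hypothetical data).
train = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "Zoning": ["RL", "RM", "RL"],
    "LotArea": [8450, 9600, 11250],
})
test = pd.DataFrame({
    "Street": ["Pave", "Grvl"],
    "Zoning": ["RL", None],  # missing only in the test data
    "LotArea": [11622, 14267],
})

# Decide the categorical columns once, from the training data,
# and reuse exactly this list for every later transform.
cat_cols = ["Street", "Zoning"]

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[cat_cols])

# Transforming a different set of columns reproduces the kind of error above.
try:
    encoder.transform(test[["Street"]])
    mismatch_raises = False
except ValueError:
    mismatch_raises = True

# Instead of dropping test columns that contain missing values,
# fill them (here: with the most frequent training value per column).
test_filled = test.copy()
test_filled[cat_cols] = test[cat_cols].fillna(train[cat_cols].mode().iloc[0])

# Same columns, same order as at fit time, so the shapes agree.
encoded = encoder.transform(test_filled[cat_cols]).toarray()
OH_cols_test = pd.DataFrame(encoded, index=test_filled.index)
```

Applied to the code in the question, the same idea would mean transforming `X_test_new[low_cardinality_cols]` rather than a list recomputed from the test data, after dealing with the test-only missing values by imputation instead of dropping columns.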
【Discussion】:
Tags: python machine-learning scikit-learn