Kaggle Machine Learning Level 2 Review, Parts 1 and 2

I. Handling Missing Values

1. Drop the columns that contain missing values.

# Drop every column that contains at least one missing value
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_test  = X_test.drop(cols_with_missing, axis=1)
print("Mean Absolute Error from dropping columns with Missing Values:")
print(score_dataset(reduced_X_train, reduced_X_test, y_train, y_test))

Result: simple and crude. Any useful information in the dropped columns is lost.
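
The snippets in this section call score_dataset, which is not defined in these notes. Below is a minimal sketch of what such a helper might look like, assuming it fits a random forest and returns the mean absolute error on the test set (the model choice, its parameters, and the toy data are all assumptions made for illustration). It is then used with the column-dropping approach from above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error


def score_dataset(X_train, X_test, y_train, y_test):
    """Hypothetical helper: fit a random forest, return the test-set MAE."""
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return mean_absolute_error(y_test, preds)


# Toy data, made up for illustration only.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "rooms": rng.integers(1, 6, size=40).astype(float),
    "area": rng.uniform(30, 120, size=40),
})
y = 1000 * X["rooms"] + 50 * X["area"]   # target computed before adding gaps
X.loc[::7, "area"] = np.nan              # every 7th row loses its "area" value

X_train, X_test = X.iloc[:30], X.iloc[30:]
y_train, y_test = y.iloc[:30], y.iloc[30:]

# Same approach as above: drop every column with a missing value.
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_test = X_test.drop(cols_with_missing, axis=1)

mae = score_dataset(reduced_X_train, reduced_X_test, y_train, y_test)
print("Dropped columns:", cols_with_missing)
print("MAE after dropping columns:", mae)
```

Only "rooms" survives the drop in this toy setup, so the model never sees "area" even though most of its values were present.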
2. Impute the missing values with the mean of the column.

# Replace each missing value with the mean of its column
from sklearn.impute import SimpleImputer

my_imputer = SimpleImputer()
imputed_X_train = my_imputer.fit_transform(X_train)   # mean imputation by default; fit_transform on the training set, transform only on the test set
imputed_X_test = my_imputer.transform(X_test)
print("Mean Absolute Error from Imputation:")
print(score_dataset(imputed_X_train, imputed_X_test, y_train, y_test))

Result: better than method 1, with moderate effort.

For what fit_transform and transform do, see https://blog.csdn.net/weixin_38278334/article/details/82971752
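
To make the fit_transform / transform distinction concrete, here is a small sketch on made-up arrays (the data is an assumption for illustration). fit_transform learns the column mean from the training data and fills with it; transform reuses that stored mean on the test data instead of recomputing it.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up single-column data for illustration.
train = np.array([[1.0], [3.0], [np.nan]])
test = np.array([[np.nan], [10.0]])

imputer = SimpleImputer()                    # strategy="mean" is the default
train_filled = imputer.fit_transform(train)  # learns the mean of 1.0 and 3.0
test_filled = imputer.transform(test)        # fills with the training mean

print(imputer.statistics_)   # the mean learned from the training column
print(test_filled)
```

The gap in the test set is filled with 2.0, the training mean, and not with 10.0, the only value observed in the test column. This keeps test-set statistics from leaking into preprocessing.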

3. Impute, and also add indicator columns that flag which values were missing.

# Add an indicator column for each column with missing values; in this example the gain is small
imputed_X_train_plus = X_train.copy()
imputed_X_test_plus = X_test.copy()

cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
for col in cols_with_missing:
    imputed_X_train_plus[col + '_was_missing'] = imputed_X_train_plus[col].isnull()
    imputed_X_test_plus[col + '_was_missing'] = imputed_X_test_plus[col].isnull()

# Imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = my_imputer.fit_transform(imputed_X_train_plus)
imputed_X_test_plus = my_imputer.transform(imputed_X_test_plus)

print("Mean Absolute Error from Imputation while Track What Was Imputed:")
print(score_dataset(imputed_X_train_plus, imputed_X_test_plus, y_train, y_test))

Result: works well on some datasets, but the gain is not consistent.
Full code example: https://github.com/firdameng/kaggle_ml/blob/master/handl_missing_value.py
Reference: https://www.kaggle.com/dansbecker/handling-missing-values
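
As an alternative to building the _was_missing columns by hand, scikit-learn's SimpleImputer accepts add_indicator=True, which appends the indicator columns to the imputed output. This is not the approach used in the code above, only a sketch of an equivalent shortcut on made-up arrays.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data: the first column has a gap, the second is complete.
train = np.array([[1.0, 4.0],
                  [np.nan, 5.0],
                  [3.0, 6.0]])
test = np.array([[np.nan, 7.0],
                 [2.0, 8.0]])

# add_indicator=True appends one 0/1 column per feature that had
# missing values during fit (here: only the first feature).
imputer = SimpleImputer(add_indicator=True)
train_plus = imputer.fit_transform(train)
test_plus = imputer.transform(test)

print(train_plus)
print(test_plus)
```

The output has three columns: the two imputed features followed by one indicator that is 1.0 wherever the first feature was originally missing.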

II. One-Hot Encoding Categorical Data

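Suppose the original data takes the values red, yellow, and green. We create a separate column for each possible value. Where the original value is red, we put a 1 in the red column and a 0 in the others.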

one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)

pandas.get_dummies one-hot encodes the categorical columns, for example the transformation from Figure 1 to Figure 2.
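
A small sketch of that transformation on a made-up DataFrame (the column names and values are assumptions for illustration). The dtype=int argument is passed only so the output shows 1/0, matching the description above, regardless of the pandas version.

```python
import pandas as pd

# Made-up predictors: one categorical column, one numeric column.
train_predictors = pd.DataFrame({
    "color": ["red", "yellow", "green", "red"],
    "size": [1, 2, 3, 4],
})

# The text column is expanded into one column per value;
# the numeric column is passed through unchanged.
one_hot_encoded_training_predictors = pd.get_dummies(train_predictors, dtype=int)

print(one_hot_encoded_training_predictors)
```

The "color" column is replaced by color_green, color_red, and color_yellow, and each row has a 1 in exactly one of them.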

Full code: https://github.com/firdameng/kaggle_ml/blob/master/one_hot.py