I. Handling missing values

  1. Drop the columns that contain missing values
# Drop every column that has at least one missing value
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_test  = X_test.drop(cols_with_missing, axis=1)
print("Mean Absolute Error from dropping columns with Missing Values:")
print(score_dataset(reduced_X_train, reduced_X_test, y_train, y_test))

Result: simple and crude. Any information in the dropped columns is lost.
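The snippets in this section call `score_dataset`, which is not defined in the post. A minimal sketch of what such a helper might look like follows, assuming a random forest regressor scored by mean absolute error; the model choice, `n_estimators`, and the synthetic data are assumptions for illustration, not the author's exact code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error


def score_dataset(X_train, X_test, y_train, y_test):
    """Fit a model on the training split and return the MAE on the test split."""
    # Assumption: a random forest regressor; the original helper may differ.
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return mean_absolute_error(y_test, predictions)


# Tiny synthetic check (made-up data, only to show the call shape).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=40)
mae = score_dataset(X[:30], X[30:], y[:30], y[30:])
print(f"MAE on synthetic data: {mae:.3f}")
```

A lower value means the predictions are closer to the targets, so the three approaches can be compared by their scores on the same split.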
  2. Fill missing values with the mean of their column

# Replace each missing value with the mean of its column
from sklearn.impute import SimpleImputer

my_imputer = SimpleImputer()
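# The default strategy fills with the column mean. Fit on the training set
# (fit_transform), then reuse the fitted means on the test set (transform).
imputed_X_train = my_imputer.fit_transform(X_train)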
imputed_X_test = my_imputer.transform(X_test)
print("Mean Absolute Error from Imputation:")
print(score_dataset(imputed_X_train, imputed_X_test, y_train, y_test))

Result: better than option 1, with moderate effort.

For details on what fit_transform and transform do, see https://blog.csdn.net/weixin_38278334/article/details/82971752
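The difference between `fit_transform` and `transform` can be seen on a toy example. The sketch below uses made-up data: the imputer learns the column means from the training frame only, and the test frame is then filled with those same training means.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data for illustration only.
train = pd.DataFrame({"a": [1.0, 3.0, np.nan], "b": [10.0, np.nan, 30.0]})
test = pd.DataFrame({"a": [np.nan, 100.0], "b": [np.nan, 200.0]})

imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(train)  # learns the means, then fills train
test_filled = imputer.transform(test)        # fills test with the *training* means

print(imputer.statistics_)  # learned means: a = 2.0, b = 20.0
print(test_filled)          # first row becomes [2.0, 20.0]
```

Note that both calls return NumPy arrays, so the column names are lost unless the result is wrapped back into a DataFrame.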

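  3. Add indicator columns that flag which values were missing
# Add an indicator column for each column with missing values (this did not help much in this example)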
imputed_X_train_plus = X_train.copy()
imputed_X_test_plus = X_test.copy()

cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
for col in cols_with_missing:
    imputed_X_train_plus[col + '_was_missing'] = imputed_X_train_plus[col].isnull()
    imputed_X_test_plus[col + '_was_missing'] = imputed_X_test_plus[col].isnull()

# Imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = my_imputer.fit_transform(imputed_X_train_plus)
imputed_X_test_plus = my_imputer.transform(imputed_X_test_plus)

print("Mean Absolute Error from Imputation while Track What Was Imputed:")
print(score_dataset(imputed_X_train_plus, imputed_X_test_plus, y_train, y_test))

Result: works well on some datasets, but the gain is not consistent.
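As an alternative to building the `_was_missing` columns by hand, scikit-learn's `SimpleImputer` accepts `add_indicator=True`, which appends one indicator column for each feature that had missing values during `fit`. The sketch below uses made-up data and is not the code from the post.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data: only column "a" has a missing value.
train = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

imputer = SimpleImputer(strategy="mean", add_indicator=True)
out = imputer.fit_transform(train)

# Columns of `out`: imputed a, imputed b, indicator for a.
print(out)
```

The second row shows the imputed mean for `a` alongside an indicator value of 1, which marks that the original entry was missing.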
Full code example: https://github.com/firdameng/kaggle_ml/blob/master/handl_missing_value.py
Reference: https://www.kaggle.com/dansbecker/handling-missing-values

II. One-hot encoding categorical data

[Image: Kaggle Machine Learning Level 2 content review, parts 1 and 2]
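Suppose the values in the raw data are red, yellow, and green. We create a separate column for each possible value. When the original value is red, we put a 1 in the red column and a 0 in the others.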

one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)

pandas.get_dummies one-hot encodes categorical columns, for example turning the data in Figure 1 into the form shown in Figure 2:
[Figure 1: data before one-hot encoding]
[Figure 2: data after one-hot encoding]
Full code: https://github.com/firdameng/kaggle_ml/blob/master/one_hot.py
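The red/yellow/green description can be reproduced with a small sketch. The frame below is made up for illustration; `dtype=int` is passed so the new columns hold 0/1 rather than booleans.

```python
import pandas as pd

# Made-up data: one categorical column and one numeric column.
df = pd.DataFrame({
    "color": ["red", "yellow", "green", "red"],
    "price": [1.0, 2.0, 3.0, 4.0],
})

# Categorical columns are expanded; numeric columns pass through unchanged.
encoded = pd.get_dummies(df, dtype=int)
print(encoded)
```

Each row has exactly one 1 across the `color_*` columns, matching the value it originally held, and the original `color` column is dropped.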
