Contents
1. Data preprocessing
2. Feature construction
3. Feature selection
4. Building the LightGBM model
5. Automatic hyperparameter tuning
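I. Data Preprocessing
1.1 Outlier handling
Tukey Method:
A detection method based on the interquartile range. For each feature, compute the IQR and set outlier_step = 1.5 * IQR; a value is flagged as an outlier if it is greater than (upper quartile + outlier_step) or less than (lower quartile - outlier_step). To limit the information lost by deleting samples indiscriminately, set a threshold n and delete a sample only when it contains more than n outlying values.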
import numpy as np
from collections import Counter

def detect_outliers(df, n, features):
    outlier_indices = []
    # Iterate over each feature
    for col in features:
        # Quartiles (nanpercentile ignores missing values such as unknown ages)
        Q1 = np.nanpercentile(df[col], 25)
        Q3 = np.nanpercentile(df[col], 75)
        # IQR: interquartile range
        IQR = Q3 - Q1
        # Outlier step
        outlier_step = 1.5 * IQR
        # Rows whose value in this feature lies outside the fences
        outlier_index = df[(df[col] > Q3 + outlier_step) | (df[col] < Q1 - outlier_step)].index
        outlier_indices.extend(outlier_index)
    # Count how many times each row was flagged
    outlier_dict = Counter(outlier_indices)
    # A row is treated as an outlier only if more than n features are outlying
    outlier_indices = [k for k, v in outlier_dict.items() if v > n]
    return outlier_indices

outlier_index = detect_outliers(train, 2, ['Age', 'SibSp', 'Parch', 'Fare'])
# Drop the outlying rows
train = train.drop(outlier_index, axis=0).reset_index(drop=True)
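To make the fences concrete, here is a small worked example on a single made-up feature (the values are illustrative, not taken from the Titanic data). It is only a sketch of the rule described above.

```python
import numpy as np

# Toy feature values; 100 is an obvious extreme.
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

q1, q3 = np.percentile(values, [25, 75])  # 2.0 and 4.0
iqr = q3 - q1                             # 2.0
outlier_step = 1.5 * iqr                  # 3.0
lower_fence = q1 - outlier_step           # -1.0
upper_fence = q3 + outlier_step           # 7.0

# A value is an outlier if it falls outside [lower_fence, upper_fence].
is_outlier = (values < lower_fence) | (values > upper_fence)
outliers = values[is_outlier]
```

Only the value 100 lies outside the interval [-1, 7], so it is the only one flagged.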
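Other methods:
- EDA: box plots for qualitative analysis, the pandas describe() function for quantitative analysis
- Binning (discretizing continuous features)
- The 3-sigma rule, for features that are approximately normally distributed
- Replacing outliers with the mean or median: simple and efficient, with little information lost
- Preferring tree models: tree splits depend on the ordering of feature values rather than their magnitude, so tree models are fairly robust to outliers in the features and no samples need to be discarded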
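Two of the items above, the 3-sigma rule and median replacement, can be combined in a few lines. The sketch below uses made-up values and assumes the feature is roughly normal apart from the extreme point; note that with very small samples a single extreme value inflates the standard deviation enough that nothing exceeds 3 sigma.

```python
import numpy as np

# 30 ordinary values plus one extreme value (illustrative data).
values = np.array([9.0, 10.0, 11.0] * 10 + [50.0])

mean, std = values.mean(), values.std()

# 3-sigma rule: flag values more than three standard deviations from the mean.
is_outlier = np.abs(values - mean) > 3 * std

# Replace flagged values with the median, which is barely affected by the extreme.
cleaned = np.where(is_outlier, np.median(values), values)
```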
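1.2 Missing value handling
(1) Filling with a statistic: fill directly with the mode, the mean, an extreme value (max or min), and so on
For continuous variables:
If the missing rate is relatively low (below 95%) and the feature is of low importance, fill according to the distribution of the data.
- If the data are approximately normally distributed, fill with the mean of the variable.
- If the data are skewed, fill with the median.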
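import seaborn as sns

# Plot the distribution of a feature separately for each value of the target
def plotContinuousVar(df, col, TARGET):
    g = sns.kdeplot(df[col][df[TARGET] == 0], color='red')
    g = sns.kdeplot(df[col][df[TARGET] == 1], ax=g, color='green')
    g.set_xlabel(col)
    g.set_ylabel('Frequency')
    g = g.legend(['0', '1'])

plotContinuousVar(train, 'Fare', 'Survived')
How to tell a normal distribution from a skewed one: a normal distribution is symmetric, whereas a skewed distribution is asymmetric, with a longer tail on the left (left-skewed) or on the right (right-skewed).
# The plot shows that Fare is skewed, so fill with the median
train['Fare'] = train['Fare'].fillna(train['Fare'].median())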
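Besides inspecting the plot, the choice between mean and median can be made from the sample skewness. The sketch below is one possible way to do that; the helper name fill_by_skewness and the 0.5 cutoff are assumptions for illustration, not part of the original workflow.

```python
import numpy as np
import pandas as pd

def fill_by_skewness(series, skew_threshold=0.5):
    """Fill NaNs with the mean if the data look symmetric, else with the median.

    skew_threshold is an illustrative cutoff, not a value from the article.
    """
    skewness = series.skew()  # NaNs are skipped by default
    if abs(skewness) < skew_threshold:
        fill_value = series.mean()
    else:
        fill_value = series.median()
    return series.fillna(fill_value)

symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, np.nan])
right_skewed = pd.Series([1.0, 1.0, 2.0, 2.0, 50.0, np.nan])

filled_symmetric = fill_by_skewness(symmetric)   # NaN -> mean, 3.0
filled_skewed = fill_by_skewness(right_skewed)   # NaN -> median, 2.0
```

For the skewed series the mean would be 11.2, dragged up by the single large value, which is why the median is the safer fill.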
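For discrete variables:
- The missing value can be treated as a category of its own, for example set to None, and handled later with one-hot or label encoding.
- If few values are missing, fill with the mode.
# Only 2 values are missing, so fill with the most frequent value
train['Embarked'] = train['Embarked'].fillna('S')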
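Both options for discrete variables can be sketched on a toy column. The values below are made up, and the mode is computed rather than hard-coded.

```python
import numpy as np
import pandas as pd

embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S"], dtype=object)

# Option 1: treat "missing" as a category of its own.
as_category = embarked.fillna("None")

# Option 2: fill with the most frequent value (mode() ignores NaN).
mode_value = embarked.mode()[0]
mode_filled = embarked.fillna(mode_value)
```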
(2) Model-based filling: use the feature with missing values as the label and other related features as the training features
from sklearn.ensemble import RandomForestRegressor

# Fill missing Age values with a random forest
def set_missing_ages(df):
    # Take the numeric features to feed to the random forest
    # (Title is assumed to be numerically encoded already)
    num_df = df[['Age', 'Fare', 'Parch', 'SibSp', 'Pclass', 'Title']]
    # Split passengers into those with known and unknown age
    known_age = num_df[num_df.Age.notnull()].to_numpy()
    unknown_age = num_df[num_df.Age.isnull()].to_numpy()
    # y is the target: age
    y = known_age[:, 0]
    # X holds the feature values
    X = known_age[:, 1:]
    rfr = RandomForestRegressor(n_estimators=100, random_state=0, n_jobs=-1)
    rfr.fit(X, y)
    # Predict with the fitted model
    y_pre = rfr.predict(unknown_age[:, 1:])
    # Fill the missing entries with the predictions
    df.loc[df.Age.isnull(), 'Age'] = y_pre
    return df

train = set_missing_ages(train)
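(3) Leaving it to XGBoost (LightGBM): both handle missing values natively during training, so no explicit filling is needed
1.3 One-hot encoding: handling categorical variables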
import pandas as pd

def one_hot_encoder(df, nan_as_category=True):
    original_columns = list(df.columns)
    # Columns of dtype object are treated as categorical
    categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    # dummy_na=True adds an indicator column for missing values
    df = pd.get_dummies(df, columns=categorical_columns, dummy_na=nan_as_category)
    new_columns = [c for c in df.columns if c not in original_columns]
    return df, new_columns
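A short usage example on a toy dataframe shows what the function returns. The function is repeated so the snippet runs on its own, and the data are made up.

```python
import pandas as pd

def one_hot_encoder(df, nan_as_category=True):
    original_columns = list(df.columns)
    categorical_columns = [col for col in df.columns if df[col].dtype == "object"]
    df = pd.get_dummies(df, columns=categorical_columns, dummy_na=nan_as_category)
    new_columns = [c for c in df.columns if c not in original_columns]
    return df, new_columns

toy = pd.DataFrame({
    "Age": [22, 38, 26],
    # dtype=object is set explicitly so the column is detected as categorical
    "Sex": pd.Series(["male", "female", None], dtype=object),
})

encoded, new_columns = one_hot_encoder(toy)
print(new_columns)
```

The original Sex column is replaced by one indicator column per category, plus a Sex_nan column marking the missing entry.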
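1.4 Log and Box-Cox transforms
The log transform is aimed mainly at linear models: it brings a skewed feature closer to a normal distribution so that the model's assumptions are better satisfied.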
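import numpy as np
import seaborn as sns

# Show the skewness in the legend
# (histplot with kde=True replaces distplot, which is deprecated in recent seaborn)
g = sns.histplot(train['Fare'], kde=True, color='m', label='skewness:%.2f' % (train['Fare'].skew()))
g = g.legend(loc='best')

# Use a log transform to deal with the skewed Fare distribution
train['Fare'] = train['Fare'].map(lambda i: np.log(i) if i > 0 else 0)
g = sns.histplot(train['Fare'], kde=True, color='m', label='skewness:%.2f' % (train['Fare'].skew()))
g = g.legend(loc='best')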
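Box-Cox transform: Box and Cox (1964) proposed a family of transformations indexed by a single parameter λ:
x(λ) = (x^λ - 1) / λ   if λ ≠ 0
x(λ) = log(x)          if λ = 0
Compared with the log transform, this family also includes the square, square-root, and reciprocal transforms, as well as everything in between. In practice, the Box-Cox transform is more direct and less prone to computational problems, and it works equally well for predictor variables.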
from scipy.special import boxcox1p

# boxcox1p applies the Box-Cox transform to 1 + x, so zero values are handled
lam = 0.15
dataset[feat] = boxcox1p(dataset[feat], lam)
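The Box-Cox family is defined as (x^λ - 1) / λ for λ ≠ 0 and log(x) for λ = 0. The NumPy sketch below implements that definition directly (the function name box_cox is an illustrative choice) to show how particular values of λ reduce, up to a shift and scale, to the familiar transforms.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform of strictly positive x for a given lambda."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1.0) / lam

x = np.array([1.0, 4.0, 9.0])

log_like = box_cox(x, 0)          # log(x)
linear_like = box_cox(x, 1)       # x - 1
sqrt_like = box_cox(x, 0.5)       # 2 * (sqrt(x) - 1)
reciprocal_like = box_cox(x, -1)  # 1 - 1 / x
```

scipy.special.boxcox1p(x, lam) corresponds to box_cox(1 + x, lam), which is why it can be applied to features that contain zeros.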
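II. Feature Construction
Feature construction relies mainly on an understanding of the business; newly constructed features can strengthen the ability to predict the label. For example, take a simple binary classification problem: train a body-type classifier with logistic regression. The input X contains height and weight, and the label Y is overweight or not. Experience tells us that weight alone cannot determine whether a person is overweight. A classic constructed feature for this task is the BMI index, BMI = weight / height^2, which describes a person's build much better. Only some general construction ideas are summarized here; specific business problems need to be handled case by case.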
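The BMI example can be written as a one-line derived column. The column names and values below are made up for illustration; weight is in kilograms and height in metres.

```python
import pandas as pd

people = pd.DataFrame({
    "height_m": [1.75, 1.60],
    "weight_kg": [70.0, 80.0],
})

# Constructed feature: BMI = weight / height^2
people["bmi"] = people["weight_kg"] / people["height_m"] ** 2
```

The two people differ by only 10 kg, but their BMI values (about 22.9 and 31.25) separate them far more clearly than weight alone.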
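2.1 Polynomial features
Example: for features x and y, the derived second-order polynomial features include x^2, y^2, and xy.
Products of features introduce interactions between features, and thereby non-linearity.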
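As a minimal sketch of what "second-order terms" means, the snippet below enumerates them by hand for one sample with x = 2 and y = 3. The helper second_order_terms is written for illustration only; it produces just the degree-2 terms, without the bias and first-order terms a library transformer would also return.

```python
from itertools import combinations_with_replacement

def second_order_terms(row, names):
    """Return all degree-2 products of the values in row, keyed by name."""
    terms = {}
    for i, j in combinations_with_replacement(range(len(row)), 2):
        name = f"{names[i]}^2" if i == j else f"{names[i]} {names[j]}"
        terms[name] = row[i] * row[j]
    return terms

terms = second_order_terms([2.0, 3.0], ["x", "y"])
print(terms)  # {'x^2': 4.0, 'x y': 6.0, 'y^2': 9.0}
```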
The code below uses the Kaggle Home Credit Default Risk dataset as an example; the implementation is general.
from sklearn.preprocessing import PolynomialFeatures

# Put the features to be crossed into a new dataframe
# (missing values are assumed to have been imputed already)
poly_features = dataset[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']]

poly_transformer = PolynomialFeatures(degree=3)
poly_transformer.fit(poly_features)
poly_features = poly_transformer.transform(poly_features)
print(poly_features.shape)
# (355912, 20)

# Inspect the derived polynomial features
# (get_feature_names_out replaces get_feature_names in recent scikit-learn)
print('Polynomial feature names:\n{}'.format(
    poly_transformer.get_feature_names_out(
        ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']).tolist()))
# Polynomial feature names:
# ['1', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'EXT_SOURCE_1^2', 'EXT_SOURCE_1 EXT_SOURCE_2',
#  'EXT_SOURCE_1 EXT_SOURCE_3', 'EXT_SOURCE_2^2', 'EXT_SOURCE_2 EXT_SOURCE_3', 'EXT_SOURCE_3^2', 'EXT_SOURCE_1^3',
#  'EXT_SOURCE_1^2 EXT_SOURCE_2', 'EXT_SOURCE_1^2 EXT_SOURCE_3', 'EXT_SOURCE_1 EXT_SOURCE_2^2', 'EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3',
#  'EXT_SOURCE_1 EXT_SOURCE_3^2', 'EXT_SOURCE_2^3', 'EXT_SOURCE_2^2 EXT_SOURCE_3', 'EXT_SOURCE_2 EXT_SOURCE_3^2', 'EXT_SOURCE_3^3']
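2.2 Statistical features
Here the dataset from the Kaggle Home Credit Default Risk competition (loan default prediction) is used as an example.
The customer information table application_train.csv has 122 original features; for ease of presentation, only 9 of them are shown here. The competition also provides other auxiliary tables, of which only the credit card table is used here. SK_ID_CURR is unique in the application_train table but not in the credit_card_balance table, so the latter has to be aggregated.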
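Table 1: customer information table application_train
SK_ID_CURR            Customer ID
TARGET                Label, 0/1
NAME_CONTRACT_TYPE    Loan type (revolving or cash)
CODE_GENDER           Gender
FLAG_OWN_CAR          Owns a car
FLAG_OWN_REALTY       Owns real estate
CNT_CHILDREN          Number of children
AMT_INCOME_TOTAL      Annual income
AMT_CREDIT            Loan amount

Table 2: credit card table credit_card_balance
SK_ID_CURR            Customer ID
MONTHS_BALANCE        Month of the balance relative to the application date (-1 is the most recent balance date)
AMT_BALANCE           Balance during the month of the previous credit
...

Different aggregation statistics are applied to different features.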
import pandas as pd

# cc is the credit_card_balance dataframe; the keys below are columns assumed to exist in it
time_agg = {
    'CC_PAID_LATE': ['mean', 'sum'],
    'CC_PAID_LATE_WITH_TOLERANCE': ['mean', 'sum'],
    'AMT_CREDIT_LIMIT_ACTUAL': ['mean', 'var', 'median'],
    'AMT_LAST_DEBT': ['mean', 'var'],
    'DRAWINGS_ATM_RATIO': ['mean', 'var', 'max', 'min', 'median'],
    'DRAWINGS_POS_RATIO': ['mean', 'var', 'max', 'min', 'median'],
    'RECEIVABLE_PAYMENT_RATIO': ['mean', 'var', 'max', 'min', 'median'],
    'AMT_PAYMENT_TOTAL_CURRENT': ['mean', 'var', 'median'],
    'AMT_INST_MIN_REGULARITY': ['mean', 'median'],
    'CNT_INSTALMENT_MATURE_CUM': ['max'],
    'AMT_DRAWINGS_POS_CURRENT': ['mean']
}
cc_agg = cc.groupby('SK_ID_CURR').agg(time_agg)
# Flatten the (feature, statistic) column pairs into names like CC_<feature>_<STAT>_75
cc_agg.columns = pd.Index(['CC_' + e[0] + '_' + e[1].upper() + '_' + '75' for e in cc_agg.columns.tolist()])
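The same pattern on a toy table makes the effect of the aggregation visible: several rows per customer collapse into one row with one column per (feature, statistic) pair. The data are made up, and the '_75' suffix used above is omitted here.

```python
import pandas as pd

# Two rows for customer 1, one row for customer 2.
cc = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_BALANCE": [100.0, 300.0, 50.0],
})

agg_spec = {"AMT_BALANCE": ["mean", "max"]}
cc_agg = cc.groupby("SK_ID_CURR").agg(agg_spec)

# Flatten the two-level column index into single names.
cc_agg.columns = pd.Index(
    ["CC_" + e[0] + "_" + e[1].upper() for e in cc_agg.columns.tolist()]
)
```

After aggregation SK_ID_CURR is unique (it is the index of cc_agg), so the result can be joined onto the application table one-to-one.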
2.3 Time features: converting seconds into minutes and hours
import pandas as pd

# Time feature processing: df['Time'] holds elapsed seconds
timedelta = pd.to_timedelta(df['Time'], unit='s')
df['Minute'] = (timedelta.dt.components.minutes).astype(int)
df['Hour'] = (timedelta.dt.components.hours).astype(int)
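A quick check on made-up values shows what the two components contain. Both are clock-style components, so the hour wraps around after 24 hours rather than counting total elapsed hours.

```python
import pandas as pd

df = pd.DataFrame({"Time": [0, 3725, 90061]})  # seconds

timedelta = pd.to_timedelta(df["Time"], unit="s")
df["Minute"] = timedelta.dt.components.minutes.astype(int)
df["Hour"] = timedelta.dt.components.hours.astype(int)
```

Here 3725 s is 1 h 2 min 5 s, giving Hour = 1 and Minute = 2, while 90061 s is 1 day 1 h 1 min 1 s, which also gives Hour = 1 because the day component is dropped.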
2.4 Frequency features: for discrete features with many distinct values, encode how frequently each value occurs
def frequence_encoding(df, feature):
    # Frequency of each value = count of the value / total number of samples
    freq_dict = dict(df[feature].value_counts() / df.shape[0])
    new_feature = feature + '_freq'
    df[new_feature] = df[feature].map(freq_dict)
    return df
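A toy run of the encoder follows; the function is repeated so the snippet is self-contained, and the column name city is made up.

```python
import pandas as pd

def frequence_encoding(df, feature):
    freq_dict = dict(df[feature].value_counts() / df.shape[0])
    df[feature + "_freq"] = df[feature].map(freq_dict)
    return df

toy = pd.DataFrame({"city": ["a", "a", "b", "c"]})
toy = frequence_encoding(toy, "city")
```

The value "a" appears in 2 of 4 rows and is encoded as 0.5; "b" and "c" each appear once and are encoded as 0.25.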
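III. Feature Selection
Feature selection methods include variance thresholding, the Pearson correlation coefficient, mutual information, regularization, and so on. Because tree models are so widely used, ranking features by tree-model importance is an efficient and common approach. However, the importances a model reports carry some bias, which can interfere with feature selection. This section introduces a method that has been used on Kaggle, the PIMP algorithm (Permutation Importance), whose main idea is to correct the existing feature importances. The algorithm is as follows: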
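1. Shuffle the labels to obtain a new training set, retrain, and evaluate feature importance.
2. Repeat step 1 n times to obtain, for each feature, a set of importances from the repeated evaluations, called the null importance.
3. Compute the feature importances obtained when the model is trained with the labels in their true order.
4. Using the set from step 2, compute a corrected score for each feature.
5. The correction rule is: Corrected_gain_score = 100 * (null_importance_gain < np.percentile(actual_imp, 25)).sum() / null_importance_gain.size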
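The correction rule in step 5 can be tried on made-up numbers before running it on a real model. The sketch below assumes null_importance_gain holds one feature's gains from the shuffled runs and actual_imp holds its gains from training on the true labels; the function name is illustrative.

```python
import numpy as np

def corrected_gain_score(null_importance_gain, actual_imp):
    """Percentage of null importances that fall below the 25th percentile
    of the actual importances."""
    null_importance_gain = np.asarray(null_importance_gain, dtype=float)
    threshold = np.percentile(actual_imp, 25)
    return 100.0 * (null_importance_gain < threshold).sum() / null_importance_gain.size

null_gains = [1.0, 2.0, 3.0, 4.0]

# Actual importance far above anything seen with shuffled labels.
informative = corrected_gain_score(null_gains, [10.0, 12.0])

# Actual importance in the middle of the null distribution.
noisy = corrected_gain_score(null_gains, [2.5, 2.5])
```

The informative feature scores 100 and the noisy one scores 50, so a score threshold between 0 and 100 separates features whose importance is no better than chance from those that clearly exceed it.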
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.metrics import roc_auc_score

def get_feature_importance(data, shuffle, seed=None):
    # Keep the useful features; drop the target and id columns
    train_features = [f for f in data if f not in ['TARGET', 'SK_ID_CURR']]
    # Shuffle the target if requested
    y = data['TARGET'].copy()
    if shuffle:
        y = data['TARGET'].copy().sample(frac=1.0)  # random permutation of the labels
    # categorical_feats: list of categorical column names, assumed to be defined beforehand
    dtrain = lgb.Dataset(data[train_features], y, free_raw_data=False,
                         categorical_feature=categorical_feats)
    lgb_params = {
        'objective': 'binary',
        'boosting_type': 'rf',
        'subsample': 0.623,
        'colsample_bytree': 0.7,
        'num_leaves': 127,
        'max_depth': 8,
        'seed': seed,
        'bagging_freq': 1,
        'n_jobs': 4,
        'verbose': -1
    }
    clf = lgb.train(params=lgb_params, train_set=dtrain, num_boost_round=200)
    imp_df = pd.DataFrame()
    imp_df['feature'] = list(train_features)
    imp_df['importance_gain'] = clf.feature_importance(importance_type='gain')
    imp_df['train_score'] = roc_auc_score(y, clf.predict(data[train_features]))
    return imp_df

# Feature importance with the target in its true order
actual_imp_df = get_feature_importance(data=data, shuffle=False)

# Feature importance over n runs with the target shuffled
null_imp_df = pd.DataFrame()
nb_runs = 30  # number of shuffled runs
for i in range(nb_runs):
    imp_df = get_feature_importance(data=data, shuffle=True)
    imp_df['run'] = i + 1
    null_imp_df = pd.concat([null_imp_df, imp_df], axis=0)

# Compute the corrected feature scores using the rule from step 5
feature_scores = []
for feature in actual_imp_df['feature'].unique():
    f_null_imps_gain = null_imp_df.loc[null_imp_df['feature'] == feature, 'importance_gain'].values
    f_act_imps_gain = actual_imp_df.loc[actual_imp_df['feature'] == feature, 'importance_gain'].values
    corrected_gain_score = 100 * (f_null_imps_gain < np.percentile(f_act_imps_gain, 25)).sum() / f_null_imps_gain.size
    feature_scores.append((feature, corrected_gain_score))

# Select features with different score thresholds
for threshold in [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 99]:
    gain_feats = [feature for feature, score in feature_scores if score >= threshold]
    print(threshold, len(gain_feats))
IV. Building the LightGBM Model
# Uses the Kaggle Home Credit Default Risk dataset as an example; the implementation is general
import gc
import numpy as np
import pandas as pd
import lightgbm as lgb
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, StratifiedKFold

# Output path for the test-set predictions (placeholder name)
submission_file_name = 'submission.csv'

def kfold_lightgbm(df, num_folds, stratified=False, debug=False):
    # Rows with a TARGET are the training set; rows without are the test set
    train_df = df[df['TARGET'].notnull()]
    test_df = df[df['TARGET'].isnull()]
    print("Starting LightGBM. Train shape: {}, test shape: {}".format(train_df.shape, test_df.shape))
    del df
    gc.collect()
    if stratified:
        folds = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=1001)
    else:
        folds = KFold(n_splits=num_folds, shuffle=True, random_state=1001)
    oof_preds = np.zeros(train_df.shape[0])
    sub_preds = np.zeros(test_df.shape[0])
    feature_importance_df = pd.DataFrame()
    # Drop the label and id columns
    feats = [f for f in train_df.columns if f not in ['TARGET', 'SK_ID_CURR', 'SK_ID_BUREAU', 'SK_ID_PREV', 'index']]
    # K-fold cross-validation
    for n_fold, (train_idx, valid_idx) in enumerate(folds.split(train_df[feats], train_df['TARGET'])):
        dtrain = lgb.Dataset(data=train_df[feats].iloc[train_idx],
                             label=train_df['TARGET'].iloc[train_idx],
                             free_raw_data=False)
        dvalid = lgb.Dataset(data=train_df[feats].iloc[valid_idx],
                             label=train_df['TARGET'].iloc[valid_idx],
                             free_raw_data=False)
        # LightGBM parameters found by Bayesian optimization
        params = {
            'objective': 'binary',
            'boosting_type': 'gbdt',
            'nthread': 4,
            'learning_rate': 0.02,
            'num_leaves': 20,
            'colsample_bytree': 0.9497036,
            'subsample': 0.8715623,
            'subsample_freq': 1,
            'max_depth': 8,
            'reg_alpha': 0.041545473,
            'reg_lambda': 0.0735294,
            'min_split_gain': 0.0222415,
            'min_child_weight': 60,
            'seed': 0,
            'verbose': -1,
            'metric': 'auc',
        }
        # Recent LightGBM versions configure early stopping through callbacks
        clf = lgb.train(
            params=params,
            train_set=dtrain,
            num_boost_round=13000,
            valid_sets=[dtrain, dvalid],
            callbacks=[lgb.early_stopping(stopping_rounds=200, verbose=False)]
        )
        # Out-of-fold predictions for this fold, and averaged test predictions
        oof_preds[valid_idx] = clf.predict(train_df[feats].iloc[valid_idx])
        sub_preds += clf.predict(test_df[feats]) / folds.n_splits
        fold_importance_df = pd.DataFrame()
        fold_importance_df["feature"] = feats
        fold_importance_df["importance"] = clf.feature_importance(importance_type='gain')
        fold_importance_df["fold"] = n_fold + 1
        feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
        print('Fold %2d AUC : %.6f' % (n_fold + 1, roc_auc_score(train_df['TARGET'].iloc[valid_idx], oof_preds[valid_idx])))
        del clf, dtrain, dvalid
        gc.collect()
    print('Full AUC score %.6f' % roc_auc_score(train_df['TARGET'], oof_preds))
    # Write the test-set predictions and plot the feature importances
    if not debug:
        sub_df = test_df[['SK_ID_CURR']].copy()
        sub_df['TARGET'] = sub_preds
        sub_df[['SK_ID_CURR', 'TARGET']].to_csv(submission_file_name, index=False)
    display_importances(feature_importance_df)
    feature_importance_df = feature_importance_df.groupby('feature')['importance'].mean().reset_index().rename(index=str, columns={'importance': 'importance_mean'})
    feature_importance_df.to_csv('feature_importance.csv')
    # Show the features ranked below the top 500 by mean importance
    print(feature_importance_df.sort_values('importance_mean', ascending=False)[500:])
    return feature_importance_df

# Plot the feature importances
def display_importances(feature_importance_df_):
    # Top 40 features by mean importance across folds
    cols = feature_importance_df_[["feature", "importance"]].groupby("feature").mean().sort_values(by="importance", ascending=False)[:40].index
    best_features = feature_importance_df_.loc[feature_importance_df_.feature.isin(cols)]
    plt.figure(figsize=(8, 10))
    sns.barplot(x="importance", y="feature", data=best_features.sort_values(by="importance", ascending=False))
    plt.title('LightGBM Features (avg over folds)')
    plt.tight_layout()
    plt.savefig('lgbm_importances01.png')
V. Automatic Hyperparameter Tuning
5.1 Grid search: exhaustively try every combination of parameters and pick the best one
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

kfold = StratifiedKFold(n_splits=10)
tree = DecisionTreeClassifier(random_state=0)
tree_param_grid = {
    'max_depth': [7, 8, 9, 11, 12, 13, 14],
    'min_samples_leaf': [2, 3, 4, 5, 6, 7, 8, 9],
    'min_samples_split': [2, 3, 4, 5, 6, 7, 8, 9]
}
tree_grid_search = GridSearchCV(tree, param_grid=tree_param_grid, cv=kfold, scoring='accuracy', n_jobs=-1)
tree_grid_search.fit(X_train, y_train)
print('Best parameters:{}'.format(tree_grid_search.best_params_))
# Best parameters:{'max_depth': 7, 'min_samples_leaf': 2, 'min_samples_split': 2}
print('Best cv score:{}'.format(tree_grid_search.best_score_))
# Best cv score:0.8149829738933031
print('Accuracy training set:{}'.format(tree_grid_search.score(X_train, y_train)))
# Accuracy training set:0.8683314415437003
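To see what "exhaustive" costs, the grid above can be expanded by hand with the standard library. This is only a sketch of the enumeration, not of how GridSearchCV is implemented internally.

```python
from itertools import product

tree_param_grid = {
    "max_depth": [7, 8, 9, 11, 12, 13, 14],
    "min_samples_leaf": [2, 3, 4, 5, 6, 7, 8, 9],
    "min_samples_split": [2, 3, 4, 5, 6, 7, 8, 9],
}

# Every combination of one value per parameter.
names = list(tree_param_grid)
candidates = [dict(zip(names, combo)) for combo in product(*tree_param_grid.values())]

n_splits = 10  # folds used in the cross-validation above
total_fits = len(candidates) * n_splits
print(len(candidates), total_fits)  # 448 4480
```

The grid has 7 * 8 * 8 = 448 candidates, so 10-fold cross-validation trains 4480 models; the count multiplies with every parameter added, which is the main motivation for smarter search strategies.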
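5.2 Bayesian optimization
Whereas grid search exhaustively enumerates all candidates, Bayesian optimization takes the results of previously evaluated parameters into account and keeps adjusting the next parameters to try.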
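import gc
import numpy as np
import lightgbm as lgb
from hyperopt import fmin, tpe, hp, space_eval, STATUS_OK
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedKFold

# Custom LightGBM evaluation metric; the original code assumes this helper exists,
# and the definition here is a sketch of it
def average_precision_score_vali(preds, dvalid):
    labels = dvalid.get_label()
    return 'average_precision', average_precision_score(labels, preds), True

# feats: the features in train_df that are used for training
def kfold_lightgbm(train_df, feats, params, n_folds=5):
    folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    oof_preds = np.zeros(train_df.shape[0])
    for n_fold, (train_idx, valid_idx) in enumerate(folds.split(train_df[feats], train_df['Class'])):
        dtrain = lgb.Dataset(data=train_df[feats].iloc[train_idx],
                             label=train_df['Class'].iloc[train_idx],
                             free_raw_data=False)
        dvalid = lgb.Dataset(data=train_df[feats].iloc[valid_idx],
                             label=train_df['Class'].iloc[valid_idx],
                             free_raw_data=False)
        # Recent LightGBM versions configure early stopping through callbacks
        clf = lgb.train(
            params=params,
            train_set=dtrain,
            num_boost_round=1000,
            valid_sets=[dtrain, dvalid],
            feval=average_precision_score_vali,
            callbacks=[lgb.early_stopping(stopping_rounds=20, verbose=False)]
        )
        oof_preds[valid_idx] = clf.predict(train_df[feats].iloc[valid_idx])
        del clf, dtrain, dvalid
        gc.collect()
    return average_precision_score(train_df['Class'], oof_preds)

# Objective function; train_df and feats are assumed to be defined beforehand
def lgb_objective(params, n_folds=5):
    # hyperopt minimizes the loss, so negate the average precision
    loss = -kfold_lightgbm(train_df, feats, params, n_folds)
    return {'loss': loss, 'params': params, 'status': STATUS_OK}

# Define the search space
space = {
    'objective': 'binary',  # the target 'Class' is binary
    'boosting_type': 'gbdt',
    'subsample': 0.8,
    'colsample_bytree': hp.uniform('colsample_bytree', 0.8, 0.9),
    'max_depth': 7,
    'learning_rate': 0.01,
    'lambda_l1': hp.uniform('lambda_l1', 0.0, 0.2),
    'seed': 0,
    'verbose': -1,
}

# Define the optimization algorithm
tpe_algorithm = tpe.suggest
best = fmin(fn=lgb_objective, space=space, algo=tpe_algorithm, max_evals=50)
print(best)
# Map the best point back to a full parameter dictionary
result = space_eval(space, best)
print(result)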