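I'd say you can split the data with train_test_split, train a classification algorithm on the training part, and evaluate it with an appropriate metric.
Here is what I used (for a regression problem, though); you can adapt it to your needs: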
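import numpy as np
from sklearn.ensemble import (ExtraTreesRegressor, GradientBoostingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

X = TRAIN_DS[["season", "holiday", "workingday", "weather", "weekday",
              "month", "year", "hour", "humidity", "temperature"]]
y = TRAIN_DS["count"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
estimators = [("randf", RandomForestRegressor(max_depth=50, n_estimators=1500)),
              ("gradb", GradientBoostingRegressor(max_depth=5, n_estimators=400)),
              ("gradb2", GradientBoostingRegressor(n_estimators=4000)),
              ("svr", SVR(kernel="rbf", gamma="auto")),
              ("ext", ExtraTreesRegressor(n_estimators=4000))]
stacking = StackingRegressor(estimators)
# The target is fitted on a log1p scale, so predictions must be passed through np.expm1
stacking.fit(X=X_train, y=np.log1p(y_train))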
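Because the model above is fitted on log1p of the target, its predictions come back on the log scale and have to be inverted before scoring. Here is a minimal sketch of that round trip. The synthetic data, the small estimator sizes, and the names `X_demo`, `y_demo` and `demo_stack` are my own assumptions for illustration, not part of the original setup:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.model_selection import train_test_split

# Synthetic, non-negative target standing in for the 'count' column (assumption).
X_demo, y_raw = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
y_demo = np.abs(y_raw)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Deliberately small ensemble so the sketch runs quickly.
demo_stack = StackingRegressor(estimators=[
    ("randf", RandomForestRegressor(n_estimators=20, random_state=0)),
    ("gradb", GradientBoostingRegressor(n_estimators=20, random_state=0)),
])
demo_stack.fit(X_tr, np.log1p(y_tr))

# Undo the log1p transform to get predictions on the original scale.
y_pred = np.expm1(demo_stack.predict(X_te))
```

If you skip the np.expm1 step, the metrics will compare log-scale predictions against raw counts and look far worse than they should.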
To pick the best model, I suggest using an appropriate metric. Here is a function that reports RMSLE and R2, which you may find useful:
import numpy as np
from sklearn.metrics import mean_squared_log_error, r2_score

def calc_plot(y_test, y_pred, name):
    '''Print the RMSLE and R2 scores for a model's predictions (plotting omitted).'''
    # Replace negative predictions with 0 before the logarithm is taken
    y_pred = np.clip(y_pred, 0, None)
    # Printing scoring
    print('RMSLE for ' + name + ':', np.sqrt(mean_squared_log_error(y_test, y_pred)))
    print('R2 for ' + name + ':', r2_score(y_test, y_pred))
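In case it helps to see what RMSLE actually measures, here is a small check that compares scikit-learn's result with the formula written out by hand. The four values in `y_true` and `y_hat` are made up purely for illustration:

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error, r2_score

# Made-up targets and predictions (illustrative only).
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

# RMSLE via scikit-learn: square root of the mean squared log error.
rmsle_sklearn = np.sqrt(mean_squared_log_error(y_true, y_hat))

# The same quantity by hand: RMSE computed on log(1 + y).
rmsle_manual = np.sqrt(np.mean((np.log1p(y_hat) - np.log1p(y_true)) ** 2))

r2 = r2_score(y_true, y_hat)
```

RMSLE penalises relative rather than absolute errors, which is why it is a common choice for count-like targets.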
You can also use a VotingClassifier or a StackingClassifier to combine several models for prediction.
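For a classification problem, the same idea might look roughly like this. It is only a sketch: the synthetic dataset, the choice of base models, and every variable name here are assumptions of mine, so swap in your own data and estimators:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary classification data (assumption, for illustration only).
X_clf, y_clf = make_classification(n_samples=300, n_features=8, random_state=0)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_clf, y_clf, test_size=0.2, random_state=42)

base_models = [
    ("randf", RandomForestClassifier(n_estimators=50, random_state=0)),
    # probability=True is needed so SVC can take part in soft voting.
    ("svc", SVC(kernel="rbf", gamma="auto", probability=True, random_state=0)),
    ("logreg", LogisticRegression(max_iter=1000)),
]

# Soft voting averages the predicted class probabilities of the base models.
voting_clf = VotingClassifier(estimators=base_models, voting="soft")
voting_clf.fit(Xc_train, yc_train)
voting_acc = accuracy_score(yc_test, voting_clf.predict(Xc_test))

# Stacking trains a final estimator on the base models' cross-validated outputs.
stacking_clf = StackingClassifier(
    estimators=base_models, final_estimator=LogisticRegression(max_iter=1000))
stacking_clf.fit(Xc_train, yc_train)
stacking_acc = accuracy_score(yc_test, stacking_clf.predict(Xc_test))
```

Both ensembles clone the base models internally, so the same list can be reused for each. Accuracy is used here only as a placeholder; pick whichever metric suits your class balance.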
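Finally, you can use GridSearchCV to try different values for the parameters of whichever classification algorithm you use. An example for a regression problem: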
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV, LeaveOneOut

gr = SGDRegressor()
# 'squared_error' was called 'squared_loss' before scikit-learn 1.0
parameters = {'loss':['squared_error','huber','epsilon_insensitive','squared_epsilon_insensitive'], 'penalty':['l2','l1','elasticnet'],
              'fit_intercept':[True,False], 'learning_rate':['constant','optimal','invscaling','adaptive'], 'alpha':[0.0001,0.005,0.001],
              'l1_ratio':[0.15,0.5,0.25], 'max_iter':[500,1000,2000], 'epsilon':[0.1,0.4], 'eta0':[0.01,0.05,0.1], 'power_t':[0.25,0.1,0.5],
              'early_stopping':[True,False], 'warm_start':[True,False], 'average':[True,False], 'n_iter_no_change':[3,5,10,15]}
# Nothing runs until lModel.fit(X_train, y_train) is called; a grid this large with LeaveOneOut is very expensive
lModel = GridSearchCV(gr, parameters, cv=LeaveOneOut(), scoring='neg_mean_absolute_error')
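To show how the search is actually run and read back, here is a much smaller sketch. The synthetic data, the reduced two-parameter grid, the 5-fold split and the names `small_grid` and `search` are assumptions I picked so that it finishes quickly; they are not a recommendation for your data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (assumption, for illustration only).
X_gs, y_gs = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

# A deliberately small grid: 2 x 3 = 6 candidates, each evaluated with 5-fold CV.
small_grid = {"penalty": ["l2", "l1"], "alpha": [0.0001, 0.001, 0.01]}

search = GridSearchCV(
    SGDRegressor(max_iter=2000, random_state=0),
    small_grid,
    cv=5,
    scoring="neg_mean_absolute_error",
)
search.fit(X_gs, y_gs)

best_params = search.best_params_   # best parameter combination found
best_mae = -search.best_score_      # scores are negated MAE, so flip the sign
best_model = search.best_estimator_ # refitted on all of X_gs, y_gs
```

Starting with a coarse grid like this and then refining around the best values is usually far cheaper than searching everything at once.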
Hope this helps!