【Title】: Light GBM Regression CV Interpreting Results
【Posted】: 2021-07-03 18:43:01
【Question】:

I went through the documentation but could not find an answer to my question, so I am hoping someone here knows. Here is some sample code:

import lightgbm as lgb

N_FOLDS = 5

model = lgb.LGBMClassifier()
default_params = model.get_params()

# overwrite the objective so the parameters describe a regression task
default_params['objective'] = 'regression'

# train_set is an lgb.Dataset built elsewhere (not shown)
cv_results = lgb.cv(default_params, train_set, num_boost_round=100000, nfold=N_FOLDS,
                    early_stopping_rounds=100, metrics='rmse', seed=50, stratified=False)

I get a dictionary like this one, with 6 different values per key:

{'rmse-mean': [635.2078190031074,
  632.0847253839236,
  629.6661071275558,
  627.9721515847672,
  626.6712284533291,
  625.293530527769],
 'rmse-stdv': [197.5088741303537,
  198.66960690389863,
  199.56134068525006,
  200.25929541235243,
  200.8251430042537,
  201.50213772830526]}

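At first I thought the values in this dictionary corresponded to the RMSE of each fold (5 in this case), but that does not seem to be the case. The dictionary looks like a decreasing sequence of RMSE values.

Does anyone know what each value corresponds to?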

【Comments】:

    Tags: python machine-learning regression cross-validation lightgbm


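    【Answer 1】:

    The values do not correspond to folds but to the CV result of each boosting round (the mean of the RMSE over all test folds). You can see this very clearly if we run only 5 rounds and print the result of each round: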

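    import lightgbm as lgb
    # Note: load_boston was removed in scikit-learn 1.2, so this needs an older release.
    from sklearn.datasets import load_boston

    N_FOLDS = 5  # same value as in the question

    X, y = load_boston(return_X_y=True)
    train_set = lgb.Dataset(X, label=y)
    
    params = {'learning_rate': 0.05, 'num_leaves': 4, 'subsample': 0.5}
    
    # Note: LightGBM 4.0 removed verbose_eval and early_stopping_rounds from lgb.cv
    # in favour of callbacks (lgb.log_evaluation, lgb.early_stopping).
    cv_results = lgb.cv(params, train_set, num_boost_round=5, nfold=N_FOLDS, verbose_eval=True,
                        early_stopping_rounds=None, metrics='rmse', seed=50, stratified=False)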
    
    [LightGBM] [Info] Total Bins 1251
    [LightGBM] [Info] Number of data points in the train set: 404, number of used features: 13
    [LightGBM] [Info] Start training from score 22.585149
    [LightGBM] [Info] Start training from score 22.109406
    [LightGBM] [Info] Start training from score 22.579703
    [LightGBM] [Info] Start training from score 22.784158
    [LightGBM] [Info] Start training from score 22.599010
    [1] cv_agg's rmse: 8.86903 + 0.88135
    [2] cv_agg's rmse: 8.58355 + 0.860252
    [3] cv_agg's rmse: 8.31477 + 0.842578
    [4] cv_agg's rmse: 8.06201 + 0.82627
    [5] cv_agg's rmse: 7.8268 + 0.800053
    
    import pandas as pd
    pd.DataFrame(cv_results)
    
        rmse-mean   rmse-stdv
    0   8.869030    0.881350
    1   8.583552    0.860252
    2   8.314774    0.842578
    3   8.062014    0.826270
    4   7.826800    0.800053
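
To make the aggregation concrete: each entry of rmse-mean is the mean of the per-fold validation RMSEs at one boosting round, and rmse-stdv is their spread. Below is a minimal sketch of that step in plain Python. The per-fold numbers and the aggregate_round helper are invented for illustration and are not part of LightGBM's API, and the sketch assumes the population standard deviation is used.

```python
from statistics import fmean, pstdev

# Hypothetical validation RMSEs: one inner list per boosting round,
# one value per fold (3 rounds x 5 folds). These numbers are made up.
fold_rmse_per_round = [
    [9.0, 8.0, 10.0, 9.5, 8.5],
    [8.5, 7.5, 9.5, 9.0, 8.0],
    [8.0, 7.0, 9.0, 8.5, 7.5],
]

def aggregate_round(fold_scores):
    """Collapse the per-fold scores of one round into (mean, stdv)."""
    return fmean(fold_scores), pstdev(fold_scores)

cv_like = {"rmse-mean": [], "rmse-stdv": []}
for fold_scores in fold_rmse_per_round:
    mean, stdv = aggregate_round(fold_scores)
    cv_like["rmse-mean"].append(mean)
    cv_like["rmse-stdv"].append(stdv)

# The lists have one entry per boosting round, not one per fold.
print(len(cv_like["rmse-mean"]), cv_like["rmse-mean"])
```

The length of each list therefore tracks the number of boosting rounds that were kept, regardless of nfold.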
    

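    In your post you set early_stopping_rounds = 100 and kept the default learning_rate = 0.1, which may be somewhat high for your data. Most likely the best score was reached at round 6, none of the following 100 rounds improved on it, and lgb.cv returned only the results up to that best round.

    Using the same example as above, if we set early_stopping_rounds = 100, training stops once the metric has not improved for 100 consecutive rounds, and the returned results end at the best round, that is, 100 rounds before the stop: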

    cv_results = lgb.cv(params, train_set, num_boost_round=2000, nfold=N_FOLDS,
                        verbose_eval=True, early_stopping_rounds=100, metrics='rmse',
                        seed=50, stratified=False)
    
    [...]
    [1475]  cv_agg's rmse: 3.20605 + 0.50213
    [1476]  cv_agg's rmse: 3.20616 + 0.501997
    [1477]  cv_agg's rmse: 3.20607 + 0.501998
    [1478]  cv_agg's rmse: 3.20636 + 0.501865
    [1479]  cv_agg's rmse: 3.20631 + 0.501905
    [1480]  cv_agg's rmse: 3.20633 + 0.501731
    [1481]  cv_agg's rmse: 3.20659 + 0.501494
    [1482]  cv_agg's rmse: 3.2068 + 0.502046
    [1483]  cv_agg's rmse: 3.20687 + 0.50213
    [1484]  cv_agg's rmse: 3.20701 + 0.502265
    [1485]  cv_agg's rmse: 3.20717 + 0.502096
    [1486]  cv_agg's rmse: 3.2072 + 0.501779
    [1487]  cv_agg's rmse: 3.20722 + 0.501613
    [1488]  cv_agg's rmse: 3.20718 + 0.501308
    [1489]  cv_agg's rmse: 3.20701 + 0.501232
    
    pd.DataFrame(cv_results).shape
    (1389, 2)
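
The gap between the last logged round (1489) and the number of returned rows (1389) can be reproduced with a small simulation. This is a sketch of patience-based early stopping in plain Python, not LightGBM's actual implementation. The run_with_early_stopping helper and the synthetic score sequence are hypothetical.

```python
def run_with_early_stopping(scores, patience):
    """Walk through per-round scores (lower is better) and stop once
    `patience` rounds have passed without beating the best score so far.

    Returns (rounds_run, kept_scores), where kept_scores is the history
    truncated at the best round.
    """
    best_score = float("inf")
    best_round = 0  # 1-based index of the best round seen so far
    rounds_run = 0
    for round_no, score in enumerate(scores, start=1):
        rounds_run = round_no
        if score < best_score:
            best_score, best_round = score, round_no
        elif round_no - best_round >= patience:
            break
    return rounds_run, scores[:best_round]

# Synthetic scores: improve for 1389 rounds, then get slowly worse.
improving = [10.0 - 0.001 * i for i in range(1389)]
worsening = [improving[-1] + 0.0001 * (i + 1) for i in range(611)]
scores = improving + worsening  # 2000 rounds available

rounds_run, kept = run_with_early_stopping(scores, patience=100)
print(rounds_run, len(kept))
```

With these synthetic scores the loop runs 1489 rounds and keeps 1389, the same pattern as in the log above.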
    

    If you want an estimate of the model's RMSE, take the last value. With early stopping, that is the score at the best round.
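
Applied to the dictionary from the question, reading off that last value could look like the sketch below. The final_cv_score helper is hypothetical. It searches for a key ending in rmse-mean rather than hard-coding the name, on the assumption that some LightGBM releases prefix the key (for example "valid rmse-mean").

```python
# Values copied from the question's output.
cv_results = {
    "rmse-mean": [635.2078190031074, 632.0847253839236, 629.6661071275558,
                  627.9721515847672, 626.6712284533291, 625.293530527769],
    "rmse-stdv": [197.5088741303537, 198.66960690389863, 199.56134068525006,
                  200.25929541235243, 200.8251430042537, 201.50213772830526],
}

def final_cv_score(results, metric="rmse"):
    """Return (n_rounds, mean, stdv) for the last recorded round."""
    mean_key = next(k for k in results if k.endswith(f"{metric}-mean"))
    stdv_key = next(k for k in results if k.endswith(f"{metric}-stdv"))
    return len(results[mean_key]), results[mean_key][-1], results[stdv_key][-1]

n_rounds, rmse_mean, rmse_stdv = final_cv_score(cv_results)
print(f"{n_rounds} rounds kept, RMSE {rmse_mean:.2f} +/- {rmse_stdv:.2f}")
```

The number of rounds kept is also a common starting point for the number of boosting rounds when refitting on the full training set.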

    【Comments】:
