Ensemble model in H2O with fold_column argument (answers)

【Question title】: Ensemble model in H2O with fold_column argument
【Posted】: 2018-05-28 19:13:51
【Question】:

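I am new to H2O in Python. I am trying to model my data with an ensemble model, following the example code on the H2O website (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/stacked-ensembles.html).

I used GBM and RF as base models, and then tried to combine them into an ensemble by stacking. In addition, I created an extra column named "fold" in my training data, to be used as fold_column = "fold".

I applied 10-fold CV and observed that I get results for cv1. However, all predictions from the other 9 CVs are empty. What am I missing here?

Here is my sample data:

Code: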

# __future__ imports must come before any other import, or Python raises a SyntaxError
from __future__ import print_function
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
from h2o.grid.grid_search import H2OGridSearch

h2o.init(port=23, nthreads=6)

train = h2o.H2OFrame(ens_df)
test = h2o.H2OFrame(test_ens_eq)

x = train.drop(['Date','EQUITY','fold'],axis=1).columns
y = 'EQUITY'

cat_cols = ['A','B','C','D']
train[cat_cols] = train[cat_cols].asfactor()
test[cat_cols] = test[cat_cols].asfactor()

my_gbm = H2OGradientBoostingEstimator(distribution="gaussian",
                                      ntrees=10,
                                      max_depth=3,
                                      min_rows=2,
                                      learn_rate=0.2,
                                      keep_cross_validation_predictions=True,
                                      seed=1)

my_gbm.train(x=x, y=y, training_frame=train, fold_column = "fold")

Then I check the CV results with:

my_gbm.cross_validation_predictions()

In addition, when I try the ensemble on the test set, I get the following warning:

# Train a stacked ensemble using the GBM and RF above
ensemble = H2OStackedEnsembleEstimator(model_id="mlee_ensemble",
                                       base_models=[my_gbm, my_rf])
ensemble.train(x=x, y=y, training_frame=train)

# Eval ensemble performance on the test data
perf_stack_test = ensemble.model_performance(test)

pred = ensemble.predict(test)
pred

/mgmt/data/conda/envs/python3.6_4.4/lib/python3.6/site-packages/h2o/job.py:69: UserWarning: Test/Validation dataset is missing column 'fold': substituting in a column of NaN
  warnings.warn(w)

Am I missing something about fold_column?

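【Comments】:

Could you revise your example so that it uses a publicly available dataset? stackoverflow.com/help/mcve Also, please show how you are inspecting the CV preds (there is no code here showing what you are doing).
@ErinLeDell I added the line related to the CV preds. Also, while I am putting together a sample dataset, I have a quick question. I noticed that the example code uses cars.kfold_column(n_folds = 5, seed = 1234) to assign random numbers to the fold_column. Instead of assigning random numbers, I want to use my own data (a list, etc.) for fold_column. For example, I tried train['fold'].kfold_column(), but it still assigns random numbers. How can I bring my own data into kfold_column? Or, without using kfold_column, is it enough to just have a "fold" column in the training set?

Tags: python-3.x h2o ensemble-learning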

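【Answer 1】:

Here is an example of how to use a custom fold column (created from a list). It is a modified version of the example Python code on the Stacked Ensemble page of the H2O User Guide.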

# __future__ imports must come before any other import, or Python raises a SyntaxError
from __future__ import print_function
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
from h2o.grid.grid_search import H2OGridSearch
h2o.init()

# Import a sample binary outcome training set into H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")

# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)

# For binary classification, response should be a factor
train[y] = train[y].asfactor()

# Add a fold column, generate from a list
# The list has 10 unique values, so there will be 10 folds
fold_list = list(range(10)) * 1000
train['fold_id'] = h2o.H2OFrame(fold_list)


# Train and cross-validate a GBM
my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli",
                                      ntrees=10,
                                      keep_cross_validation_predictions=True,
                                      seed=1)
my_gbm.train(x=x, y=y, training_frame=train, fold_column="fold_id")

# Train and cross-validate a RF
my_rf = H2ORandomForestEstimator(ntrees=50,
                                 keep_cross_validation_predictions=True,
                                 seed=1)
my_rf.train(x=x, y=y, training_frame=train, fold_column="fold_id")

# Train a stacked ensemble using the GBM and RF above
ensemble = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf])
ensemble.train(x=x, y=y, training_frame=train)
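
The fold list in the example above is hard-coded for a 10,000-row frame. As a minimal sketch, assuming you want the same repeating, non-random assignment for a frame of any size, a small helper can build it in plain Python. The name make_fold_ids is hypothetical and is not part of the H2O API.

```python
from collections import Counter


def make_fold_ids(n_rows, n_folds):
    """Assign fold IDs 0..n_folds-1 to rows in a repeating, deterministic cycle.

    Hypothetical helper for illustration only; it is not an H2O function.
    """
    if n_folds < 2 or n_rows < n_folds:
        raise ValueError("need at least 2 folds and at least one row per fold")
    return [i % n_folds for i in range(n_rows)]


# Same assignment as list(range(10)) * 1000 in the example above
fold_list = make_fold_ids(n_rows=10000, n_folds=10)

# Count how many rows land in each fold
fold_counts = Counter(fold_list)
print(len(fold_list), len(fold_counts))
print(fold_list[:12])
```

In the script above, the row count could be taken from train.nrow, and the resulting list wrapped with h2o.H2OFrame(fold_list) exactly as the example does. Any other list of the right length (for example, fold IDs derived from dates or groups) would be attached the same way.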

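To answer the second question, about how to view the cross-validated predictions in a model: they are stored in two places, but the method you probably want is .cross_validation_holdout_predictions(). It returns a single H2OFrame of the cross-validated predictions, in the original order of the training observations: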

In [11]: my_gbm.cross_validation_holdout_predictions()
Out[11]:
  predict        p0        p1
---------  --------  --------
        1  0.323155  0.676845
        1  0.248131  0.751869
        1  0.288241  0.711759
        1  0.407768  0.592232
        1  0.507294  0.492706
        0  0.6417    0.3583
        1  0.253329  0.746671
        1  0.289916  0.710084
        1  0.524328  0.475672
        1  0.252006  0.747994

[10000 rows x 3 columns]

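The second method, .cross_validation_predictions(), returns a list with one H2OFrame per fold. Each H2OFrame has the same number of rows as the original training frame, but rows that are not active in that fold have a value of zero. Most people don't find this format very useful, so I recommend using the other method instead.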

In [13]: type(my_gbm.cross_validation_predictions())
Out[13]: list

In [14]: len(my_gbm.cross_validation_predictions())
Out[14]: 10

In [15]: my_gbm.cross_validation_predictions()[0]
Out[15]:
  predict        p0        p1
---------  --------  --------
        1  0.323155  0.676845
        0  0         0
        0  0         0
        0  0         0
        0  0         0
        0  0         0
        0  0         0
        0  0         0
        0  0         0
        0  0         0

[10000 rows x 3 columns]
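
How the two formats relate can be sketched in plain Python. This is a toy illustration, not H2O's internal code: the numbers are invented, the helper name combine_fold_predictions is hypothetical, and it assumes that position k in the per-fold list corresponds to fold value k.

```python
def combine_fold_predictions(fold_ids, per_fold_preds):
    """For each row, pick the prediction from the fold in which that row was held out.

    Hypothetical helper for illustration only; it is not an H2O function.
    """
    n_rows = len(fold_ids)
    for preds in per_fold_preds:
        if len(preds) != n_rows:
            raise ValueError("every per-fold vector must have one entry per training row")
    return [per_fold_preds[fold][row] for row, fold in enumerate(fold_ids)]


# Toy data (invented): 6 training rows assigned to 3 folds
fold_ids = [0, 1, 2, 0, 1, 2]

# One full-length vector per fold; rows not held out in that fold are zero
per_fold_preds = [
    [0.68, 0.0, 0.0, 0.41, 0.0, 0.0],
    [0.0, 0.75, 0.0, 0.0, 0.51, 0.0],
    [0.0, 0.0, 0.71, 0.0, 0.0, 0.36],
]

holdout = combine_fold_predictions(fold_ids, per_fold_preds)
print(holdout)  # [0.68, 0.75, 0.71, 0.41, 0.51, 0.36]
```

This also shows why any single element of the per-fold list looks mostly empty: only the rows held out in that fold carry a prediction, which matches what the question describes.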
