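【Posted】:2020-01-31 17:35:39
【Question】:
I am using the xgboost library to train a binary classifier. I want to prevent data leakage from the training algorithm by adding noise to the weights (i.e. the values at the leaf nodes of the trees in the ensemble). To do this, I need to retrieve the weights of each tree and modify them.
I can view the weights by calling dump_model or trees_to_dataframe on the Booster object, which I define as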
model = xgb.Booster(params, [dtrain])
The latter method returns a Pandas dataframe:
    Tree  Node    ID  Feature                          Split   Yes    No  Missing        Gain     Cover
 0     0     0   0-0  tenure                            17.0   0-1   0-2      0-1  671.161072  1595.500
 1     0     1   0-1  InternetService_Fiber optic        1.0   0-3   0-4      0-3  343.489227   621.125
 2     0     2   0-2  InternetService_Fiber optic        1.0   0-5   0-6      0-5  293.603149   974.375
 3     0     3   0-3  tenure                             4.0   0-7   0-8      0-7   95.604340   333.750
 4     0     4   0-4  TotalCharges                     120.0   0-9  0-10      0-9   27.897919   287.375
 5     0     5   0-5  Contract_Two year                  1.0  0-11  0-12     0-11   32.057739   512.625
 6     0     6   0-6  tenure                            60.0  0-13  0-14     0-13  120.693176   461.750
 7     0     7   0-7  TechSupport_No internet service    1.0  0-15  0-16     0-15   37.326447   149.750
 8     0     8   0-8  TechSupport_No internet service    1.0  0-17  0-18     0-17   34.968536   184.000
 9     0     9   0-9  TechSupport_Yes                    1.0  0-19  0-20     0-19    0.766754    65.500
10     0    10  0-10  MultipleLines_Yes                  1.0  0-21  0-22     0-21   19.335510   221.875
11     0    11  0-11  PhoneService_Yes                   1.0  0-23  0-24     0-23   19.035950   281.125
12     0    12  0-12  Leaf                               NaN   NaN   NaN      NaN   -0.191398   231.500
13     0    13  0-13  PaymentMethod_Electronic check     1.0  0-25  0-26     0-25   43.379410   320.875
14     0    14  0-14  Contract_Two year                  1.0  0-27  0-28     0-27   13.401367   140.875
15     0    15  0-15  Leaf                               NaN   NaN   NaN      NaN    0.050262    94.500
16     0    16  0-16  Leaf                               NaN   NaN   NaN      NaN   -0.052444    55.250
17     0    17  0-17  Leaf                               NaN   NaN   NaN      NaN   -0.058929   111.000
18     0    18  0-18  Leaf                               NaN   NaN   NaN      NaN   -0.148649    73.000
19     0    19  0-19  Leaf                               NaN   NaN   NaN      NaN    0.161464    63.875
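The leaf values are stored in the Gain column (leaf nodes are the rows whose Feature column has the value Leaf). I can therefore add noise to the corresponding rows of the Gain column, but I do not know how to convert the Pandas dataframe back into a Booster object/XGBoost model. How can I achieve this? Or is there a better way to retrieve and modify the leaf values of an XGBoost model?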
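One possible route that avoids the dataframe round trip, sketched here under assumptions rather than taken from the question: XGBoost versions with JSON model support (1.0 and later) can write the model with `save_model("model.json")` and read it back with `load_model`. The sketch assumes the default `gbtree` booster and the JSON layout in which each tree holds parallel arrays `left_children` and `split_conditions`, a leaf is marked by `left_children == -1`, and the leaf value sits in `split_conditions`. The helper name `add_noise_to_leaves`, the file names, and the toy model are hypothetical, and Gaussian noise is used only as an illustration.

```python
import json
import random


def add_noise_to_leaves(model, scale, rng):
    """Add Gaussian noise to each leaf value of a parsed XGBoost JSON model.

    Assumes the gbtree JSON layout: a node is a leaf when its entry in
    `left_children` is -1, and its value is stored in `split_conditions`.
    Returns the number of leaves that were modified (in place).
    """
    trees = model["learner"]["gradient_booster"]["model"]["trees"]
    n_changed = 0
    for tree in trees:
        for node, left in enumerate(tree["left_children"]):
            if left == -1:  # leaf node
                tree["split_conditions"][node] += rng.gauss(0.0, scale)
                n_changed += 1
    return n_changed


# Hand-written stand-in for a parsed model file: one tree, a root and two leaves.
toy_model = {
    "learner": {
        "gradient_booster": {
            "model": {
                "trees": [
                    {
                        "left_children": [1, -1, -1],
                        "right_children": [2, -1, -1],
                        "split_conditions": [17.0, -0.191398, 0.050262],
                    }
                ]
            }
        }
    }
}

changed = add_noise_to_leaves(toy_model, scale=0.01, rng=random.Random(0))

# With a trained Booster called `booster`, the round trip would look like:
#   booster.save_model("model.json")
#   with open("model.json") as f:
#       model = json.load(f)
#   add_noise_to_leaves(model, scale=0.01, rng=random.Random())
#   with open("model_noisy.json", "w") as f:
#       json.dump(model, f)
#   booster.load_model("model_noisy.json")
```

Only `split_conditions` is touched in this sketch; whether other per-node fields should be kept consistent with the new leaf values is worth checking against the model schema of the installed XGBoost version.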
【Discussion】: