【Question title】: What is the scale of the leaf values in a CatBoostRegressor tree?
【Posted】: 2022-04-19 04:31:47
【Question】:

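The puzzle

I can't interpret the values in the leaves of a CatBoostRegressor tree. The fitted model correctly captures the logic of the dataset, but when I plot a tree, the scale of the leaf values doesn't match the scale of the actual dataset.

In this example, we predict size, which takes values of roughly 15-30 depending on the color and age of the observation.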

import random
import pandas as pd
import numpy as np
from catboost import Pool, CatBoostRegressor

# Create a fake dataset.
n = 1000
random.seed(1)
df = pd.DataFrame([[random.choice(['red', 'blue', 'green', 'yellow']),
                    random.random() * 100]
                   for i in range(n)],
                  columns=['color', 'age'])
df['size'] = np.select([np.logical_and(np.logical_or(df.color == 'red',
                                                     df.color == 'blue'),
                                       df.age < 50),
                        np.logical_or(df.color == 'red',
                                      df.color == 'blue'),
                        df.age < 50,
                        True],
                       [np.random.normal(loc=15, size=n),
                        np.random.normal(loc=20, size=n),
                        np.random.normal(loc=25, size=n),
                        np.random.normal(loc=30, size=n)])

# Fit a CatBoost regressor to the dataset.
pool = Pool(df[['color', 'age']], df['size'],
            feature_names=['color', 'age'], cat_features=[0])
m = CatBoostRegressor(n_estimators=10, max_depth=3, one_hot_max_size=4,
                      random_seed=1)
m.fit(pool)

# Visualize the first regression tree (saves to a pdf).  Values in leaf nodes
# are not on the scale of the original dataset.
m.plot_tree(tree_idx=0, pool=pool).render('regression_tree')

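The model splits age at the right value (about 50), and it correctly learns that red and blue observations differ from green and yellow ones. The values in the leaves are ordered correctly (for example, red/blue observations under 50 have the smallest value), but the scale is completely different.

The predict() function returns values on the scale of the original dataset.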

>>> df['predicted'] = m.predict(df)
>>> df.sample(n=10)
      color        age       size  predicted
676  yellow  66.305095  30.113389  30.065519
918  yellow  55.209821  29.944622  29.464825
705  yellow   1.742565  24.209283  24.913988
268    blue  76.749979  20.513211  20.019020
416    blue  59.807800  18.807197  19.949336
326     red   4.621795  14.748898  14.937314
609  yellow  99.165027  28.942243  29.823422
421   green  40.731038  26.078450  24.846742
363  yellow   2.461971  25.506517  24.913988
664     red   5.206448  16.579706  14.937314

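What I've tried

I wondered whether some simple standardization was going on, but that is apparently not the case. For example, the standardized value of a size of 15 is: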

>>> (15 - np.mean(df['size'])) / np.std(df['size'])
-1.3476124913754326

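This post asks a similar question about XGBoost. The accepted answer explains that the leaf values should all be added to the base_score parameter; however, if CatBoost has a similar parameter, I can't find it. (If the parameter goes by a different name in CatBoost, I don't know what it is called.) Moreover, the values in the CatBoost tree don't simply differ from the original dataset by a constant: the difference between the largest and smallest leaf values is about 7, while the difference between the largest and smallest values of size in the original dataset is about 15.

I looked through the CatBoost documentation without success. The "Model values" section says that the values for regression are "a numeric resulting from applying the model", which suggests they should be on the scale of the original dataset. (That is true of the output of predict(), so it's not clear to me whether this section applies to the plotted decision trees at all.)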

【Comments】:

Tags: python decision-tree catboost catboostregressor

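【Solution 1】:

Look up the function get_scale_and_bias, which returns the scale and bias of the model.

These values affect the result of applying the model, because the model prediction is computed as: $\left(\sum \text{leaf\_values}\right) \cdot \text{scale} + \text{bias}$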
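To make the formula concrete, here is a minimal sketch in plain Python. The leaf values, scale, and bias are made-up numbers for illustration (they are not taken from the model in the question), and apply_scale_and_bias is a hypothetical helper, not a CatBoost function. The point is that the raw leaf values from all trees are summed first, and the bias is added once to that sum rather than once per tree.

```python
# Hypothetical raw leaf values selected for ONE observation, one per tree.
# These numbers are invented for illustration only.
leaf_values_per_tree = [-3.5, -1.8, -0.9, -0.4]

# Hypothetical stand-ins for the pair returned by get_scale_and_bias().
scale, bias = 1.0, 22.5


def apply_scale_and_bias(leaf_values, scale, bias):
    """Sum the raw leaf values over all trees, then rescale and shift once."""
    return sum(leaf_values) * scale + bias


prediction = apply_scale_and_bias(leaf_values_per_tree, scale, bias)
print(prediction)
```

With a scale of 1, the raw leaf values act as offsets from the bias, which is why a single tree's leaves can look far smaller than the target values.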

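Applied to the example in the question

Here is a slightly different model fit to the same dataset (using the same code as above).

To convert the leaf values to the scale of the original data, use the scale and bias returned by get_scale_and_bias(). I extracted the leaves with _get_tree_leaf_values(); this function returns string representations of the leaves, so we have to do a bit of regex parsing to get the actual values. I also hand-coded the expected value of each leaf based on the data-generating process above.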

# Get the scale and bias from the model.
sb = m.get_scale_and_bias()

# Apply the scale and bias to the leaves of the tree; compare to expected
# values for each leaf.
import re
[{'expected': [15, 25, 25, None, 20, 30, 30, None][i],
  'actual': (float(re.sub(r'^val = (-?[0-9]+([.][0-9]+)?).*$', '\\1', leaf))
             * sb[0]) + sb[1]}
 for i, leaf in enumerate(m._get_tree_leaf_values(0))]

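We see that the predicted values are not perfect, but they are at least in the right range. Note that tree 0 is only the first of the model's 10 trees: a prediction sums one leaf from every tree before the scale and bias are applied, so a single tree's rescaled leaves are not expected to equal the group means.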

[{'expected': 15, 'actual': 19.210155044555663},
 {'expected': 25, 'actual': 24.067155044555665},
 {'expected': 25, 'actual': 24.096155044555665},
 {'expected': None, 'actual': 22.624155044555664},
 {'expected': 20, 'actual': 21.309155044555663},
 {'expected': 30, 'actual': 26.244155044555665},
 {'expected': 30, 'actual': 26.249155044555664},
 {'expected': None, 'actual': 22.624155044555664}]
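One plausible reading of why the first tree's leaves sit between the bias and the expected group means is shrinkage by the learning rate. The sketch below is a toy boosting loop on the four group means, not CatBoost itself: the learning rate of 0.5 is an assumed value chosen for illustration, the groups are treated as equally sized, and regularization and CatBoost's actual tree-building procedure are ignored.

```python
# Toy illustration (NOT CatBoost): gradient boosting on four group means.
group_means = {
    'red/blue, age < 50': 15.0,
    'red/blue, age >= 50': 20.0,
    'green/yellow, age < 50': 25.0,
    'green/yellow, age >= 50': 30.0,
}

# Starting prediction: the mean of the group means (22.5 here).
bias = sum(group_means.values()) / len(group_means)
learning_rate = 0.5  # ASSUMED value, for illustration only
n_trees = 10

# Every group starts at the bias; each "tree" adds a shrunken residual.
prediction = {group: bias for group in group_means}
first_tree_rescaled = {}

for tree_idx in range(n_trees):
    for group, target in group_means.items():
        # One leaf per group, holding a fraction of the remaining residual.
        leaf = learning_rate * (target - prediction[group])
        if tree_idx == 0:
            # The first tree's leaf after adding the bias (scale taken as 1).
            first_tree_rescaled[group] = leaf + bias
        prediction[group] += leaf

print(first_tree_rescaled)  # first-tree leaves, pulled toward the bias
print(prediction)           # after all trees, close to the group means
```

In this toy setup the first tree's rescaled leaves span only half the range of the group means, while the sum over all ten trees recovers them closely. That is the same qualitative pattern as in the output above, though the real model's learning rate and leaf estimates may differ.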

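【Comments】:

Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.
This is exactly the reference I needed; thank you! I've edited the answer to add an example of how to use this function on the specific model described in the question.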