XGBoost: xgb.importance feature map - Answers

【Question title】: XGBoost: xgb.importance feature map
【Posted】: 2016-05-24 10:50:26
【Question description】:

I get the error below when I run the following code.

******Code******

    import operator

    importance = bst.get_fscore(fmap='xgb.fmap')
    importance = sorted(importance.items(), key=operator.itemgetter(1))

******Error******

  File "scripts/xgboost_bnp.py", line 225, in <module>
    importance = bst.get_fscore(fmap='xgb.fmap')
  File "/usr/lib/python2.7/site-packages/xgboost/core.py", line 754, in get_fscore
    trees = self.get_dump(fmap)
  File "/usr/lib/python2.7/site-packages/xgboost/core.py", line 740, in get_dump
    ctypes.byref(sarr)))
  File "/usr/lib/python2.7/site-packages/xgboost/core.py", line 92, in _check_call
    raise XGBoostError(_LIB.XGBGetLastError())
xgboost.core.XGBoostError: can not open file "xgb.fmap"

【Question discussion】:

Tags: feature-selection xgboost

【Solution 1】:

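The error is raised because you call get_fscore with the optional fmap argument, which tells XGBoost to read the feature names for the importance scores from a feature map file called xgb.fmap, and no such file exists on your file system.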
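
If you do want to keep the fmap argument, the file has to exist first. A feature map is a plain-text file with one line per feature: the zero-based feature index, the feature name, and a type flag (q for quantitative, i for binary indicator, int for integer), separated by whitespace. Below is a minimal sketch of writing one. The helper name write_fmap and the example feature names are hypothetical, and every feature is assumed to be quantitative.

```python
def write_fmap(path, feature_names, feature_type="q"):
    """Write one '<index> <name> <type>' line per feature, tab-separated."""
    with open(path, "w") as f:
        for idx, name in enumerate(feature_names):
            # Fields are whitespace-delimited, so spaces inside a name
            # would split it into extra fields; replace them.
            safe_name = name.replace(" ", "_")
            f.write("{}\t{}\t{}\n".format(idx, safe_name, feature_type))


# Hypothetical feature names, listed in the same column order as the training data.
write_fmap("xgb.fmap", ["age", "income", "num_claims"])
```

With the file in the working directory, the call bst.get_fscore(fmap='xgb.fmap') from the question should be able to open it.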

Here is a function that returns the feature names and their importances, sorted by importance:

import xgboost as xgb
import pandas as pd

def get_xgb_feat_importances(clf):

    if isinstance(clf, xgb.XGBModel):
        # clf was created by calling
        # xgb.XGBClassifier().fit() or xgb.XGBRegressor().fit().
        # Older XGBoost releases exposed this accessor as clf.booster().
        fscore = clf.get_booster().get_fscore()
    else:
        # clf was created by calling xgb.train(),
        # so it is an instance of xgb.Booster.
        fscore = clf.get_fscore()

    # One row per feature that was used in at least one split.
    feat_importances = pd.DataFrame(
        list(fscore.items()), columns=['Feature', 'Importance'])
    feat_importances = feat_importances.sort_values(
        by='Importance', ascending=False).reset_index(drop=True)
    # Divide the importances by the sum of all importances
    # to get relative importances, so that
    # feat_importances['Importance'].sum() == 1
    feat_importances['Importance'] /= feat_importances['Importance'].sum()
    # Print the most important features and their importances
    print(feat_importances.head())
    return feat_importances

【Discussion】:

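Thanks for your answer, but this solution does not show the original feature names; it only returns fxx for each feature. Do you know how to map the real feature names to the importance scores?
I guess your training data is stored in a NumPy array? Try training the model with a Pandas DataFrame (with appropriate feature names set as the column names) and run the function above again. If I remember correctly, XGBoost picks up the feature names from the column names of a Pandas DataFrame.
Alternatively, if you define your training data via xgboost.DMatrix(), you can specify the feature names through its feature_names parameter.
Thanks again, you are right: I had not set the feature_names parameter in xgboost.DMatrix(). Your solution works well. I changed it to write its output to a file so I can review the feature importances after training, for my feature selection.
@tuomastik You are right: if your training data is a pd.DataFrame, XGBoost will extract the feature names.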
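
If the model has already been trained on a plain NumPy array, the keys returned by get_fscore() have the form f0, f1, ... and can be mapped back to column names after the fact. A minimal sketch, assuming the names are listed in the same order as the training columns; the helper rename_fscore and the sample scores are hypothetical and not part of XGBoost.

```python
def rename_fscore(fscore, feature_names):
    """Map default keys such as 'f2' to feature_names[2]; keep other keys unchanged."""
    renamed = {}
    for key, score in fscore.items():
        index = key[1:]
        if key.startswith("f") and index.isdigit() and int(index) < len(feature_names):
            renamed[feature_names[int(index)]] = score
        else:
            # Key is already a real name (or out of range): leave it alone.
            renamed[key] = score
    return renamed


# Hypothetical scores in the shape get_fscore() returns for an unnamed model.
fscore = {"f0": 12, "f2": 5}
named = rename_fscore(fscore, ["age", "income", "num_claims"])
print(sorted(named.items(), key=lambda kv: kv[1], reverse=True))
# prints [('age', 12), ('num_claims', 5)]
```

Features that were never used in a split do not appear in get_fscore() at all, so the result can have fewer entries than feature_names.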