Feature Importance (feature_importance)

There are indeed several ways to get feature "importances". As often, there is no strict consensus about what this word means.

In scikit-learn, we implement the importance as described in [1] (often cited, but unfortunately rarely read...). It is sometimes called "gini importance" or "mean decrease impurity" and is defined as the total decrease in node impurity (weighted by the probability of reaching that node (which is approximated by the proportion of samples reaching that node)) averaged over all trees of the ensemble.

In the literature or in some other packages, you can also find feature importances implemented as the "mean decrease accuracy".

Basically, the idea is to measure the decrease in accuracy on OOB data when you randomly permute the values for that feature. If the decrease is low, then the feature is not important, and vice-versa.

(Note that both algorithms are available in the randomForest R package.)

[1]: Breiman, Friedman, "Classification and regression trees", 1984.

In fact, as the sklearn author notes above, there is no single strict definition of feature importance.

There are two common ways to implement feature importance:
  (1) mean decrease in node impurity

Feature importance is calculated by looking at the splits of each tree.
The importance of the splitting variable is proportional to the improvement in the Gini index given by that split, and it is accumulated (for each variable) over all the trees in the forest.

       That is, for every tree we compute each splitting feature's improvement in the splitting criterion (gini or entropy), then aggregate these gains over all trees to obtain the feature weights.

  (2) mean decrease in accuracy:

 This method, proposed in the original paper, passes the OOB samples down the tree and records the prediction accuracy.
A variable is then selected and its values in the OOB samples are randomly permuted. The OOB samples are passed down the tree again and the accuracy is recomputed.
The decrease in accuracy caused by this permutation, averaged over all trees, gives the importance of that variable (the higher the decrease, the higher the importance).

    Put simply, if a feature is very important, then even a small perturbation of its values will noticeably degrade the model's predictions.

             Rather than generating new data, we can simply shuffle that feature's column in the OOB set and evaluate again: the accuracy before shuffling minus the accuracy after shuffling is the feature's importance. Note that the model is only re-evaluated, not re-trained. This method is also known as permutation importance.
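The idea can be sketched as follows. This is a minimal illustration that uses a held-out test split instead of per-tree OOB samples (sklearn does not expose OOB indices directly; `sklearn.inspection.permutation_importance` is the built-in API for this):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
base_acc = rf.score(X_te, y_te)

rng = np.random.default_rng(0)
drops = []
for j in range(X_te.shape[1]):
    X_perm = X_te.copy()
    rng.shuffle(X_perm[:, j])   # break the link between feature j and the target
    # accuracy drop after permutation = importance of feature j
    drops.append(base_acc - rf.score(X_perm, y_te))

for j, d in enumerate(drops):
    print(f"feature {j}: accuracy drop = {d:.3f}")
```

On iris, the petal features (indices 2 and 3) should show the largest drops; repeating the shuffle several times and averaging gives a more stable estimate.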

 

 Taking random forest as an example

 The feature importance attribute helps make the model interpretable. Consider that even the decision tree, a highly interpretable model, becomes very hard for a human to explain once the tree grows large.

 A random forest typically consists of hundreds of trees, which makes it even harder to interpret. Fortunately, we can identify which features matter most, which helps us explain the model. More importantly, we can drop unimportant features to reduce noise; compared with dimensions produced by PCA, the result remains human-interpretable.

sklearn implements the first method, mean decrease in node impurity. Let's see how it works on the iris dataset.

  First, how is the importance computed for a single decision tree?

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data
y = iris.target
dt = DecisionTreeClassifier(criterion='entropy', max_leaf_nodes=3)
dt.fit(X, y)

  Use Graphviz to visualize the resulting decision tree.
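For instance, the fitted tree can be exported with sklearn's `export_graphviz` (a sketch; render the DOT text with the `dot` CLI or the python-graphviz package if installed):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
dt = DecisionTreeClassifier(criterion='entropy', max_leaf_nodes=3)
dt.fit(iris.data, iris.target)

# Export to Graphviz DOT source; render with e.g. `dot -Tpng -o tree.png`
# or with graphviz.Source(dot_src) from the python-graphviz package.
dot_src = export_graphviz(dt, out_file=None,
                          feature_names=iris.feature_names,
                          class_names=iris.target_names, filled=True)
print(dot_src[:200])
```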


  The tree code in sklearn is written in Cython; this part of the source lives in the compute_feature_importances method.
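The computation can be mirrored in pure Python over the fitted tree's public `tree_` arrays (a sketch of the idea, not the actual Cython source):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
dt = DecisionTreeClassifier(criterion='entropy', max_leaf_nodes=3)
dt.fit(iris.data, iris.target)

t = dt.tree_
imp = np.zeros(iris.data.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:          # -1 marks a leaf: no split, no contribution
        continue
    # weighted impurity decrease contributed by this node's split
    imp[t.feature[node]] += (
        t.weighted_n_node_samples[node] * t.impurity[node]
        - t.weighted_n_node_samples[left] * t.impurity[left]
        - t.weighted_n_node_samples[right] * t.impurity[right])
imp /= imp.sum()            # normalize so the importances sum to 1
print(imp)
```

The result agrees with `dt.feature_importances_` (sklearn additionally divides by the total sample weight before normalizing, which cancels out here).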

 From the tree generated above, we compute the weights of features 2 and 3:

         feature 2: 1.585*150 - 0*50 - 1.0*100 = 137.75

    feature 3: 1.0*100 - 0.445*54 - 0.151*46 = 69.024

  After normalization this gives roughly [0, 0, 0.666, 0.334], matching sklearn's output.
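The hand computation can be checked directly against the fitted tree:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
dt = DecisionTreeClassifier(criterion='entropy', max_leaf_nodes=3)
dt.fit(iris.data, iris.target)
print(dt.feature_importances_.round(3))
```

Only petal length and petal width (features 2 and 3) receive nonzero weight, because the sepal features are never used in a split.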

Once we have each tree's feature-importance vector, the ensemble simply averages them. The relevant code:

def feature_importances_(self):
    """Return the feature importances (the higher, the more important the
       feature).

    Returns
    -------
    feature_importances_ : array, shape = [n_features]
    """
    check_is_fitted(self, 'n_outputs_')

    if self.estimators_ is None or len(self.estimators_) == 0:
        raise ValueError("Estimator not fitted, "
                         "call `fit` before `feature_importances_`.")

    all_importances = Parallel(n_jobs=self.n_jobs, backend="threading")(
        delayed(getattr)(tree, 'feature_importances_')
        for tree in self.estimators_)

    return sum(all_importances) / len(self.estimators_)
