Random Forest Model Explained

In this post, we will explain what a Random Forest model is, explore its strengths, see how it is built, and learn what it can be used for.

We will go through the theory and intuition of Random Forest, seeing the minimum amount of maths necessary to understand how everything works, without diving into the most complex details.

Let’s get to it!

1. Introduction

In the Machine Learning world, Random Forest models are a kind of non-parametric models that can be used both for regression and classification. They are one of the most popular ensemble methods, belonging to the specific category of Bagging methods.

Ensemble methods involve using many learners to enhance the performance of any single one of them individually. These methods can be described as techniques that use a group of weak learners (those that on average achieve only slightly better results than a random model) together, in order to create a stronger, aggregated one.

In our case, Random Forests are an ensemble of many individual Decision Trees. If you are not familiar with Decision Trees, you can learn all about them here:

One of the main drawbacks of Decision Trees is that they are very prone to over-fitting: they do well on training data, but are not so flexible for making predictions on unseen samples. While there are workarounds for this, like pruning the trees, this reduces their predictive power. Generally, they are models with medium bias and high variance, but they are simple and easy to interpret.

If you are not very confident with the difference between bias and variance, check out the following post:

Random Forest models combine the simplicity of Decision Trees with the flexibility and power of an ensemble model. In a forest of trees, we forget about the high variance of a specific tree, and are less concerned about each individual element, so we can grow nicer, larger trees that have more predictive power than a pruned one.

Although Random Forest models don’t offer as much interpretability as a single tree, their performance is a lot better, and we don’t have to worry so much about perfectly tuning the parameters of the forest as we do with individual trees.

Okay, I get it, a Random Forest is a collection of individual trees. But why the name Random? Where is the Randomness? Let’s find out by learning how a Random Forest model is built.

2. Training and Building a Random Forest

Building a Random Forest has 3 main phases. We will break down each of them and clarify each of the concepts and steps. Let’s go!

2.1 Creating a Bootstrapped Data Set for each tree

When we build an individual decision tree, we use a training data set and all of the observations. This means that if we are not careful, the tree can adjust very well to this training data, and generalise badly to new, unseen observations. To solve this issue, we stop the tree from growing very large, usually at the cost of reducing its performance.

To build a Random Forest we have to train N decision trees. Do we train the trees using the same data all the time? Do we use the whole data set? Nope.

This is where the first random feature comes in. To train each individual tree, we pick a random sample of the entire Data set, like shown in the following figure.

[Figure: a random bootstrapped data set is sampled for each tree. Icons by Flaticon.]

Various things can be deduced from this figure. First of all, the size of the data used to train each individual tree does not have to be the size of the whole data set. Also, a data point can be present more than once in the data used to train a single tree (like in tree number two).

This is called Sampling with Replacement or Bootstrapping: each data point is picked randomly from the whole data set, and a data point can be picked more than once.
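A minimal sketch of this sampling with replacement, in pure Python with a made-up toy data set:

```python
import random

random.seed(42)  # for reproducibility

# Toy data set of 6 observations (hypothetical values)
data = ["obs1", "obs2", "obs3", "obs4", "obs5", "obs6"]

def bootstrap_sample(data):
    """Draw a sample the same size as the data, with replacement:
    each pick is independent, so points can repeat or be left out."""
    return [random.choice(data) for _ in range(len(data))]

sample = bootstrap_sample(data)
print(sample)  # some observations may appear twice, others not at all
```

In a real forest we would draw one such sample per tree.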

By using different samples of data to train each individual tree we reduce one of the main problems that they have: they are very fond of their training data. If we train a forest with a lot of trees and each of them has been trained with different data, we solve this problem. They are all very fond of their training data, but the forest is not fond of any specific data point. This allows us to grow larger individual trees, as we do not care so much anymore for an individual tree overfitting.

If we use a very small portion of the whole data set to train each individual tree, we increase the randomness of the forest (reducing over-fitting) but usually at the cost of a lower performance.

In practice, by default most Random Forest implementations (like the one from Scikit-Learn) pick the sample of the training data used for each tree to be the same size as the original data set (however it is not the same data set, remember that we are picking random samples).

This generally provides a good bias-variance compromise.
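In Scikit-Learn these defaults are exposed as constructor parameters; a quick sketch (assuming scikit-learn is installed, with a tiny synthetic data set just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic data set, purely illustrative
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# bootstrap=True with max_samples=None means each tree is trained on a
# bootstrapped sample the same size as the original data set
clf = RandomForestClassifier(n_estimators=100, bootstrap=True,
                             max_samples=None, random_state=0)
clf.fit(X, y)
print(len(clf.estimators_))  # 100 individual trees
```

Lowering `max_samples` (e.g. to 0.5) trains each tree on a smaller random portion, trading performance for extra randomness as described above.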

2.2 Train a forest of trees using these random data sets, and add a little more randomness with the feature selection

If you remember well, when building an individual decision tree, at each node we evaluated a certain metric (like the Gini index or Information Gain) and picked the feature or variable that minimised/maximised this metric to go in the node.

This worked decently well when training only one tree, but now we want a whole forest of them! How do we do it? Ensemble models like Random Forest work best if the individual models (individual trees in our case) are uncorrelated. In Random Forest this is achieved by randomly selecting certain features to evaluate at each node.

[Figure: at each node, only a random subset of the features is evaluated for the split. Icons by Flaticon.]

As you can see from the previous image, at each node we evaluate only a subset of all the initial features. For the root node we take into account E, A and F (and F wins). In Node 1 we consider C, G and D (and G wins). Lastly, in Node 2 we consider only A, B, and G (and A wins). We would carry on doing this until we built the whole tree.

By doing this, we avoid including features that have a very high predictive power in every tree, while creating many uncorrelated trees. This is the second sweep of randomness. We do not only use random data, but also random features when building each tree. The greater the tree diversity, the better: we reduce the variance, and get a better performing model.
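The per-node feature selection can be sketched in plain Python (the feature names and subset size here are made up; in many implementations the default subset size is around the square root of the number of features):

```python
import random

random.seed(1)  # for reproducibility

# Hypothetical feature names, as in the example above
features = ["A", "B", "C", "D", "E", "F", "G"]

def candidate_features(all_features, k=3):
    """Pick a random subset of k features to evaluate at a single node."""
    return random.sample(all_features, k)

# Each node of each tree gets its own random subset to compete in
subset = candidate_features(features)
print(subset)
```

The split metric (Gini index, Information Gain) is then evaluated only over this subset, not over all seven features.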

2.3 Repeat this for the N trees to create our awesome forest.

Awesome, we have learned how to build a single decision tree. Now, we would repeat this for the N trees, randomly selecting at each node of each tree which variables enter the contest to be picked as the feature to split on.

In conclusion, the whole process goes as follows:

  1. Create a bootstrapped data set for each tree.
  2. Create a decision tree using its corresponding data set, but at each node use a random sub-sample of variables or features to split on.
  3. Repeat the previous two steps hundreds of times to build a massive forest with a wide variety of trees. This variety is what makes a Random Forest way better than a single decision tree.
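Putting the three steps together, the training loop can be sketched in Python. Note that `train_decision_tree` here is a hypothetical stand-in for a real tree-growing routine:

```python
import random

def bootstrap_sample(data):
    """Step 1: sample with replacement, same size as the original data."""
    return [random.choice(data) for _ in range(len(data))]

def train_decision_tree(sample, n_features_per_node):
    """Hypothetical stand-in: a real implementation would grow a tree,
    evaluating a random subset of n_features_per_node features at
    every node (step 2)."""
    return {"data": sample, "m": n_features_per_node}

def build_forest(data, n_trees=100, n_features_per_node=3):
    """Step 3: repeat steps 1 and 2 for all N trees."""
    return [train_decision_tree(bootstrap_sample(data), n_features_per_node)
            for _ in range(n_trees)]

forest = build_forest(list(range(20)), n_trees=100)
print(len(forest))  # 100 trees, each trained on its own bootstrapped sample
```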

Once we have built our forest, we are ready to use it to make awesome predictions. Let’s see how!

3. Making predictions using a Random Forest

Making predictions with a Random Forest is very easy. We just have to take each of our individual trees, pass the observation for which we want to make a prediction through them, get a prediction from every tree (N predictions in total) and then obtain an overall, aggregated prediction.

Bootstrapping the data and then using an aggregate to make a prediction is called Bagging, and how this prediction is made depends on the kind of problem we are facing.

For regression problems, the aggregate decision is the average of the decisions of every single decision tree. For classification problems, the final prediction is the most frequent prediction done by the forest.
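Both aggregation rules fit in a few lines of standard-library Python; the prediction lists below are made-up examples mirroring the figures that follow:

```python
from collections import Counter

# Hypothetical predictions from N = 400 individual trees (classification)
classification_preds = ["healthy"] * 355 + ["sick"] * 45

# Hypothetical numeric predictions from 4 trees (regression)
regression_preds = [310_000, 325_000, 333_000, 322_000]

# Classification: majority vote (the most frequent prediction)
final_class = Counter(classification_preds).most_common(1)[0][0]

# Regression: the average of all the trees' predictions
final_value = sum(regression_preds) / len(regression_preds)

print(final_class)  # healthy
print(final_value)  # 322500.0
```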

[Figure: aggregating the predictions of N trees for a classification problem and a regression problem. Icons by Flaticon.]

The previous image illustrates this very simple procedure. For the classification problem we want to predict if a certain patient is sick or healthy. For this, we pass the patient’s medical record and other information through each tree of the random forest, and obtain N predictions (400 in our case). In our example, 355 of the trees say that the patient is healthy and 45 say that the patient is sick, therefore the forest decides that the patient is healthy.

For the regression problem we want to predict the price of a certain house. We pass the characteristics of this new house through our N trees, getting a numerical prediction from each of them. Then, we calculate the average of these predictions and get the final value of $322,750.

Simple right? We make a prediction with every individual tree and then aggregate these predictions using the mean (average) or the mode (most frequent value).

4. Conclusion and other resources

In this post we have seen what a Random Forest is, how it overcomes the main issues of Decision Trees, how they are trained, and used to make predictions. They are very flexible and powerful Machine Learning models that are highly used in commercial and industrial applications, along with Boosting models and Artificial Neural Networks.

In future posts we will explore tips and tricks of Random Forests and how they can be used for feature selection. Also, if you want to see precisely how they are built, check out the following video by StatQuest, it’s great:

That is it! As always, I hope you enjoyed the post. If you did, feel free to follow me on Twitter at @jaimezorno. Also, you can take a look at my other posts on Data Science and Machine Learning here, and subscribe to my newsletter to get awesome goodies and notifications on new posts!

For further resources on Machine Learning and Data Science check out the following repository: How to Learn Machine Learning!

Translated from: https://towardsdatascience.com/random-forest-explained-7eae084f3ebe
