Suppose you are a physician with a patient whose complaint could arise from multiple diseases. To attain a specific diagnosis, you might ask yourself a series of yes/no questions depending on observed features describing the patient, such as clinical test results and reported symptoms. As some questions rule out certain diagnoses early on, each answer determines which question you ask next. With about a dozen features and extensive medical knowledge, you could create a simple flow chart to connect and order these questions. If you had observations of thousands of features instead, you would probably want to automate. Machine learning methods can learn which questions to ask about these features to classify the entity they describe.
Even when we lack prior knowledge, a classifier can tell us which features are most important and how they relate to, or interact with, each other. Identifying interactions with large numbers of features poses a special challenge. In PNAS, Basu et al. (1) address this problem with a new classifier based on the widely used random forest technique. The new method, an iterative random forest algorithm (iRF), increases the robustness of random forest classifiers and provides a valuable new way to identify important feature interactions.
Random forests came into the spotlight in 2001 after their description by Breiman (2). He was largely influenced by previous work, especially the similar “randomized trees” method of Amit and Geman (3), as well as Ho’s “random decision forests” (4). Random forests have since proven useful in many fields due to their high predictive accuracy (5, 6). In biology and medicine, random forests have successfully tackled a range of problems, including predicting drug response in cancer cell lines (7), identifying DNA-binding proteins (8), and localizing cancer to particular tissues from a liquid biopsy (9). Random forests have also recognized speech (10, 11) and handwritten digits (12) with high accuracy.
Like their real-world counterparts, random forests consist of trees. Specifically, random forests are ensembles of decision trees. Morgan and Sonquist (13) proposed the decision tree methodology in 1963, formalizing an intuitive approach to simplifying the analysis of multiple features during prediction tasks. We use the decision tree on an input dataset made up of a collection of samples, each described by features (Fig. 1A). Each sample represents an entity, such as a protein, that we want to assign to a class, such as “binds DNA” or “does not bind DNA” (8). The decision tree classifies samples through a forking path of decision points (Fig. 1B). Each decision point has a rule determining which branch to take. As we move down the tree, we stop at each decision point to apply its rule to one of the sample’s features. Eventually, we arrive at the end of the branch, or leaf. The leaf has a class label, and we conclude our path through the tree by assigning the sample to that class.
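The forking path described above can be sketched as a tiny hand-built tree. The feature names, thresholds, and class labels below are hypothetical, chosen purely for illustration:

```python
# A minimal hand-built decision tree, illustrating the forking path of
# decision points ending in a class label at each leaf. The features,
# thresholds, and labels are hypothetical examples.

def classify(sample):
    """Walk the tree: each decision point applies a rule to one feature."""
    if sample["test_result"] > 4.0:     # first decision point
        if sample["has_symptom"]:       # second decision point
            return "disease A"          # leaf
        return "disease B"              # leaf
    return "healthy"                    # leaf reached after one decision

patient = {"test_result": 5.2, "has_symptom": False}
print(classify(patient))  # -> disease B
```

Each `if` is a decision point with a rule on a single feature, and each `return` is a leaf carrying a class label.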
While we can easily use a decision tree’s rules to classify a sample, where do those rules come from? We can construct them using training data in which a known class accompanies each sample’s features. Our goal is to create a tree that can later predict classes correctly from the features alone. There are a variety of algorithms to train decision trees (14–16), but we will describe one of the simplest methods (17). This method minimizes the heterogeneity, or impurity, of the classes of training data assigned to each branch. First, we identify the rule that will split the training data into two branches with the least class impurity, and establish a decision point with this rule. We then further subdivide the resulting branches by creating new rules in the same way. We continue splitting until we can find no rule that further reduces class impurity. This training process generates a trained decision tree made up of multiple decision points, with each possible path through the tree terminating in a class label.
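The greedy, impurity-minimizing split search described above can be sketched in a few lines, here using Gini impurity on a single numeric feature with toy data:

```python
# A sketch of one step of impurity-minimizing training: try every
# threshold on a feature and keep the split with the lowest weighted
# Gini impurity. Data are synthetic toy values.

from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, impurity) for the least-impure two-way split."""
    best = (None, float("inf"))
    for t in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        if not left or not right:
            continue  # a split must populate both branches
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]
print(best_split(values, labels))  # -> (3.0, 0.0): a perfectly pure split
```

A full training algorithm would recurse on each resulting branch, stopping when no split further reduces impurity.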
Despite ease of interpretation, decision trees often perform poorly on their own (18). We can improve accuracy by instead using an ensemble of decision trees (Fig. 1 B and C), combining votes from each (Fig. 1D). A random forest is such an ensemble, where we select the best feature for splitting at each node from a random subset of the available features (5, 18). This random selection causes the individual decision trees of a random forest to emphasize different features. The resulting diversity of trees can capture more complex feature patterns than a single decision tree and reduces the chance of overfitting to training data. In this way, the random forest improves predictive accuracy.
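The random-forest recipe can be sketched with single-split "stumps" standing in for full trees: each stump trains on a bootstrap sample, considers only a random subset of features, and predictions combine by majority vote. This is an illustrative sketch on toy data, not a production implementation:

```python
# Random-forest sketch: bootstrap samples + random feature subsets +
# majority vote. Decision stumps stand in for full decision trees.

import random
from collections import Counter

def train_stump(X, y, feature_ids):
    """Fit the best single-threshold split among the allowed features."""
    best = None
    for f in feature_ids:
        for t in {row[f] for row in X}:
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            lm = Counter(left).most_common(1)[0][0]   # majority class, left
            rm = Counter(right).most_common(1)[0][0]  # majority class, right
            err = sum(lab != (lm if row[f] <= t else rm)
                      for row, lab in zip(X, y))
            if best is None or err < best[0]:
                best = (err, f, t, lm, rm)
    if best is None:  # degenerate bootstrap: fall back to the majority class
        maj = Counter(y).most_common(1)[0][0]
        return lambda row: maj
    _, f, t, lm, rm = best
    return lambda row: lm if row[f] <= t else rm

def random_forest(X, y, n_trees=25, n_feats=1):
    random.seed(0)
    trees = []
    for _ in range(n_trees):
        idx = [random.randrange(len(X)) for _ in X]       # bootstrap sample
        feats = random.sample(range(len(X[0])), n_feats)  # random feature subset
        trees.append(train_stump([X[i] for i in idx],
                                 [y[i] for i in idx], feats))
    def predict(row):  # combine votes from every tree
        return Counter(t(row) for t in trees).most_common(1)[0][0]
    return predict

X = [[0, 5], [1, 6], [2, 5], [8, 1], [9, 0], [10, 2]]
y = ["no", "no", "no", "yes", "yes", "yes"]
predict = random_forest(X, y)
print(predict([9, 1]))  # -> yes (on this toy data)
```

Because each stump sees different samples and features, the ensemble's vote is more robust than any single split.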
In addition to high predictive performance, random forest classifiers can reveal feature importance (5), telling us how much each feature contributes to class prediction. It is here where the new method of Basu et al. (1) delivers its most important advance. By weighting features according to feature importance, the authors grow more relevant trees to uncover complex interactions. To do this, they iteratively refine a random forest, leading to iRF. First, they begin with a weighted random forest, one in which each feature has equal weight, indicating an equal probability of being chosen. In the initial round, the weighted random forest behaves in the same way as Breiman’s original random forest (2). Second, they repeatedly train weighted random forests, using the feature importance from one iteration as the weights in the next. Third, they use the final weights to generate several weighted random forests, each trained on a random selection of samples. This is a bootstrap selection, meaning each sample can appear more than once. Fourth, Basu et al. (1) use the random intersection trees algorithm (19) to find subsets of features that often co-occur. Fifth, they assess the extracted interactions with a stability score averaged over all bootstrap selections. The stability score describes the fraction of times a recovered interaction occurs, with stable interactions having scores greater than 0.5. A higher stability score means it is less likely that random chance alone caused identification of the interaction.
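The iterative reweighting at the heart of this procedure can be sketched schematically. Here `toy_importance` is a hypothetical stand-in for random-forest feature importance (it scores how well a single-threshold split on one feature separates the classes), the data are synthetic, and the bootstrap and stability-scoring steps are omitted; this is not the authors' implementation:

```python
# Schematic of iRF-style iterative reweighting: weights start uniform,
# and each round's feature importance becomes the next round's weights,
# concentrating attention on informative features. toy_importance is a
# hypothetical stand-in for true random-forest importance.

from collections import Counter

def toy_importance(X, y, f):
    """Best single-threshold split accuracy on feature f (0.5-1.0 here)."""
    best = 0.0
    for t in {row[f] for row in X}:
        left = [lab for row, lab in zip(X, y) if row[f] <= t]
        right = [lab for row, lab in zip(X, y) if row[f] > t]
        correct = (max(Counter(left).values(), default=0) +
                   max(Counter(right).values(), default=0))
        best = max(best, correct / len(y))
    return best

def iterative_reweighting(X, y, n_iter=3):
    n_feats = len(X[0])
    weights = [1.0 / n_feats] * n_feats  # round 1: uniform, as in Breiman
    for _ in range(n_iter):
        imp = [weights[f] * toy_importance(X, y, f) for f in range(n_feats)]
        total = sum(imp)
        weights = [v / total for v in imp]  # importance -> next round's weights
    return weights

# Feature 0 separates the classes perfectly; features 1 and 2 are noise.
X = [[0, 0.84, 0.76], [1, 0.42, 0.26], [2, 0.51, 0.40], [3, 0.78, 0.30],
     [4, 0.48, 0.58], [5, 0.91, 0.50], [6, 0.28, 0.76], [7, 0.76, 0.25],
     [8, 0.62, 0.98], [9, 0.91, 0.90]]
y = ["a"] * 5 + ["b"] * 5
w = iterative_reweighting(X, y)
print(max(range(3), key=lambda f: w[f]))  # -> 0: weight piles onto the signal
```

After a few rounds the weight on the informative feature dominates, which is what lets the later iterations grow trees that repeatedly combine the features most likely to interact.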
To demonstrate iRF’s efficacy, Basu et al. (1) apply it to several genomic problems, detecting multiway interactions between chromatin-interacting proteins, both known and novel. This moves beyond popular techniques that focus on pairwise interactions. For example, they use iRF to predict genomic enhancers in Drosophila melanogaster from quantitative signal of transcription factor and histone modification presence within each genomic region. They identify 20 pairwise transcription factor interactions, of which 16 are consistent with previously reported physical interactions. They also identify novel third-order interactions involving the early regulatory factor Zelda. This provides an intriguing path to further investigating Zelda, linking to past reports of its codependency with other factors that drive enhancer activity.
Of course, iRF provides a flexible method, whose utility extends beyond genomics to any classification and feature selection task.
In simulations, iRF successfully detects up to order-8 interactions. At the same time, iRF maintains predictive performance similar to conventional random forests. To improve further, one might explore ways to combine iRF with other ensemble methods. As Basu et al. (1) mention, AdaBoost (20) focuses on the least reliable parts of decision trees and could complement iRF’s focus on the most reliable parts. Building on iRF in this way will prove easier due to the installable R package the authors provide, making their methodology accessible to users and extenders alike. iRF holds much promise as a new and effective way of detecting interactions in a variety of settings, and its use will help us ensure no branch or leaf is ever left unturned.