Abstract - Undersampling is a popular method for dealing with class-imbalance problems, which uses only a subset of the majority class and thus is very efficient. Its main deficiency is that many majority class examples are ignored. We propose two algorithms to overcome this deficiency. EasyEnsemble samples several subsets from the majority class, trains a learner using each of them, and combines the outputs of those learners. BalanceCascade trains the learners sequentially, where in each step, the majority class examples that are correctly classified by the currently trained learners are removed from further consideration. Experimental results show that both methods have higher Area Under the ROC Curve, F-measure, and G-mean values than many existing class-imbalance learning methods. Moreover, they have approximately the same training time as undersampling when the same number of weak classifiers is used, which is significantly faster than other methods.

Index Terms - Class-imbalance learning, data mining, ensemble learning, machine learning, undersampling

In many real-world problems, the data sets are typically imbalanced, i.e., some classes have many more instances than others. The level of imbalance (the ratio of the size of the majority class to that of the minority class) can be huge. It is noteworthy that class imbalance is emerging as an important issue in designing classifiers.

Imbalance has a serious impact on the performance of classifiers. Learning algorithms that do not consider class imbalance tend to be overwhelmed by the majority class and ignore the minority class. For example, in a problem where 99% of the examples belong to the majority class, a learning algorithm that minimizes error rate could decide to classify all examples as the majority class in order to achieve a low error rate of 1%. However, all minority class examples would be misclassified in this case. In problems where the imbalance level is huge, class imbalance must be carefully handled to build a good classifier.
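To make this failure mode concrete, here is a minimal illustration of the arithmetic; the 99:1 split is an assumed example, not a data set from the paper:

```python
# Assumed 99:1 class split: the trivial "always predict majority" rule
# attains a low error rate yet never finds a single minority example.
n_maj, n_min = 99, 1
error_rate = n_min / (n_maj + n_min)   # only the minority example is wrong
minority_recall = 0 / n_min            # no minority example is ever detected
```

A classifier can thus look excellent under plain error rate while being useless for the class we actually care about, which is why metrics such as AUC, F-measure, and G-mean are used later in the paper.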

Class imbalance is also closely related to cost-sensitive learning, another important issue in machine learning. Misclassifying a minority class instance is usually more serious than misclassifying a majority class one. For example, approving a fraudulent credit card application is more costly than declining a credible one. Breiman et al. pointed out that training set size, class priors, cost of errors in different classes, and placement of decision boundaries are all closely connected. In fact, many existing methods for dealing with class imbalance rely on connections among these four components. Sampling methods handle class imbalance by varying the minority and majority class sizes in the training set. Cost-sensitive learning deals with class imbalance by assigning different misclassification costs to the two classes and is considered an important class of methods for handling class imbalance. More details about class-imbalance learning methods are presented in Section II.

In this paper, we examine only binary classification problems, which we address by combining classifiers built from multiple undersampled training sets. Undersampling is an efficient method for class-imbalance learning. It uses a subset of the majority class to train the classifier. Since many majority class examples are ignored, the training set becomes more balanced and the training process becomes faster. However, the main drawback of undersampling is that potentially useful information contained in the ignored examples is neglected. The intuition behind our proposed methods is thus to wisely explore these ignored data while keeping the fast training speed of undersampling.

We propose two ways to use these data. One straightforward way is to sample several subsets independently from N (the majority class), use these subsets to train classifiers separately, and combine the trained classifiers. Another way is to use trained classifiers to guide the sampling process for subsequent classifiers: after some classifiers have been trained, the majority class examples correctly classified by them are removed from further consideration. Experiments on 16 UCI data sets show that both methods have higher Area Under the receiver operating characteristic (ROC) Curve (AUC), F-measure, and G-mean values than many existing class-imbalance learning methods.

III. EasyEnsemble AND BalanceCascade

As was shown by Drummond and Holte, undersampling is an efficient strategy to deal with class imbalance. However, the drawback of undersampling is that it throws away much potentially useful data. In this section, we propose two strategies to explore the majority class examples ignored by undersampling: EasyEnsemble and BalanceCascade.

A. EasyEnsemble

Given the minority training set P and the majority training set N, the undersampling method randomly samples a subset N′ from N, where |N′| < |N|. Usually, we choose |N′| = |P| and therefore have |N′| ≪ |N| for highly imbalanced problems.
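In code, undersampling amounts to drawing |P| indices from N without replacement. The snippet below is a sketch with random stand-in data; the names N, P, and N_prime are ours, chosen to mirror the notation above:

```python
import numpy as np

# Stand-in data: 1000 majority examples and 50 minority examples in 5-D
rng = np.random.default_rng(0)
N = rng.normal(size=(1000, 5))   # majority training set
P = rng.normal(size=(50, 5))     # minority training set

# Undersampling: draw N' from N with |N'| = |P|, without replacement
idx = rng.choice(len(N), size=len(P), replace=False)
N_prime = N[idx]
```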

EasyEnsemble is probably the most straightforward way to further exploit the majority class examples ignored by undersampling, i.e., the examples in N \ N′. In this method, we independently sample several subsets N_1, N_2, …, N_T from N. For each subset N_i (1 ≤ i ≤ T), a classifier H_i is trained using N_i and all of P. All generated classifiers are combined for the final decision. AdaBoost is used to train the classifier H_i. The pseudocode for EasyEnsemble is shown in Algorithm 1.

Algorithm 1 The EasyEnsemble algorithm

1: {Input: A set of minority class examples P, a set of majority class examples N, |P| < |N|; T, the number of subsets to sample from N; and s_i, the number of iterations used to train the AdaBoost ensemble H_i}

2: i ⇐ 0

3: repeat

4: i ⇐ i + 1

5: Randomly sample a subset N_i from N, |N_i| = |P|.

6: Learn H_i using P and N_i. H_i is an AdaBoost ensemble with s_i weak classifiers h_{i,j} and corresponding weights α_{i,j}. The ensemble's threshold is θ_i, i.e.,

$$H_i(x) = \operatorname{sgn}\left(\sum_{j=1}^{s_i} \alpha_{i,j} h_{i,j}(x) - \theta_i\right) \qquad (1)$$

7: until i = T

8: Output: An ensemble

$$H(x) = \operatorname{sgn}\left(\sum_{i=1}^{T}\sum_{j=1}^{s_i} \alpha_{i,j} h_{i,j}(x) - \sum_{i=1}^{T} \theta_i\right) \qquad (2)$$

The idea behind EasyEnsemble is simple. Similar to Balanced Random Forests, EasyEnsemble generates T balanced subproblems. The output of the ith subproblem is the AdaBoost classifier H_i, an ensemble with s_i weak classifiers h_{i,j}. An alternative view of h_{i,j} is to treat it as a feature that is extracted by the ensemble learning method and can only take binary values. H_i, in this viewpoint, is simply a linear classifier built on these features. Features extracted from different subsets N_i thus contain information about different aspects of the original majority training set N. Finally, instead of counting votes from the H_i's, we collect all the features h_{i,j} and form an ensemble classifier from them.
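Algorithm 1 can be sketched compactly in Python. The code below is our minimal illustration, not the paper's implementation: a hand-rolled AdaBoost over decision stumps plays the role of each H_i, and the function names (train_adaboost, adaboost_score, easy_ensemble) are assumptions of ours.

```python
import numpy as np

def train_adaboost(X, y, rounds=10):
    """Minimal AdaBoost with decision stumps; labels y must be in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # example weights
    stumps, alphas = [], []
    for _ in range(rounds):
        best, best_err = None, np.inf
        for feat in range(X.shape[1]):         # exhaustive stump search
            for thresh in np.unique(X[:, feat]):
                for pol in (1, -1):
                    pred = np.where(X[:, feat] >= thresh, pol, -pol)
                    err = w[pred != y].sum()
                    if err < best_err:
                        best_err, best = err, (feat, thresh, pol)
        best_err = min(max(best_err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - best_err) / best_err)
        feat, thresh, pol = best
        pred = np.where(X[:, feat] >= thresh, pol, -pol)
        w *= np.exp(-alpha * y * pred)         # up-weight misclassified examples
        w /= w.sum()
        stumps.append(best)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_score(X, stumps, alphas):
    """Weighted sum of weak-classifier outputs (threshold theta_i = 0 here)."""
    s = np.zeros(len(X))
    for (feat, thresh, pol), a in zip(stumps, alphas):
        s += a * np.where(X[:, feat] >= thresh, pol, -pol)
    return s

def easy_ensemble(P, N, T=4, rounds=10, seed=0):
    """Train T AdaBoost ensembles on balanced subsets and merge all weak learners."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(T):
        idx = rng.choice(len(N), size=len(P), replace=False)  # |N_i| = |P|
        X = np.vstack([P, N[idx]])
        y = np.concatenate([np.ones(len(P)), -np.ones(len(P))])
        models.append(train_adaboost(X, y, rounds))
    def predict(X):
        # As in (2): sum weighted weak-learner outputs over all T ensembles
        return np.sign(sum(adaboost_score(X, s, a) for s, a in models))
    return predict
```

Note that, following (2), prediction sums the weighted weak-learner outputs across all T ensembles rather than taking a majority vote of the H_i's.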

The output of EasyEnsemble is a single ensemble, but it looks like an "ensemble of ensembles". It is known that boosting mainly reduces bias, while bagging mainly reduces variance. Several works combine different ensemble strategies to achieve stronger generalization. MultiBoosting combines boosting with bagging/wagging by using boosted ensembles as base learners. Stochastic Gradient Boosting and Cocktail Ensemble also combine different ensemble strategies. It is evident that EasyEnsemble benefits from the combination of boosting and a bagging-like strategy with balanced class distribution.

Both EasyEnsemble and Balanced Random Forests try to use balanced bootstrap samples; however, the former uses the samples to generate boosted ensembles, while the latter uses the samples to build random decision trees. Costing also uses multiple samples of the original training set. Costing was initially proposed as a cost-sensitive learning method, while EasyEnsemble is proposed to deal with class imbalance directly. Moreover, the working style of EasyEnsemble is quite different from that of costing. For example, the costing method samples examples with probability proportional to their costs (rejection sampling). Since this is a probability-based sampling method, no positive example is guaranteed to appear in all the samples (in fact, the probability of a positive example appearing in all the samples is small). In EasyEnsemble, by contrast, all the positive examples appear in every sample. When the size of the minority class is very small, it is important to utilize every minority class example.
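The point about rejection sampling can be checked with one line of arithmetic. Assuming, purely for illustration, that each of T samples independently retains a given positive example with probability q:

```python
# If a positive example survives each of T independent samples with
# probability q < 1, the chance it appears in all T samples is q**T,
# which decays geometrically. q = 0.8 and T = 10 are illustrative values.
q, T = 0.8, 10
p_in_all = q ** T   # probability the example appears in every sample
```

Even with a generous per-sample retention probability of 0.8, the example is present in all ten samples only about one time in ten, whereas EasyEnsemble includes every minority example in every sample by construction.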

B. BalanceCascade

EasyEnsemble is an unsupervised strategy to explore N since it uses independent random sampling with replacement. Our second algorithm, BalanceCascade, explores N in a supervised manner. The idea is as follows. After H_i is trained, if an example x ∈ N is correctly classified to be in the majority class by H_i, it is reasonable to conjecture that x is somewhat redundant in N, given that we already have H_i. Thus, we can remove some correctly classified majority class examples from N. As in EasyEnsemble, we use AdaBoost in this method. The pseudocode of BalanceCascade is described in Algorithm 2.

Algorithm 2 The BalanceCascade algorithm

1: {Input: A set of minority class examples P, a set of majority class examples N, |P| < |N|; T, the number of subsets to sample from N; and s_i, the number of iterations used to train the AdaBoost ensemble H_i}

2: i ⇐ 0, f ⇐ (|P|/|N|)^{1/(T−1)}. f is the false positive rate (the error rate of misclassifying a majority class example to the minority class) that H_i should achieve.

3: repeat

4: i ⇐ i + 1

5: Randomly sample a subset N_i from N, |N_i| = |P|.

6: Learn H_i using P and N_i. H_i is an AdaBoost ensemble with s_i weak classifiers h_{i,j} and corresponding weights α_{i,j}. The ensemble's threshold is θ_i, i.e.,

$$H_i(x) = \operatorname{sgn}\left(\sum_{j=1}^{s_i} \alpha_{i,j} h_{i,j}(x) - \theta_i\right) \qquad (3)$$

7: Adjust θ_i such that H_i's false positive rate is f.

8: Remove from N all examples that are correctly classified by H_i.

9: until i = T

10: Output: A single ensemble

$$H(x) = \operatorname{sgn}\left(\sum_{i=1}^{T}\sum_{j=1}^{s_i} \alpha_{i,j} h_{i,j}(x) - \sum_{i=1}^{T} \theta_i\right) \qquad (4)$$

This method is called BalanceCascade since it is somewhat similar to the cascade classifier. The majority training set N is shrunk after every H_i is trained, and every node H_i deals with a balanced subproblem (|N_i| = |P|). However, the final classifier is different. A cascade classifier is the conjunction of all H_i, i.e., it predicts positive if and only if every H_i (1 ≤ i ≤ T) predicts positive. Viola and Jones used the cascade classifier mainly to achieve fast testing speed, while in BalanceCascade, the sequential dependence between classifiers is mainly exploited to reduce the redundant information in the majority class. This sampling strategy restricts the sample space, allowing the subsequent undersampling steps to explore as much useful information as possible.
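The cascade loop can be sketched as follows, under simplifying assumptions: a nearest-centroid scorer stands in for the AdaBoost ensemble H_i, so the sketch shows only the threshold adjustment of line 7 and the removal of line 8, not the paper's actual learner, and all names (balance_cascade, score, theta) are ours.

```python
import numpy as np

def balance_cascade(P, N, T=4, seed=0):
    """Sketch of the BalanceCascade loop; returns the nodes and the remaining N."""
    rng = np.random.default_rng(seed)
    f = (len(P) / len(N)) ** (1.0 / (T - 1))  # target false positive rate per node
    nodes = []
    for _ in range(T):
        idx = rng.choice(len(N), size=len(P), replace=False)  # |N_i| = |P|
        Ni = N[idx]
        cp, cn = P.mean(axis=0), Ni.mean(axis=0)  # minority / majority centroids
        def score(X, cp=cp, cn=cn):
            # larger score -> more "minority-like" (closer to cp than to cn)
            return np.linalg.norm(X - cn, axis=1) - np.linalg.norm(X - cp, axis=1)
        s_maj = score(N)
        # line 7: choose theta so a fraction f of majority examples score
        # above it, i.e., the node's false positive rate on N is f
        theta = np.quantile(s_maj, 1.0 - f)
        nodes.append((score, theta))
        # line 8: drop the majority examples this node classifies correctly
        N = N[s_maj > theta]
        if len(N) <= len(P):   # the problem is now balanced; stop early
            break
    return nodes, N
```

Each pass keeps only the "hard" majority examples (the node's false positives), so later nodes concentrate on the part of N that earlier nodes could not separate.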

BalanceCascade is similar to EasyEnsemble in structure. The main difference between them lies in lines 7 and 8 of Algorithm 2. Line 8 removes the correctly classified majority class examples from N, and line 7 specifies how many majority class examples can be removed: since H_i's false positive rate is adjusted to f, a fraction f of the majority class examples in N survive each iteration. At the beginning of the ith iteration, N has been shrunk i − 1 times, and therefore, its current size is |N| · f^{i−1}. Since f = (|P|/|N|)^{1/(T−1)}, after H_T is trained and N is shrunk again, the size of N is smaller than |P|. We can stop the training process at this time.
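The shrinkage arithmetic can be verified directly; the sizes below are illustrative, not from the paper's data sets:

```python
# With f = (|P|/|N|)^(1/(T-1)), the majority set shrinks by a factor f per
# iteration, so after T-1 shrinks its size is |N| * f^(T-1) = |P| exactly.
n_maj, n_min, T = 10000, 100, 5
f = (n_min / n_maj) ** (1.0 / (T - 1))
size = n_maj
for _ in range(T - 1):
    size *= f
```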
