Imblearn SMOTE: How to set the sampling_strategy parameter for a multiclass imbalanced dataset?

【Question title】: Imblearn SMOTE: How to set the sampling_strategy parameter for a multiclass imbalanced dataset?
【Posted】: 2021-06-28 09:21:35
【Question】:

I am working with a network attack dataset that has the following shape:

df.shape
(1074992, 42)

The labels for attacks and normal behavior have the following counts:

df['Label'].value_counts()
normal            812814
neptune           242149
satan               5019
ipsweep             3723
portsweep           3564
smurf               3007
nmap                1554
back                 968
teardrop             918
warezclient          893
pod                  206
guesspasswd           53
bufferoverflow        30
warezmaster           20
land                  19
imap                  12
rootkit               10
loadmodule             9
ftpwrite               8
multihop               7
phf                    4
perl                   3
spy                    2
Name: Label, dtype: int64

Next, I split the dataset into features and labels.

labels = df['Label']
features = df.loc[:, df.columns != 'Label'].astype('float64')

Then I tried to balance the dataset.

print("Before UpSampling, counts of label Normal: {}".format(sum(labels == "normal")))
print("Before UpSampling, counts of label Attack: {} \n".format(sum(labels != "normal")))
Before UpSampling, counts of label Normal: 812814
Before UpSampling, counts of label Attack: 262178

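As you can see, the number of attack records is disproportionate to the number of normal records.

I tried to use SMOTE to bring the minority (attack) classes up to the same count as the majority (normal) class.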

sm = SMOTE(k_neighbors = 1,random_state= 42)   #Synthetic Minority Over Sampling Technique
features_res, labels_res = sm.fit_resample(features, labels)
features_res.shape ,labels_res.shape
((18694722, 41), (18694722,))

I don't understand why I get 18694722 rows after applying SMOTE.

print("After UpSampling, counts of label Normal: {}".format(sum(labels_res == "normal")))
print("After UpSampling, counts of label Attack: {} \n".format(sum(labels_res != "normal")))
After UpSampling, counts of label Normal: 812814
After UpSampling, counts of label Attack: 17881908

In my case, would it be better to downsample the normal class or to upsample the attack classes? Any ideas on how to do this properly?

Many thanks.

【Question discussion】:

Tags: python pandas data-processing imblearn smote

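【Solution 1】:

By default, SMOTE's sampling_strategy is 'auto', which for over-samplers is equivalent to 'not majority':

'not majority': resample all classes but the majority class

So every one of the 23 classes is raised to the size of the majority class. Since the majority class has 812814 samples, you end up with

812814 * 23 = 18694722

samples in total.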
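The arithmetic can be checked directly from the label counts posted in the question. The following is a minimal sketch in plain Python; the `counts` dictionary is simply copied from the `value_counts()` output above, and no resampling is actually run.

```python
# Label counts copied from the question's df['Label'].value_counts() output.
counts = {
    "normal": 812814, "neptune": 242149, "satan": 5019, "ipsweep": 3723,
    "portsweep": 3564, "smurf": 3007, "nmap": 1554, "back": 968,
    "teardrop": 918, "warezclient": 893, "pod": 206, "guesspasswd": 53,
    "bufferoverflow": 30, "warezmaster": 20, "land": 19, "imap": 12,
    "rootkit": 10, "loadmodule": 9, "ftpwrite": 8, "multihop": 7,
    "phf": 4, "perl": 3, "spy": 2,
}

n_classes = len(counts)
majority = max(counts.values())

# With 'not majority', every class other than the majority is raised to the
# majority count, so the resampled total is majority * n_classes.
expected_total = majority * n_classes
expected_attack = expected_total - majority

print(n_classes)        # 23
print(expected_total)   # 18694722
print(expected_attack)  # 17881908
```

These values match the resampled shape and the "Attack" count reported in the question.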

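Try passing a dictionary with the desired number of samples for each minority class. From the docs:

When dict, the keys correspond to the targeted classes. The values correspond to the desired number of samples for each targeted class.

Example

Adapted from the docs. In this example we upsample one of the minority classes so that it has the same number of samples as the majority class.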

from sklearn.datasets import make_classification
from collections import Counter
from imblearn.over_sampling import SMOTE 
X, y = make_classification(n_classes=5, 
    class_sep=2, 
    weights=[0.15, 0.15, 0.1, 0.1, 0.5], 
    n_informative=4, 
    n_redundant=1, 
    flip_y=0,
    n_features=20, 
    n_clusters_per_class=1,
    n_samples=1000,
    random_state=10)

# Desired number of samples per class after resampling:
# class 0 is raised to 500 to match the majority class 4.
sample_strategy = {4: 500, 0: 500, 1: 150, 2: 100, 3: 100}

sm = SMOTE(sampling_strategy=sample_strategy, random_state=0)
X_res, y_res = sm.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))
>>>
Resampled dataset shape Counter({4: 500, 0: 500, 1: 150, 3: 100, 2: 100})
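With many classes (23 in the question), writing the dictionary by hand gets tedious, so it can be built from the label counts instead. The sketch below is one possible way to do that, not something taken from the imbalanced-learn docs: the helper name `build_sampling_strategy` and its `floor` parameter are hypothetical, and the toy labels stand in for the real `labels` column. It raises every class that has fewer than `floor` samples up to `floor` and leaves the larger classes at their current size, since an over-sampler cannot reduce a class.

```python
from collections import Counter


def build_sampling_strategy(labels, floor):
    """Hypothetical helper: map each class to its target sample count.

    Classes with fewer than `floor` samples are raised to `floor`;
    classes that are already larger keep their current count.
    """
    counts = Counter(labels)
    return {cls: max(n, floor) for cls, n in counts.items()}


# Toy labels standing in for df['Label'] from the question.
labels = ["normal"] * 50 + ["neptune"] * 20 + ["spy"] * 2
strategy = build_sampling_strategy(labels, floor=10)
print(strategy)  # {'normal': 50, 'neptune': 20, 'spy': 10}

# Assumed usage with imbalanced-learn (not executed here):
# from imblearn.over_sampling import SMOTE
# sm = SMOTE(sampling_strategy=strategy, k_neighbors=1, random_state=42)
# features_res, labels_res = sm.fit_resample(features, labels)
```

The value of `floor` is a judgment call that depends on the dataset. The imbalanced-learn documentation also describes passing a callable that takes `y` and returns such a dictionary, which may be more convenient inside a pipeline. Note that SMOTE needs more samples in each resampled class than `k_neighbors`, which is presumably why the question uses `k_neighbors = 1` for classes as small as 2 samples.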

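【Discussion】:

Thanks @Miguel. I had already explored the dictionary option, but since I have 23 target classes I was looking for something else. I think this option works for my case too, though.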