Evaluating different SMOTE and RandomUnderSampling strategies

【Question title】: Evaluate SMOTE and RandomUnderSampling different strategies
【Posted】: 2022-01-21 01:18:13
【Question description】:

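I am working with a pandas data frame df in Python. I am doing a classification task with two imbalanced classes, df['White'] and df['Non-white']. To handle the imbalance, I built a pipeline that contains both SMOTE and RandomUnderSampling.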

This is what my pipeline looks like:

model = Pipeline([
        ('preprocessor', preprocessor),
        ('smote', over),
        ('random_under_sampler', under),
        ('classification', knn)
    ])

These are the exact steps:

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('knnimputer', KNNImputer(),
                                                  ['policePrecinct']),
                                                 ('onehotencoder-1',
                                                  OneHotEncoder(), ['gender']),
                                                 ('standardscaler',
                                                  StandardScaler(),
                                                  ['long', 'lat']),
                                                 ('onehotencoder-2',
                                                  OneHotEncoder(),
                                                  ['neighborhood',
                                                   'problem'])])),
                ('smote', SMOTE()),
                ('random_under_sampler', RandomUnderSampler()),
                ('classification', KNeighborsClassifier())])

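I want to evaluate different sampling_strategy values for SMOTE and RandomUnderSampling. Can I do this directly in GridSearch when tuning the parameters? For now, I have written the following for loop. The loop does not work (ValueError: too many values to unpack (expected 2)).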

strategy_sm = [0.1, 0.3, 0.5]
strategy_un = [0.15, 0.30, 0.50]
best_strat = []

for k, n in strategy_sm, strategy_un:
    over = SMOTE(sampling_strategy=k)
    under = RandomUnderSampler(sampling_strategy=n)
    model = Pipeline([
        ('preprocessor', preprocessor),
        ('smote', over),
        ('random_under_sampler', under),
        ('classification', knn)
    ])
    mode.fit(X_train, y_train)
    best_strat.append[(model.score(X_train, y_train))]

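I am not very proficient in Python, and I suspect there is a better way to do this. I would also like the for loop (if that is indeed the way to go) to let me visualize how performance differs across sampling_strategy combinations. Any ideas?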
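For reference, the ValueError quoted above comes from the loop header rather than from the samplers. `for k, n in strategy_sm, strategy_un` iterates over the two-element tuple `(strategy_sm, strategy_un)`, so each item is a three-element list that cannot be unpacked into `k` and `n`. The sketch below reproduces the error and shows two possible replacements, assuming the intent was either to pair the values position by position (`zip`) or to try every combination (`itertools.product`). The names `unpack_error`, `pairs`, and `combos` are illustrative only.

```python
from itertools import product

strategy_sm = [0.1, 0.3, 0.5]
strategy_un = [0.15, 0.30, 0.50]

# The original header walks the tuple (strategy_sm, strategy_un):
# each item is a 3-element list, which cannot be unpacked into two names.
try:
    for k, n in strategy_sm, strategy_un:
        pass
    unpack_error = None
except ValueError as exc:
    unpack_error = str(exc)

# Option 1: pair the values position by position (3 pairs).
pairs = list(zip(strategy_sm, strategy_un))

# Option 2: try every combination of the two lists (3 x 3 = 9 pairs).
combos = list(product(strategy_sm, strategy_un))

for k, n in combos:
    print(f"SMOTE sampling_strategy={k}, RandomUnderSampler sampling_strategy={n}")
```

Two further slips in the loop body would surface once the header is fixed: `mode.fit` should be `model.fit`, and `best_strat.append[...]` needs parentheses, as in `best_strat.append(...)`.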

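【Question discussion】:

Are you sure you want to undersample after oversampling in the same pipeline like this? Your code does not evaluate SMOTE and RandomUnderSampling independently within the pipeline.
I am following this guide, which says: "The original paper on SMOTE suggested combining SMOTE with random undersampling of the majority class." I have checked, and they do indeed suggest this. Unfortunately, if you do not plug in the two samplers as separate steps, you run into all sorts of problems.
Sounds good. My comment was mostly about the wording; I thought you wanted to evaluate them independently.
I see now. Yes, it makes sense not to evaluate them separately. I had tried to combine them into a single pipeline, but I kept getting TypeError: All intermediate steps of the chain should be estimators that implement fit and transform or fit_resample. I think a flat pipeline containing both the oversampler and the undersampler is necessary, because nesting is ambiguous: the imbalanced-learn pipeline defines both fit/transform and fit_resample.
You may need to use the imblearn pipeline, because the sampler interface may not match what the scikit-learn pipeline expects.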

Tags: python pandas machine-learning scikit-learn

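【Solution 1】:

Below is an example of how to use stratified cross-validation (3 folds in the code below) to compare a classifier's accuracy across different parameter combinations and visualize the results.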

import pandas as pd
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# generate some data
X, y = make_classification(n_classes=2, weights=[0.1, 0.9], n_features=20, random_state=42)

# define the pipeline
estimator = Pipeline([
    ('smote', SMOTE()),
    ('random_under_sampler', RandomUnderSampler()),
    ('classification', KNeighborsClassifier())
])

# define the parameter grid
param_grid = {
    'smote__sampling_strategy': [0.3, 0.4, 0.5],
    'random_under_sampler__sampling_strategy': [0.5, 0.6, 0.7]
}

# run a grid search to calculate the cross-validation
# accuracy associated to each parameter combination
clf = GridSearchCV(
    estimator=estimator,
    param_grid=param_grid,
    cv=StratifiedKFold(n_splits=3)
)

clf.fit(X, y)

# organize the grid search results in a data frame
res = pd.DataFrame(clf.cv_results_)

res = res.rename(columns={
    'param_smote__sampling_strategy': 'smote_strategy',
    'param_random_under_sampler__sampling_strategy': 'random_under_sampler_strategy',
    'mean_test_score': 'accuracy'
})

res = res[['smote_strategy', 'random_under_sampler_strategy', 'accuracy']]

print(res)
#   smote_strategy random_under_sampler_strategy  accuracy
# 0            0.3                           0.5  0.829471
# 1            0.4                           0.5  0.869578
# 2            0.5                           0.5  0.899881
# 3            0.3                           0.6  0.809269
# 4            0.4                           0.6  0.819370
# 5            0.5                           0.6  0.778669
# 6            0.3                           0.7  0.708259
# 7            0.4                           0.7  0.778966
# 8            0.5                           0.7  0.768568

# plot the grid search results
res_ = res.pivot(index='smote_strategy', columns='random_under_sampler_strategy', values='accuracy')
sns.heatmap(res_, annot=True, cbar_kws={'label': 'accuracy'})

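【Discussion】:

Thanks for your answer, Flavia. I keep getting the following error: ValueError: The specified ratio required to remove samples from the minority class while trying to generate new samples. Please increase the ratio.
I am training clf on X_train and y_train. Here are the value counts of y_train: Non-white 34707, White 15718.
Try using StratifiedKFold, as in the updated code I just posted.
I have experimented with a wide range of ratios ([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]) for both samplers (just to explore) and ignored the errors. In the end I got a plot showing that the only valid values for the over- and under-samplers are [0.5, 0.6, 0.7]. Edit: I will try the code and report back.
The fix does not work and returns the same error as the previous code (but also the same final result). However, this makes sense to me: put simply, the data only allow a small set of ratio combinations. I will not go above 0.7, because I would like to avoid overfitting. Thanks for your effort. I will try to find a way to skip the errors directly and get the final plot without the red error messages.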
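The observation in the last two comments can be checked with simple arithmetic. Assuming imbalanced-learn treats a float sampling_strategy as the desired minority-to-majority ratio after resampling (as its documentation describes), the data already have a ratio of 15718 / 34707 ≈ 0.453, so SMOTE ratios of 0.4 or less would require removing minority samples, which is what the ValueError reports. The undersampler's ratio must then be at least the ratio SMOTE produced. The helper `valid_strategy_pairs` below is hypothetical and only approximates the library's internal rounding; it is a sketch for pre-filtering the grid, not the library's own check.

```python
from itertools import product


def valid_strategy_pairs(n_majority, n_minority, smote_ratios, under_ratios):
    """Return (smote, under) ratio pairs that should not trigger the ValueError."""
    pairs = []
    for s, u in product(smote_ratios, under_ratios):
        # SMOTE with a float strategy targets minority/majority == s,
        # so it has to add at least one synthetic minority sample.
        n_new = int(n_majority * s - n_minority)
        if n_new <= 0:
            continue
        n_minority_after = n_minority + n_new
        # The undersampler targets minority/majority == u; that must not
        # require more majority samples than are actually available.
        if int(n_minority_after / u) > n_majority:
            continue
        pairs.append((s, u))
    return pairs


ratios = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]

# Class counts taken from the y_train value counts quoted in the comments.
pairs = valid_strategy_pairs(34707, 15718, ratios, ratios)
print(pairs)

# One single-combination dict per valid pair, so a grid search skips the rest.
param_grid = [
    {"smote__sampling_strategy": [s], "random_under_sampler__sampling_strategy": [u]}
    for s, u in pairs
]
```

Passing a list of dicts like `param_grid` to GridSearchCV restricts the search to those combinations, which should avoid the failed fits and the associated error output. Because each cross-validation training fold has slightly different class counts, pairs that sit exactly on the boundary may still fail in some folds.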