How can I repeat rows of the minority class in the train set? Answers

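【Question title】: How can I repeat rows of minority class in train set?
【Posted】: 2019-03-26 04:53:12
【Question description】:

I want to repeat specific rows of my minority class in my train set. I know this is not a very elegant approach, but I just want to try it out.

Suppose I have this data frame: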

> df

    group     type  number
1   class1     one    4
2   class1   three   10
3   class1    nine    3
4   class4   seven    9
5   class1   eight    4
6   class1     ten    2
7   class1     two   22
8   class4  eleven    8
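
For anyone who wants to run the answers below, the printed data frame can be recreated roughly like this. It is a minimal sketch that assumes `group` and `type` are plain character columns; the question does not say whether they are characters or factors.

```r
# Rebuild the example data frame shown above.
# Assumption: group and type are character columns (not factors).
df <- data.frame(
  group  = c("class1", "class1", "class1", "class4",
             "class1", "class1", "class1", "class4"),
  type   = c("one", "three", "nine", "seven",
             "eight", "ten", "two", "eleven"),
  number = c(4, 10, 3, 9, 4, 2, 22, 8),
  stringsAsFactors = FALSE
)

# Class counts: 6 rows of class1 and 2 rows of class4
table(df$group)
```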

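Now I want to repeat the rows of my minority class (class4) often enough that the new data frame contains 50% class1 and 50% class4.

I know there is the rep function, but I could only find solutions that repeat the whole data frame.

How can I do this?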

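【Question discussion】:

What if you have more than 2 groups? Would you want them at 33% each then?
No, I only have these two classes.
You don't need to do this: if you merely want to upweight the minority class in resampling, just set the per-class weights inversely proportional to frequency. Most classifiers (RF, trees, LR, NN etc.) allow weights. If you want to resample the minority class by creating synthetic examples, use SMOTE. See Dealing with the class imbalance in binary classification
@smci Thanks for your comment! I have already weighted my minority class in the decision tree and used the SMOTE function, but the results were not very promising.
@pineapple: Hmm, please tell us more. What is your evaluation function for training? (Raw accuracy? AUC? Something else?) How large is the class imbalance? Please post a table. As for SMOTE, please post your exact command line. Also post the before and after scores of the evaluation function.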
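
The weighting idea from the comments can be sketched in base R as follows. This is only an illustration under stated assumptions: the example data frame is rebuilt with just the `group` column, and the variable names `class_freq` and `w` are made up here. The resulting vector could then be passed to a model's `weights` argument where one exists (for example in `glm` or `rpart`).

```r
# Example data from the question (only the class column is needed here)
df <- data.frame(
  group = c("class1", "class1", "class1", "class4",
            "class1", "class1", "class1", "class4"),
  stringsAsFactors = FALSE
)

# Frequency of each class
class_freq <- table(df$group)

# One weight per row, inversely proportional to the frequency of its class
w <- as.numeric(1 / class_freq[as.character(df$group)])

# The total weight per class is now the same for both classes
tapply(w, df$group, sum)
```

With these weights, each class4 row counts three times as much as a class1 row, without any row being duplicated.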

Tags: r machine-learning classification repeat sampling

【Solution 1】:

Here is an option using tidyverse:

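library(tidyverse)

# Size of the largest class (6 rows of class1 in the example)
n1 <- df %>%
        count(group) %>%
        slice(which.max(n)) %>%
        pull(n)

# Repeat each class4 row n1/2 times (2 = number of class4 rows in the example),
# then append the untouched class1 rows
df %>%
   filter(group == "class4") %>%
   mutate(n = n1/2) %>%
   uncount(n) %>%
   bind_rows(filter(df, group == "class1"))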
#    group   type number
#1  class4  seven      9
#2  class4  seven      9
#3  class4  seven      9
#4  class4 eleven      8
#5  class4 eleven      8
#6  class4 eleven      8
#7  class1    one      4
#8  class1  three     10
#9  class1   nine      3
#10 class1  eight      4
#11 class1    ten      2
#12 class1    two     22

【Discussion】:

【Solution 2】:

A base R approach:

#Count frequency of groups
tab <- table(df$group)

#Count number of rows to be added
no_of_rows <- max(tab) - min(tab)

#Find the row indices that already belong to the smallest group
existing_rows <- which(df$group %in% names(which.min(tab)))

#Add new rows
new_df <- rbind(df, df[rep(existing_rows,no_of_rows/length(existing_rows)), ])

#Check the count
table(new_df$group)

#class1 class4 
#     6      6
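
The call above works because 4 extra rows divide evenly over the 2 existing class4 rows. When the numbers do not divide evenly, `rep` truncates a fractional `times` value and the classes end up unequal. A possible variation, shown on a hypothetical 7-versus-2 data frame that is not part of the question, uses the `length.out` argument of `rep` instead:

```r
# Hypothetical data: 7 rows of class1 and 2 rows of class4 (5 rows are missing)
df <- data.frame(
  group = c(rep("class1", 7), rep("class4", 2)),
  value = 1:9,
  stringsAsFactors = FALSE
)

tab <- table(df$group)

# Number of rows to add and the indices of the existing minority rows
no_of_rows <- max(tab) - min(tab)
existing_rows <- which(df$group == names(which.min(tab)))

# Cycle through the minority rows until exactly no_of_rows indices are collected
extra <- rep(existing_rows, length.out = no_of_rows)

new_df <- rbind(df, df[extra, ])
table(new_df$group)
```

Some minority rows are repeated once more than others here, which is unavoidable whenever the counts are not multiples of each other.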

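【Discussion】:

@pineapple: You really don't need to do this, just set weights on your classifier. Almost all good classifier implementations support per-class or per-example weights.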

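【Solution 3】:

I suggest you use either the "Synthetic Minority Over-sampling Technique (SMOTE)" (Chawla et al., 2002) or "Random Over-Sampling Examples (ROSE)" (Menardi and Torelli, 2013).

1) You can adjust the sampling within each cross-validation fold by adding sampling= to trainControl.

For example: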

library(caret) #for trainControl

#Up-sample the minority class inside each resampling iteration
trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 10,
                     sampling = "up")
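
To illustrate what sampling within each fold means, here is a hand-rolled base R sketch. It is not caret's internal implementation: the helper `up_sample`, the choice of 5 folds, and the variable names are assumptions made for this example. The point is that only the training part of each fold is up-sampled, while the held-out part keeps its original class distribution.

```r
# Hypothetical helper (not part of caret): up-sample every class to the size
# of the largest class by drawing extra rows with replacement.
up_sample <- function(d, class_col) {
  idx_by_class <- split(seq_len(nrow(d)), d[[class_col]], drop = TRUE)
  target <- max(lengths(idx_by_class))
  picked <- lapply(idx_by_class, function(idx) {
    extra <- idx[sample.int(length(idx), target - length(idx), replace = TRUE)]
    c(idx, extra)
  })
  d[unlist(picked, use.names = FALSE), , drop = FALSE]
}

set.seed(100)
dat <- iris[1:70, ]                 # 50 setosa, 20 versicolor
dat$Species <- factor(dat$Species)  # drop the unused third level

# Assign each row to one of k folds at random
k <- 5
fold <- sample(rep(seq_len(k), length.out = nrow(dat)))

balanced <- logical(k)
test_sizes <- integer(k)
for (i in seq_len(k)) {
  train_part <- dat[fold != i, ]
  test_part  <- dat[fold == i, ]   # left untouched

  # Up-sample the training part only
  train_up <- up_sample(train_part, "Species")

  balanced[i] <- length(unique(as.vector(table(train_up$Species)))) == 1
  test_sizes[i] <- nrow(test_part)
}

balanced         # TRUE for every fold: training parts are balanced
sum(test_sizes)  # held-out rows still add up to the original 70
```

Up-sampling before splitting would instead let copies of the same minority row appear in both the training and the held-out part, which tends to make the evaluation look better than it is.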

2) Alternatively, adjust the sampling before training by calling the SMOTE and ROSE functions.

library("DMwR") #for smote
library("ROSE")

dat <- iris[1:70,]
dat$Species <- factor(dat$Species)

table(dat$Species) #class imbalances

#    setosa versicolor
#        50         20

set.seed(100)
smote_train <- SMOTE(Species ~ ., data  = dat)
table(smote_train$Species)

#    setosa versicolor
#        80         60

set.seed(100)
rose_train <- ROSE(Species ~ ., data  = dat)$data
table(rose_train$Species)

#    setosa versicolor
#        37         33

【Discussion】: