如何在 ROSE、SMOTE 等数据平衡技术后保留 id答案

【问题标题】：How to preserve id's after data balancing technique like ROSE, SMOTE如何在 ROSE、SMOTE 等数据平衡技术后保留 id
【发布时间】：2018-01-14 23:48:53
【问题描述】：

df1 = data.frame(id=c('A1','2','B3','4','5','6','7','8','9','10'),s1c1=c(0,0.2,0,0.5,0.8,0,0,0,0,0),s1c2=c(0,0,0.3,0,0,0.9,0.3,0,0,0),s1c3=c(0.1,0,0,0,0,0,0,0.2,0.8,0.1))
df2 = data.frame(id=c('A1','2','B3','4','5','6','7','8','9','10'),s2c1=c(0,0.22,0,0.35,0.8,0,0,0,0,0),s2c2=c(0,0,0.23,0,0,0.7,0.3,0,0,0),s2c3=c(0.2,0,0,0,0,0,0,0.4,0.9,0.4))
df <- merge(df1,df2, by="id",all=TRUE)
df$class <- c(0,0,0,0,0,1,1,0,0,0) 
> df
  id s1c1 s1c2 s1c3 s2c1 s2c2 s2c3 class
  10  0.0  0.0  0.1 0.00 0.00  0.4     0
   2  0.2  0.0  0.0 0.22 0.00  0.0     0
   4  0.5  0.0  0.0 0.35 0.00  0.0     0
   5  0.8  0.0  0.0 0.80 0.00  0.0     0
   6  0.0  0.9  0.0 0.00 0.70  0.0     0
   7  0.0  0.3  0.0 0.00 0.30  0.0     1
   8  0.0  0.0  0.2 0.00 0.00  0.4     1
   9  0.0  0.0  0.8 0.00 0.00  0.9     0
  A1  0.0  0.0  0.1 0.00 0.00  0.2     0
  B3  0.0  0.3  0.0 0.00 0.23  0.0     0

我正在使用 ROSE 函数为不平衡数据生成样本。但是，我想在 ROSE 之后为来自 df 的每个观察保留 id。使用 ROSE 后，我的输出低于输出。

 df.rose <- ROSE(class ~ ., data=df, seed=123,N=20,p=0.25)$data

> df.rose
 id        s1c1         s1c2          s1c3        s2c1         s2c2        s2c3   class
 B3 -0.24636399  0.513435064 -0.0844105623  0.04695640  0.419960189  0.08112992     0
  9 -0.05029030  0.199689698  0.7022285344  0.08255245 -0.133951228  1.16820765     0
  9 -0.23671562  0.167377715  0.9634146745 -0.10923003 -0.129948534  1.00641398     0
 B3 -0.16816685  0.434632663 -0.0174671002 -0.07245581  0.423706144 -0.07969934     0
  9 -0.14420654 -0.015047974  0.8530741203 -0.22148879 -0.053786877  1.18091542     0
  9 -0.38914709 -0.074365870  0.7940190162 -0.23306056 -0.230564666  1.14293933     0
  6  0.19329086  0.807524478 -0.0089820194  0.06600218  0.734243934  0.13409831     0
  6  0.03538563  0.731147735  0.2867432037  0.09746303  0.673766711  0.05837655     0
  4  0.23741363 -0.050535412 -0.0473024899  0.36152575  0.001088718 -0.15354050     0
  2  0.48927513 -0.307561385  0.3177238885  0.42054668  0.072770343  0.33271737     0
 B3  0.09839211  0.827176406 -0.3244875053  0.44579006  0.159991098 -0.14678016     0
 B3 -0.06807770  0.593601657  0.1224855617 -0.10677452  0.351707470  0.53486376     0
  9  0.20651979 -0.272977578  0.8259493668 -0.50212781 -0.041644690  1.27476593     0
  8  0.00000000 -0.008315345  0.0008152742  0.00000000  0.043469230  0.29596908     1
  7  0.00000000  0.155050387 -0.0068404803  0.00000000  0.314397160 -0.50556877     1
  7  0.00000000 -0.008021610  0.0639465277  0.00000000  0.122372337  0.27856790     1
  8  0.00000000 -0.070217063  0.2370763279  0.00000000 -0.013168583  0.04034823     1
  7  0.00000000  0.469712631  0.0130102656  0.00000000  0.566767608  0.18219645     1
  7  0.00000000  0.193749720 -0.0788801623  0.00000000  0.383380004  0.47007644     1
  7  0.00000000  0.412273782 -0.1046108759  0.00000000  0.307614552 -0.35552820     1

在 ROSE 之后我没有得到所有的 id。我想得到我所有的身份证。如果有人知道通过为每个观察保留 id 来处理不平衡数据的任何其他方法。我不想弄乱身份证。我尝试过过采样、欠采样、SMOTE。但是，没有好的结果。我尝试将 id 列转换为因子，但没有成功。

【问题讨论】：

您期望输出什么？过采样会产生更多的样本，那么这些样本的 id 是什么？
我想要所有这些原始 ID。如果它创建重复的 id 是可以的，但我希望它们都在我的重采样数据中
例如。在我的原始集合中，我有 id 为 5 的行，但在重新采样的数据中，id 为 5 的行丢失了。

标签： r machine-learning classification resampling

【解决方案1】：

如果有人仍然想知道，我最终使用了这种方法。我只想要新的综合观察结果，但 SMOTE 不断减小我的数据集的大小。希望对您有所帮助：

library(DMwR)
library(dplyr)

# df - dataframe you want to use over/undersampling on

df$ID <- seq.int(nrow(df))
df_smote <- DMwR::SMOTE(var ~ ., df, perc.over = 100, k = 5)
sub_df <- subset(df_smote, var == "yes")
final_df <- rbind(df, sub_df)
final_df <- distinct(final_df)

创建 ID 列，以确保行完全相同（不是具有相同特征集的观察）

使用带有所需参数的 SMOTE（其中 var 是二进制变量不平衡）。

用一定级别的 var 对合成观测值进行子集 - 在本案例“是”级别。

行绑定子集到原始数据集。

删除 SMOTE 中引入的重复项。

你最终得到的原始数据集只有综合观察具有所需的过采样/欠采样电平。

【讨论】：