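[Posted]: 2021-07-27 20:56:11
[Question]:
I am trying two sampling methods on imbalanced data. I used the "upSample" function from the "caret" package and everything went well. However, when I use the "downSample" function, I get the following error: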
Error in sample.int(length(x), size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'
The command I used is:
downtrain_eli <- downSample(x = trainset_eli[, -16],
                            y = trainset_eli$Comportamento)
"trainset_eli" has 34 columns and 70,800 rows.
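The error message says that some group being sampled holds fewer rows than the requested sample size. A minimal base-R sketch of per-class down-sampling follows; it uses a hypothetical toy data frame (`toy`, not the real `trainset_eli`) and does not call caret at all, so it only illustrates the technique. One thing that may be worth checking, as an unverified assumption rather than a confirmed diagnosis: the predictors include a column literally named `y`, which could clash with the `y` argument if `downSample` groups rows by that name internally.

```r
set.seed(42)

# Hypothetical toy data standing in for trainset_eli (names are illustrative)
toy <- data.frame(
  x = rnorm(60),
  y = sample(-5:5, 60, replace = TRUE),
  Comportamento = factor(rep(c("1", "2", "4"), times = c(40, 15, 5)))
)

# Class counts: the smallest class sets the per-class sample size
counts <- table(toy$Comportamento)
n_min <- min(counts)

# Draw n_min row indices from every class, without replacement
rows_by_class <- split(seq_len(nrow(toy)), toy$Comportamento)
idx <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), n_min)]
}))

down_toy <- toy[idx, ]
table(down_toy$Comportamento)
```

Inspecting `table(trainset_eli$Comportamento)` in the same way on the real data would show whether any class is very small or empty before sampling.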
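Since I am using a random forest model to predict a multiclass (6 classes) response variable, I am testing these two functions (up-sampling and down-sampling) to balance the data. However, I see that the "caret" package also has the "train" function, which offers more options for balancing data. But that function fits a model, and I only want to create a dataset with balanced data and then use it in my random forest model. Is it better to keep using the up/down functions or to use the "train" function? If the latter, how do I implement it in my random forest model?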
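For the "train" route, one possible sketch is shown here. It assumes the caret and randomForest packages are installed, and it uses a hypothetical toy data frame (`toy`, with made-up predictors `f1` and `f2`) rather than the real data. The `sampling = "down"` option of `trainControl` asks caret to down-sample within each resample, so no separately balanced dataset has to be built first.

```r
library(caret)

set.seed(42)

# Hypothetical toy data; column names and class labels are illustrative only
toy <- data.frame(
  f1 = rnorm(300),
  f2 = rnorm(300),
  Comportamento = factor(rep(c("a", "b", "c"), times = c(200, 70, 30)))
)

# Down-sample inside each cross-validation fold instead of beforehand
ctrl <- trainControl(method = "cv", number = 5, sampling = "down")

fit <- train(
  Comportamento ~ .,
  data = toy,
  method = "rf",      # random forest via the randomForest package
  trControl = ctrl,
  ntree = 100         # passed through to randomForest()
)

fit
```

Whether this is preferable to calling "upSample" or "downSample" once up front depends on the goal: sampling inside resampling keeps the held-out folds at their original class proportions, while a pre-balanced dataset can be reused with any modelling function.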
str(trainset_eli)
$ date : chr "01/10/2019" "24/09/2019" "01/10/2019" "01/10/2019" ...
$ air.temp : num 18.4 32.6 34.5 26.4 32.6 ...
$ relat.u : num 70 30.4 22.2 50.7 30.8 ...
$ wind.sp : num 1.14 2.81 1.51 3.33 2.17 ...
$ wind.dir : num 79.1 341.6 350.1 56.2 294.9 ...
$ solar.rad : num 39.6 741 433.9 621.1 274.6 ...
$ max.raj : num 1.65 5.25 2.85 6.05 4.45 ...
$ time : chr "06:40:00" "14:10:00" "14:40:00" "09:20:00" ...
$ timedate : POSIXct, format: "2019-10-01 06:43:48" "2019-09-24 14:10:45" "2019-10-01 14:48:50" ...
$ sensorid : int 67 65 66 70 70 70 69 68 69 65 ...
$ x : int -56 -49 15 35 -4 27 -40 33 -29 -47 ...
$ y : int -11 0 -4 24 10 34 -43 4 -4 5 ...
$ z : int -27 -37 -56 -20 -16 -44 -51 -49 -53 -41 ...
$ i.date : chr "01/10/2019" "24/09/2019" "01/10/2019" "01/10/2019" ...
$ i.time : chr "06:43:48" "14:10:45" "14:48:50" "09:21:41" ...
$ Comportamento: Factor w/ 6 levels "1","2","4","5",..: 6 3 3 5 2 2 1 1 2 1 ...
$ xg : num -0.875 -0.7656 0.2344 0.5469 -0.0625 ...
$ yg : num -0.1719 0 -0.0625 0.375 0.1562 ...
$ zg : num -0.422 -0.578 -0.875 -0.312 -0.25 ...
$ SMA : num 1.469 1.344 1.172 1.234 0.469 ...
$ SVM : num 0.986 0.959 0.908 0.733 0.301 ...
$ mov.var : num 0.0625 0.1094 0.0469 1.0156 1 ...
$ energy : num 0.94701 0.84715 0.67974 0.28875 0.00825 ...
$ entropy : num 0.2526 0.1219 0.0354 0.8179 0.0172 ...
$ pitch : num 62.5 52.9 -15 -48.2 12 ...
$ roll : num -158 180 -176 130 148 ...
$ inclination : num -64.7 -52.9 -15.5 -64.8 -33.9 ...
$ year : num 2019 2019 2019 2019 2019 ...
$ month : num 10 9 10 10 9 10 10 10 10 10 ...
$ day : int 1 24 1 1 24 1 1 1 1 1 ...
$ dayofweek : num 3 3 3 3 3 3 3 3 3 3 ...
$ hour : int 6 14 14 9 16 13 6 16 7 6 ...
$ minute : int 43 10 48 21 38 35 43 48 20 36 ...
$ second : num 48 45 50 41 45 16 36 13 43 57 ...
> dput(head(trainset_eli))
structure(list(date = c("01/10/2019", "24/09/2019", "01/10/2019",
"01/10/2019", "24/09/2019", "01/10/2019"), air.temp = c(18.42,
32.63, 34.54, 26.42, 32.63, 34.44), relat.u = c(70, 30.45, 22.19,
50.69, 30.83, 25.67), wind.sp = c(1.136, 2.809, 1.512, 3.326,
2.171, 2.04), wind.dir = c(79.1, 341.6, 350.1, 56.22, 294.9,
16.57), solar.rad = c(39.62, 741, 433.9, 621.1, 274.6, 847),
max.raj = c(1.647, 5.247, 2.847, 6.047, 4.447, 4.447), time = c("06:40:00",
"14:10:00", "14:40:00", "09:20:00", "16:30:00", "13:30:00"
), timedate = structure(c(1569912228, 1569334245, 1569941330,
1569921701, 1569343125, 1569936916), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), sensorid = c(67L, 65L, 66L, 70L,
70L, 70L), x = c(-56L, -49L, 15L, 35L, -4L, 27L), y = c(-11L,
0L, -4L, 24L, 10L, 34L), z = c(-27L, -37L, -56L, -20L, -16L,
-44L), i.date = c("01/10/2019", "24/09/2019", "01/10/2019",
"01/10/2019", "24/09/2019", "01/10/2019"), i.time = c("06:43:48",
"14:10:45", "14:48:50", "09:21:41", "16:38:45", "13:35:16"
), Comportamento = structure(c(6L, 3L, 3L, 5L, 2L, 2L), .Label = c("1",
"2", "4", "5", "6", "7"), class = "factor"), xg = c(-0.875,
-0.765625, 0.234375, 0.546875, -0.0625, 0.421875), yg = c(-0.171875,
0, -0.0625, 0.375, 0.15625, 0.53125), zg = c(-0.421875, -0.578125,
-0.875, -0.3125, -0.25, -0.6875), SMA = c(1.46875, 1.34375,
1.171875, 1.234375, 0.46875, 1.640625), SVM = c(0.986480882354037,
0.959380089563047, 0.907999389110477, 0.733044006608744,
0.30136408628103, 0.965847466282849), mov.var = c(0.0625,
0.109375, 0.046875, 1.015625, 1, 0.078125), energy = c(0.947010278701782,
0.847154855728149, 0.679739058017731, 0.288748800754547,
0.00824832916259766, 0.870230257511139), entropy = c(0.252618304422212,
0.121902803377891, 0.0354050216019417, 0.817915633557388,
0.0171719387098626, 0.109209155417093), pitch = c(62.4975813343597,
52.9434718105904, -14.9586823290351, -48.247900416119, 11.9694631246073,
-25.8994130495892), roll = c(-157.833654177918, 180, -175.914383220025,
129.805571092265, 147.994616791916, 142.305759533311), inclination = c(-64.6810700998259,
-52.9434718105904, -15.4942996397858, -64.7667344528855,
-33.9462950277539, -44.6176169165428), year = c(2019, 2019,
2019, 2019, 2019, 2019), month = c(10, 9, 10, 10, 9, 10),
day = c(1L, 24L, 1L, 1L, 24L, 1L), dayofweek = c(3, 3, 3,
3, 3, 3), hour = c(6L, 14L, 14L, 9L, 16L, 13L), minute = c(43L,
10L, 48L, 21L, 38L, 35L), second = c(48, 45, 50, 41, 45,
16)), row.names = c(NA, -6L), .internal.selfref = <pointer: 0x56139e8dcfc0>, class = c("data.table",
"data.frame"))
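[Comments]:
-
Hi, could you add dput(head(trainset_eli)) instead of str(trainset_eli) so that we can try to reproduce your problem? Thanks.
-
Like this? I edited the main question.
-
That should work. I tried a sample dataset but could not reproduce your error.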
Tags: r random-forest r-caret sampling imbalanced-data