Using apply and rbind to build an R data.frame

【Question title】: Using apply and rbind to build an R data.frame
【Posted】: 2011-09-17 11:03:52
【Question】:

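I have an existing data.frame that contains some initial values. What I want to do is create another data.frame that holds 10 randomly sampled rows for each row of the first data.frame. I am also trying to do this the R way, so I would like to avoid iteration.

So far I have managed to apply a function to each row of a table that generates a single value, but I am not sure how to extend this so that each application produces 10 rows, and then rbind the results back together.

Here is my progress so far:

Sample data: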

   starts <- structure(list(instance = structure(21:26, .Label = c("big_1", 
   "big_10", "big_11", "big_12", "big_13", "big_14", "big_15", "big_16", 
   "big_17", "big_18", "big_19", "big_2", "big_20", "big_3", "big_4", 
   "big_5", "big_6", "big_7", "big_8", "big_9", "competition01", 
   "competition02", "competition03", "competition04", "competition05", 
   "competition06", "competition07", "competition08", "competition09", 
   "competition10", "competition11", "competition12", "competition13", 
   "competition14", "competition15", "competition16", "competition17", 
   "competition18", "competition19", "competition20", "med_1", "med_10", 
   "med_11", "med_12", "med_13", "med_14", "med_15", "med_16", "med_17", 
   "med_18", "med_19", "med_2", "med_20", "med_3", "med_4", "med_5", 
   "med_6", "med_7", "med_8", "med_9", "small_1", "small_10", "small_11", 
   "small_12", "small_13", "small_14", "small_15", "small_16", "small_17", 
   "small_18", "small_19", "small_2", "small_20", "small_3", "small_4", 
   "small_5", "small_6", "small_7", "small_8", "small_9"), class = "factor"), 
   event.clashes = c(674L, 626L, 604L, 1036L, 991L, 929L), overlaps = c(0L, 
   0L, 0L, 0L, 0L, 0L), room.valid = c(324L, 320L, 268L, 299L, 
   294L, 220L), final.timeslot = c(0L, 0L, 0L, 0L, 0L, 0L), 
   three.in.a.row = c(246L, 253L, 259L, 389L, 365L, 430L), single.event = c(97L, 
   120L, 97L, 191L, 150L, 138L)), .Names = c("instance", "event.clashes", 
   "overlaps", "room.valid", "final.timeslot", "three.in.a.row", 
   "single.event"), row.names = c(NA, 6L), class = "data.frame")

Code:

   library(reshape)
   m.starts <- melt(starts)

   df <- data.frame()

   gen.data <- function(x){
       inst <- x[1]
       constr <- x[2]
       v <- as.integer(x[3])
       val <- as.integer(rnorm(1, max(0, v), v / 2))
       # Should probably return a data.frame here
       print(paste(inst, constr, val))
   }

   apply(m.starts, 1, gen.data)

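【Comments】:

What is your question? Your gen.data function should return a value. At the moment it prints a value but does not return anything useful.
I want the gen.data function to return a data.frame populated with 10 rows. Then I want the outer apply to join all of these 10-row chunks into one data.frame. The print is just there as a placeholder.
Could you provide an example of the desired output (i.e. what you want to end up with)?

Tags: r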

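【Solution 1】:

I'm not clear on exactly what you are doing, but the following changes to your gen.data function seem to get what you are after. Specifically, I'm not clear what you are doing with val, which seems to just draw a random number whose mean is the value column for that row and whose standard deviation is that value divided by two. Is that what you want? I added a new parameter to your function that sets how many rows to generate: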

gen.data <- function(x, nreps = 10){
    inst <- x[1]
    constr <- x[2]
    v <- as.integer(x[3])
    val <- as.integer(rnorm(nreps, max(0, v), v / 2))

    out <- data.frame(inst = rep(inst, nreps),
                      constr = rep(constr, nreps),
                      val = val)

    return(out)
}

Then, in use:

do.call("rbind", apply(m.starts, 1, gen.data))

Result:

             inst         constr  val
1   competition01  event.clashes  876
2   competition01  event.clashes  714
3   competition01  event.clashes  912
4   competition01  event.clashes  -46
5   competition01  event.clashes  369
....
....
357 competition06   single.event  149
358 competition06   single.event  248
359 competition06   single.event  128
360 competition06   single.event  168
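
A side note on the pattern above: apply() first converts the data.frame to a matrix, which is a character matrix here because of the non-numeric columns. That is why gen.data has to call as.integer on x[3]. The following is a minimal sketch of an alternative, not the answerer's code. It assumes a small hypothetical table `toy` standing in for the melted data and a hypothetical helper `gen.rows`, and it loops over row indices with lapply so each column keeps its type:

```r
# Hypothetical toy data standing in for the melted m.starts
toy <- data.frame(instance = c("a", "b"),
                  variable = c("x", "y"),
                  value    = c(100L, 40L),
                  stringsAsFactors = FALSE)

# Hypothetical helper: build nreps simulated rows for row i of df
gen.rows <- function(i, df, nreps = 10) {
    v <- df$value[i]
    data.frame(inst   = df$instance[i],
               constr = df$variable[i],
               val    = as.integer(rnorm(nreps, mean = max(0, v), sd = v / 2)))
}

set.seed(1)  # only to make the sketch reproducible
res <- do.call(rbind, lapply(seq_len(nrow(toy)), gen.rows, df = toy))
dim(res)  # 20 rows (2 source rows x 10) and 3 columns
```

The do.call(rbind, ...) step is the same as in the answer; only the way each row is passed to the function differs.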

【Discussion】:

Thanks, that is exactly what I wanted.

【Solution 2】:

There is no need for apply or rbind. A simple vector subset is all you need:

samples <- sample(1:nrow(starts), nrow(starts)*10, replace=TRUE)
starts[samples, 1:3]

First 5 rows of the result:

> head(starts[samples, 1:3], 5)

         instance event.clashes overlaps
2   competition02           626        0
5   competition05           991        0
6   competition06           929        0
4   competition04          1036        0
2.1 competition02           626        0

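【Discussion】:

I don't think this answer (currently) accounts for the val column in the OP's function above, which is admittedly a bit ambiguous. Also note that your answer yields 60 rows of data, whereas using the melted data.frame above should yield 360 if he wants 10 rows for each row of the (melted) data.frame.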
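
The row counts in this comment can be checked with a small sketch. It assumes a hypothetical two-row table `toy` with one id column and two measure columns, and it builds the long form by hand in base R rather than with melt. The names `toy`, `long` and `expanded` are illustrative only and do not come from the thread.

```r
# Hypothetical wide table: 2 rows, 2 measure columns
toy <- data.frame(instance = c("a", "b"),
                  clashes  = c(10L, 20L),
                  overlaps = c(0L, 5L),
                  stringsAsFactors = FALSE)

# Resampling the wide table gives 10 rows per original row
idx <- sample(seq_len(nrow(toy)), nrow(toy) * 10, replace = TRUE)
nrow(toy[idx, ])  # 20

# Long form: one row per (instance, variable) pair, built by hand
long <- data.frame(instance = rep(toy$instance, times = 2),
                   variable = rep(c("clashes", "overlaps"), each = nrow(toy)),
                   value    = c(toy$clashes, toy$overlaps),
                   stringsAsFactors = FALSE)

# Repeating each long row 10 times gives 10 rows per melted row
expanded <- long[rep(seq_len(nrow(long)), each = 10), ]
nrow(expanded)  # 40
```

With the question's data the same arithmetic gives 6 x 10 = 60 rows for the wide table versus 36 x 10 = 360 rows for the melted one, since there are 6 rows and 6 measure columns.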

【Solution 3】:

You can combine the ideas from Andrie's and Chase's solutions as follows:

# start.m is the melted data from the question (named m.starts there)
start.m <- m.starts

# Repeat each row ten times
start.m1 <- start.m[rep(1:nrow(start.m), each = 10), ]

# Create an extended vector to use to define
# the means/sd
m <- rep(start.m$value, each = 10)

# Clamp negative values to zero,
# although none were in your data
m[m <= 0] <- 0

# Replace value with rnorm values
start.m1$value <- rnorm(nrow(start.m1), mean = m, sd = m / 2)

This produces something like the following:

> head(start.m1)
         instance      variable     value
1   competition01 event.clashes 1098.0220
1.1 competition01 event.clashes 1208.4304
1.2 competition01 event.clashes  883.7976
1.3 competition01 event.clashes  365.1396
1.4 competition01 event.clashes  862.3113
1.5 competition01 event.clashes 1352.7085

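I'm using Andrie's suggestion of subset indexing to expand the data frame, and then Chase's interpretation of your question, namely that you seem to want the values to actually be generated via rnorm rather than by resampling the original rows themselves. The key here is that rnorm is vectorized.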
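
To illustrate what "vectorized" means here: rnorm accepts vectors for mean and sd and pairs them element by element with the draws. The following is a minimal sketch with made-up means (the `means` and `draws` names are illustrative, not from the answer). Setting sd to 0 makes each draw equal its own mean, which makes the pairing easy to see:

```r
means <- rep(c(100, 40), each = 3)   # hypothetical means, each repeated 3 times

# With sd = 0 every draw collapses to its own mean
rnorm(length(means), mean = means, sd = 0)
# [1] 100 100 100  40  40  40

# With sd = means / 2, draw i uses means[i] and means[i] / 2
set.seed(42)  # only for reproducibility
draws <- rnorm(length(means), mean = means, sd = means / 2)
length(draws)  # 6
```

This is why a single rnorm call can replace the per-row function: one draw is produced per expanded row, each with its own mean and standard deviation.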

【Discussion】: