R: Row resampling loop speed improvement

【Question title】: R: Row resampling loop speed improvement
【Posted】: 2013-06-14 23:18:20
【Question】:

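I am subsampling rows from a data frame with columns c("x","y","density"), over a range of c("s_size","reps") values. reps = number of replicates; s_size = number of rows subsampled from the whole data frame.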

> head(data_xyz)
   x y density
1  6 1       0
2  7 1   17600
3  8 1   11200
4 12 1   14400
5 13 1       0
6 14 1    8000



    #Subsampling###################
    subsample_loop <- function(s_size, reps, int) {
      tm1 <- system.time({ #start timer
        subsample_bound = data.frame()
        #Perform Subsampling of the general
        for (s_size in seq(1,s_size,int)){
          for (reps in 1:reps) {
            subsample <- sample.df.rows(s_size, data_xyz)
            assign(paste("sample" ,"_","n", s_size, "_", "r", reps , sep=""), subsample)
            subsample_replicate <- subsample[,] #temporary variable
            subsample_replicate <- cbind(subsample, rep(s_size,(length(subsample_replicate[,1]))),
                                         rep(reps,(length(subsample_replicate[,1]))))
            subsample_bound <- rbind(subsample_bound, subsample_replicate)
          }
        }
      }) #end timer
      colnames(subsample_bound) <- c("x","y","density","s_size","reps")
      subsample_bound
    } #end function

Here's the function call:

    source("R/functions.R")
    subsample_data <- subsample_loop(s_size=206, reps=5, int=10)

Here is the row subsampling function:

# Samples a number of rows in a dataframe, outputs a dataframe of the same # of columns
# df Data Frame
# N number of samples to be taken
sample.df.rows <- function (N, df, ...) 
  { 
    df[sample(nrow(df), N, replace=FALSE,...), ] 
  }

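It's too slow. I've tried the apply functions a few times without any luck. I will be running about 1,000-10,000 replicates for each s_size from 1:250.

Let me know what you think! Thanks in advance.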

UPDATE EDIT: Sample data to sample from: https://www.dropbox.com/s/47mpo36xh7lck0t/density.csv

Joran's code wrapped in a function (in the sourced functions.R file):

foo <- function(i,j,data){
  res <- data[sample(nrow(data),i,replace = FALSE),]
  res$s_size <- i
  res$reps <- rep(j,i)
  res
}
resampling_custom <- function(dat, s_size, int, reps) {
  ss <- rep(seq(1,s_size,by = int),each = reps)
  id <- rep(seq_len(reps),times = s_size/int)
  out <- do.call(rbind,mapply(foo,i = ss,j = id,MoreArgs = list(data = dat),SIMPLIFY = FALSE))
}

Calling the function:

set.seed(2)
out <- resampling_custom(dat=retinal_xyz, s_size=206, int=5, reps=10)

This outputs the data, but unfortunately with this warning message:

Warning message:
In mapply(foo, i = ss, j = id, MoreArgs = list(data = dat), SIMPLIFY = FALSE) :
  longer argument not a multiple of length of shorter

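【Comments】:

If you've tried the *apply functions, why not show us that code? Also, do you have hardware that allows a parallel implementation?
I have parallel-capable hardware. How would you suggest I approach this?
Start with joran's answer, then see whether you want to optimize further using package parallel.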
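
To illustrate the last comment, here is a minimal sketch of how the subsampling could be spread across cores with package parallel. It is not code from the thread: the toy data frame, the `grid` of tasks, and the helper `draw` are hypothetical stand-ins, and `mclapply` only forks on Unix-alikes, so the sketch falls back to one core on Windows.

```r
library(parallel)

# Hypothetical toy data standing in for the asker's data frame.
set.seed(1)
toy <- data.frame(x = 1:300, y = 1, density = runif(300))

s_size <- 50; int <- 10; reps <- 4
sizes <- seq(1, s_size, by = int)

# One row per task: every replicate of every sample size.
grid <- expand.grid(reps = seq_len(reps), s_size = sizes)

# Draw one subsample and tag it with its sample size and replicate number.
draw <- function(k) {
  n <- grid$s_size[k]
  res <- toy[sample(nrow(toy), n, replace = FALSE), ]
  res$s_size <- n
  res$reps <- grid$reps[k]
  res
}

# Forking is unavailable on Windows, so use a single core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# L'Ecuyer-CMRG gives each forked worker its own reproducible RNG stream.
RNGkind("L'Ecuyer-CMRG")
set.seed(2)
pieces <- mclapply(seq_len(nrow(grid)), draw, mc.cores = cores)

# Bind once at the end instead of growing a data frame inside a loop.
out <- do.call(rbind, pieces)
```

Whether this pays off depends on how expensive each task is. For very small subsamples the overhead of forking may outweigh the gain, so it is worth timing both versions on the real data.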

Tags: r loops apply resampling

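【Solution 1】:

I've put very little thought into actually optimizing this; I just concentrated on doing something at least reasonable while matching your procedure.

Your big problem is that you are growing objects via rbind and cbind. Basically, any time you see someone write data.frame() or c() and then expand that object using rbind, cbind or c, you can be very sure that the resulting code is essentially the slowest possible way of doing what is being attempted.

This version is around 12-13 times faster, and I'm sure you could squeeze more out of it if you put some real thought into it: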

s_size <- 200
int <- 10
reps <- 30

ss <- rep(seq(1,s_size,by = int),each = reps)
id <- rep(seq_len(reps),times = s_size/int)

foo <- function(i,j,data){
    res <- data[sample(nrow(data),i,replace = FALSE),]
    res$s_size <- i
    res$reps <- rep(j,i)
    res
}

out <- do.call(rbind,mapply(foo,i = ss,j = id,MoreArgs = list(data = dat),SIMPLIFY = FALSE))

The best part about R is that this way is not only much faster, it is also much less code.
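
As a small, self-contained illustration of the point about growing objects (a sketch on hypothetical toy data, not the asker's file), the two functions below build the same kind of result. `grow` copies the accumulated data frame on every iteration, while `collect` gathers the pieces in a list and binds them once.

```r
# Hypothetical toy data standing in for the asker's data frame.
set.seed(1)
toy <- data.frame(x = 1:300, y = 1, density = runif(300))

# 20 replicates each of three sample sizes: 60 subsamples in total.
sizes <- rep(c(10, 50, 100), each = 20)

# Slow pattern: rbind() onto a growing data frame inside the loop.
grow <- function() {
  out <- data.frame()
  for (n in sizes) {
    out <- rbind(out, toy[sample(nrow(toy), n), ])
  }
  out
}

# Faster pattern: collect the pieces in a list, then bind a single time.
collect <- function() {
  pieces <- lapply(sizes, function(n) toy[sample(nrow(toy), n), ])
  do.call(rbind, pieces)
}

a <- grow()
b <- collect()

# Compare timings on your own machine; the gap widens as the result grows.
print(system.time(grow()))
print(system.time(collect()))
```

The rows differ between the two results because the sampling is random, but the shape and columns are the same.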

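【Comments】:

I get this warning: Warning message: In mapply(foo, i = ss_id, j = rep_id, MoreArgs = list(data = dat), : longer argument not a multiple of length of shorter. I want to make sure this method generates the correct data. Thanks a lot, joran.
@user2438134 I don't get any warning at all. Unless you provide a reproducible example, I won't be offering any more help.
@user2438134 This was easy to debug. You should learn how to use browser(). Just change times = s_size/int to times = ceiling(s_size/int).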
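
The length mismatch behind the warning can be checked directly with the values from the question (s_size = 206, int = 5, reps = 10). The sketch below is an illustration rather than code from the thread. The names `id_bad`, `id_ok` and `id_alt` are hypothetical, and the `length(sizes)` variant is an added alternative that avoids the division altogether.

```r
s_size <- 206; int <- 5; reps <- 10

sizes <- seq(1, s_size, by = int)   # 1, 6, ..., 206
ss <- rep(sizes, each = reps)       # one entry per (size, replicate) pair

# s_size / int is 41.2; rep() truncates a non-integer 'times' to 41,
# so this vector is one block of replicates short.
id_bad <- rep(seq_len(reps), times = s_size / int)

# The fix from the comment: round up so every sample size gets its replicates.
id_ok <- rep(seq_len(reps), times = ceiling(s_size / int))

# Alternative (an assumption, not from the thread): use the number of sizes.
id_alt <- rep(seq_len(reps), times = length(sizes))

print(c(ss = length(ss), bad = length(id_bad), ok = length(id_ok)))
```

With `id_bad`, mapply recycles the shorter vector and warns because the longer length is not a multiple of the shorter one. With either corrected version, `ss` and the id vector line up one-to-one.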