Faster code in R: answers

【Question title】: Faster code in R
【Posted】: 2012-03-24 13:38:42
【Question description】:

FYI: I have edited this heavily since my first version. The simulation has gone from 14 hours down to 14 minutes.

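I am new to programming, but I have written a simulation that tracks asexual replication in an organism and quantifies the difference in chromosome number between parent and daughter organisms. The simulation runs very slowly: it takes about 6 hours to complete. I would like to know the best way to make it run faster.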

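These digital organisms have x chromosomes. Unlike in most organisms, the chromosomes are all independent of one another, so each has an equal chance of being passed to either daughter organism.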

In this case, the number of chromosomes a daughter cell receives follows a binomial distribution with probability 0.5.
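
As a minimal sketch of a single division under that model (the names `parent`, `daughter1` and `daughter2` and the seed are illustrative assumptions, not part of the simulation code):

```r
set.seed(1)                      # arbitrary seed, only so this sketch is reproducible
parent    <- c(100, 100, 100)    # hypothetical copy numbers for three organisms
dup       <- parent * 2          # chromosomes are duplicated before division

# Each duplicated chromosome goes to daughter 1 with probability 0.5
daughter1 <- rbinom(n = length(dup), size = dup, prob = 0.5)
daughter2 <- dup - daughter1     # nothing is lost: daughter 2 gets the remainder

cbind(daughter1, daughter2)
```

Each row sums to the duplicated count, so the total copy number is conserved across a division even though the split itself is random.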

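The function sim_repo takes a matrix of digital organisms with known chromosome numbers and replicates them for 12 generations. It duplicates the chromosomes, then uses the rbinom function to draw a random number, which is assigned to one daughter cell. Since no chromosomes are lost during asexual reproduction, the other daughter cell receives the remaining chromosomes. This is repeated for G generations. Finally, one value is sampled from each row of the matrix.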

 sim_repo = function( x1, G=12, k=1, t=25, h=1000 ) {

            # x1 is the list of copy numbers for a somatic chromosome
            # G is the number of generations, default is 12
            # k is the transfer size, default is 1
            # t is the number of transfers, default is 25
            # h is the number of times to replicate, default is 1000

            dup <- x1 * 2 # duplicate the initial somatic chromosome copy number for replication
            pop <- 1 # set generation time
            set.seed(11)
            z <- matrix(rbinom(n=rep(1,length(dup)),size = as.vector(dup),prob = 0.5),nrow = nrow(dup)) # amount of somatic chromosome is distributed to one of the daughter cells
            z1 <- dup - z # as no somatic chromosomes are lost, the other daughter cells receives the remainder somatic chromosomes
            x1 <- cbind(z, z1) # put both in a matrix

            for ( pop in 1:G ) { # this loop does the replication for each cell in each generation
                pop <- 1 + pop # number of generations.  This is a count for the for loop
                dup <- x1 * 2 # double the somatic chromosomes for replication
                set.seed(11)
                z <- matrix(rbinom(n=rep(1,length(dup)),size = as.vector(dup),prob = 0.5),nrow = nrow(dup)) # amount of somatic chromosomes distributed to one of the daughter cells
                z1 <- dup - z # as no somatic chromosomes are lost, the other daughter cells receives the remainder somatic chromosomes
                x1 <- cbind(z, z1) # put both in a matrix
                }

            # the following for loop randomly selects one cell in the population that was created
            # the output is a matrix of 1 column
            x1 <- matrix(apply(x1, 1, sample, size=k), ncol=1)
            x1
    }

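In my research I am interested in how the chromosome variance changes between the initial ancestral organisms and the final time point of this simulation. The following function represents transferring cells to a new growth environment. It takes the output from the function sim_repo and uses it to generate more generations. It then computes the variance across rows in the first and last columns of the matrix and takes the difference between the two.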

    # The following function is mostly the same as I talked about in the description.
    # The only difference is I changed some aspects to take into account I am using
    # matrices and not lists.
    # The function outputs the difference between the initial variance component between
    # 'cell lines' with the final variance after t number of transfers

sim_exp = function( x1, G=12, k=1, t=25, h=1000 ) {

    xn <- matrix(NA, nrow(x1), t)  
    x <- x1
    xn[,1] <- x1
    for ( l in 2:t ) {
        x <- sim_repo( x, G, k, t, h )
        xn[, l] <- x
    }

    colvar <- matrix(apply(xn,2,var),ncol=ncol(xn))
    ivar <- colvar[,1]
    fvar <- colvar[,ncol(xn)]
    deltavar <- fvar - ivar
    deltavar
}

I need to repeat this simulation h times. So I created the following function, which calls sim_exp h times.

sim_1000 = function( x1, G=12, k=1, t=25, h=1000 ) {
    xn <- numeric(h)             # preallocate one numeric result per replicate
    for ( l in seq_len(h) ) {    # run all h replicates (2:h skipped the first one)
        x <- sim_exp( x1, G, k, t, h )
        xn[l] <- x
    }
    xn
}

When I call the sim_exp function with 6 values, it takes about 52 seconds to complete.

 x1 <- matrix(data=c(100,100,100,100,100,100),ncol=1)
 system.time(sim_1000(x1,h=1))
   user  system elapsed 
  1.280   0.105   1.369

If I could get it to run faster, I could complete more of these simulations and apply selection models within them.

My input looks like x1 below, a matrix with each ancestral organism in its own row:

x1 <- matrix(data=c(100,100,100,100,100,100),ncol=1) # a matrix of 6 organisms

When I run:

a <- sim_repo(x1, G=12, k=1)

My expected output is:

 a
     [,1]
[1,]  137
[2,]   82
[3,]   89
[4,]  135
[5,]   89
[6,]  109

 system.time(sim_repo(x1))
   user  system elapsed 
  1.969   0.059   2.010

When I call the sim_exp function, it calls sim_repo once per transfer (t - 1 times) and outputs:

 b
[1] 18805.47

When I call the sim_1000 function I would normally set h to 1000, but here I set it to 2. So here sim_1000 calls sim_exp and replicates it 2 times.

c <- sim_1000(x1, G=12, k=1, t=25, h=2)
c
[1] 18805.47 18805.47
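
The two replicates above are identical, which is what one would expect whenever the seed is reset inside the code being replicated, as sim_repo does with set.seed(11): every call to sim_exp then replays the same random draws. A minimal sketch of the effect, using two hypothetical toy functions (`draw_fixed` and `draw_free` are illustrations, not part of the simulation):

```r
# Resets the seed on every call, like set.seed(11) inside sim_repo
draw_fixed <- function() {
  set.seed(11)
  rbinom(1, size = 200, prob = 0.5)
}

# Leaves the random number stream alone
draw_free <- function() rbinom(1, size = 200, prob = 0.5)

fixed <- replicate(5, draw_fixed())   # the same draw, five times over

set.seed(11)                          # seed once, outside the replicated code
free  <- replicate(5, draw_free())    # five separate draws from one stream

length(unique(fixed))                 # number of distinct values among the fixed draws
free
```

If independent replicates are wanted, one option is to call set.seed() once before sim_1000 rather than inside sim_repo; the run stays reproducible while each replicate gets its own draws.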

【Comments on the question】:

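At first glance, I would bet the biggest reason your code is slow is that you are not preallocating your objects: in particular, the cbind() inside sim_exp() and the corresponding growing call inside sim_1000() must be expensive.
@flodel, thanks for the tip. Do you have an example of how to preallocate in my code? For example, in sim_exp(), would I create a matrix with the same number of columns and rows as the final output, but filled with NULL values?
There is a chapter in The R Inferno devoted to this: burns-stat.com/pages/Tutor/R_inferno.pdf
Yes, @Kev. Outside the loop: xn <- matrix(NA, nrow(x1), t), and inside the loop: xn[, l] <- x. Throughout your code, look for similar cases where an object grows through successive calls to c() or cbind(), and apply the same idea. Hopefully you will see a big speedup.
@Kev - flodel has put you on the right track with preallocation. I am also not sure you need the calls to apply(...,c(1,2),...). It looks like you could simply multiply the values. It would be easier to help if you provided some sample input data and the expected output. That would let people develop alternatives and double-check that the output is still correct.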
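
The preallocation idea from these comments can be sketched as follows. The sizes and the helper `step_result` are assumptions made up for the illustration; only the pattern (allocate once, then fill by index) comes from the comments above.

```r
n_steps <- 2000   # hypothetical number of loop iterations
n_rows  <- 6      # hypothetical number of rows produced per iteration

# Stand-in for whatever each iteration computes
step_result <- function(i) rep(as.numeric(i), n_rows)

# Growing: every cbind() copies everything accumulated so far
grow <- function() {
  out <- NULL
  for (i in seq_len(n_steps)) out <- cbind(out, step_result(i))
  out
}

# Preallocating: allocate the full matrix once, then fill one column per iteration
prealloc <- function() {
  out <- matrix(NA_real_, nrow = n_rows, ncol = n_steps)
  for (i in seq_len(n_steps)) out[, i] <- step_result(i)
  out
}

system.time(a <- grow())
system.time(b <- prealloc())
all(a == b)       # both approaches should hold the same values
```

The gap between the two timings typically widens as n_steps grows, because the growing version recopies the whole result on every iteration.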

Tags: r for-loop simulation vectorization

【Solution 1】:

As others have said in the comments, if we look only at the function sim_repo and replace the line:

dup <- apply(x1, c(1,2),"*",2)

with

dup <- x1 * 2

the line

z <- apply(dup,c(1,2),rbinom,n=1,prob=0.5)

with

z <- matrix(rbinom(n=rep(1,length(dup)),size = as.vector(dup),prob = 0.5),nrow = nrow(dup))

and the inner for loop with

x1 <- matrix(apply(x1,1,sample,size = 1), ncol=1)

I get a big speed boost:

system.time(sim_exp(x1))
   user  system elapsed 
  0.655   0.017   0.686 
> system.time(sim_expOld(x1))
   user  system elapsed 
 21.445   0.128  21.530

I verified that it is doing the same thing:

set.seed(123)
out1 <- sim_exp(x1)

set.seed(123)
out2 <- sim_expOld(x1)

all.equal(out1,out2)
> TRUE
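
The same kind of check can be run on the rbinom replacement in isolation. This is a minimal sketch under stated assumptions: the matrix `dup`, its dimensions and the seed are invented for the example, and `n = length(dup)` is used as a shorthand for the answer's `n=rep(1,length(dup))`, which requests the same number of draws.

```r
dup <- matrix(200, nrow = 6, ncol = 64)   # hypothetical population: 6 lines, 64 cells each

# Element-wise version: one R-level call to rbinom() per cell
set.seed(42)
z_apply <- apply(dup, c(1, 2), rbinom, n = 1, prob = 0.5)

# Vectorised version: a single call that draws one value per cell
set.seed(42)
z_vec <- matrix(rbinom(n = length(dup), size = as.vector(dup), prob = 0.5),
                nrow = nrow(dup))

all.equal(z_apply, z_vec)                 # compare the two results under the same seed
```

Both versions consume the random number stream in the same column-major order, which is why the outputs can be compared directly under a shared seed; the vectorised call simply avoids the per-cell function-call overhead.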

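This is without even getting into preallocation, which, given the way you have structured your code, might actually be difficult without completely redesigning things.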

And this is before even starting to look at whether you really need all three functions...

【Comments】:

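I need to use your computer. I am still getting: system.time(sim_exp(x1, G=12, k=1, t=25, h=1 )) user system elapsed 23.598 0.767 24.390
@Kev My computer isn't fast. It is a year-old MacBook Air, with the slower of the two processor options. Most likely you just haven't modified the code quite correctly.
I wanted to say thank you for your help. My simulation now runs 1000 replicates in about 13 minutes. It is a good lesson that just because something works does not mean it is efficient. I can now run many more simulations.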