How can I vectorize a while loop?

【Question title】: How can I vectorize a while loop?
【Posted】: 2017-03-13 13:07:49
【Question】:

Consider the following code:

vectorize.me = function(history, row.idx=1, row.val=0, max=100){
  while (row.idx < max & row.val < max) {
    row.idx <- row.idx + 1
    entry <- paste('row.idx: ', row.idx, ' row.val: ', row.val)
    history[row.idx] <- entry
    print(entry)
  }
  return(history)
}

max <- 100
history <- vectorize.me(vector('list', max), max=max)

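I would like to do the following:

Instead of passing the row.idx and row.val arguments, pass a data frame to the vectorize.me function and have the function operate on the index and value of each row of the data frame.
Remove the while loop and simply stop iterating when the condition is met.
Return the history list once the iteration is finished.

How can I do this?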

df <- data.frame(sample(0:100,1000,rep=TRUE))
history <- vectorize.me(df, vector('list', max), max=max)

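Edit: This is a completely contrived example. I made it up because I wanted some sample code that passes a value on to the next "iteration" inside vectorized code (i.e. apply, lapply, mapply, etc.).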

【Comments】:

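In your code, you never change row.val inside the function, do you? Why use it in the condition of the while loop? As for the random value, is there a reason why the stop needs to be random (and, in this case, independent of the data)?
If the number of iterations in the loop is random, it cannot be vectorized. You could easily implement it in Rcpp.
The fact that you want to print the random value is the main stumbling block here. Your random value causes a stop with a fixed probability of 1/100, i.e. the number of rows allowed before the random stop follows a geometric distribution. The total number of rows is at most max. So you can simply sample from a geometric distribution and, if the result is greater than max, process at most max rows. The rest is then easy to write in vectorized form, because you are now writing a fixed number of rows (but you no longer have an explicit "random" variable).
@HolgerHoefling: This is a contrived example; in reality, row.val will be used in calculations inside my vectorized function and will be part of the while loop's condition.
"My task is to pass information from one iteration to the next": in the general case, this is not possible in a vectorized way. For specific cases there are vectorized functions such as cumsum, cumprod, cummax, ... Note that the *apply functions merely hide the loop and should not be considered "vectorized". They are more readable, but no faster than a well-written for loop. while loops are extremely rare in R code. If you really need one, you should usually switch to compiled code for performance.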
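
To make the last comment concrete, here is a minimal sketch of a case where the state carried between iterations has a vectorized equivalent. The data and the names `x`, `limit`, `running`, and `n_keep` are illustrative assumptions and do not come from the question. The sketch keeps elements while a running total stays within a limit, first with an explicit loop and then with cumsum and cumprod.

```r
# Illustrative data and threshold (assumptions, not from the question)
x <- c(3, 1, 4, 1, 5, 9, 2, 6)
limit <- 10

# Loop version: the running total is the state passed between iterations
total <- 0
kept_loop <- numeric(0)
for (value in x) {
  total <- total + value
  if (total > limit) break
  kept_loop <- c(kept_loop, value)
}

# Vectorized version: cumsum() computes every running total at once, and
# cumprod() over the logical test drops to 0 at the first failure
running <- cumsum(x)
n_keep <- sum(cumprod(running <= limit))
kept_vec <- x[seq_len(n_keep)]
```

This works only because the update rule (addition) happens to have a built-in cumulative form. An arbitrary update rule still needs a loop or compiled code, as the comment says.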

Tags: r dataframe while-loop vectorization

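【Solution 1】:

You can use cumprod on a series of zeros and ones to get a series that becomes 0 as soon as the first zero is encountered in the original series, and stays 0 from then on. This can be used to limit the length of history and the items to print.

Not as a function, just plain code: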

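df <- data.frame(ids = seq(1, 1000), val = sample(0:100, 1000, replace = TRUE))
valmax <- 80
# 1 while all values so far are below valmax, 0 from the first value >= valmax onward
pyn <- cumprod(df$val < valmax)
history <- paste("row.idx", df$ids[pyn > 0], "row.val", df$val[pyn > 0])
print(history)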

You will probably have to add some checks and conditions to turn this into good code, but in principle this solves the problem.
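
As one possible way of turning the snippet above into a function with the call shape the question asks for, here is a sketch. The function name `vectorize_me` and the arguments `id_col` and `val_col` are hypothetical, and the stopping rule (stop at the first value that is not below `valmax`, or when `history` is full) is an assumption modelled on the original loop. It also assumes the value column contains no NA.

```r
vectorize_me <- function(df, history, valmax = 100,
                         id_col = "ids", val_col = "val") {
  # 1 while every value so far is below valmax, 0 from the first failure on
  keep <- cumprod(df[[val_col]] < valmax)
  # Number of leading rows to process, capped by the space in history
  n <- min(sum(keep), length(history))
  idx <- seq_len(n)  # empty when n == 0, unlike 1:n
  history[idx] <- paste("row.idx", df[[id_col]][idx],
                        "row.val", df[[val_col]][idx])
  history
}

# Small deterministic example: the fourth value (90) triggers the stop
df <- data.frame(ids = 1:6, val = c(10, 20, 30, 90, 5, 15))
history <- vectorize_me(df, vector("list", 6), valmax = 80)
```

Using `seq_len(n)` rather than `1:n` matters when the very first value already fails the test, because `1:0` would yield `c(1, 0)` instead of an empty index.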


【Solution 2】:

How about the following:

vectorize.me <- function(df, var, history, max=100) {
  #-- Compute the max index in df to process (this is the "stopping condition" of the "loop")
  # Find the occurrence of the first index in df[,var] that is larger than 'max'
  # (note the fictitious FALSE and TRUE values added to the condition on df[,var]
  # in order to consider boundary conditions in one go)
  indmax <- min( which( c(FALSE, !df[,var] <= max, TRUE) ) ) - 2

  if (indmax > 0) { # There is at least one index to process
    # Limit indmax to the length of 'history'
    indmax <- min(indmax, length(history))
    ind <- 1:indmax
    entries <- paste('idx:', ind, 'val:', df[ind,var])
    history[ind] <- entries
    print(entries)
  }

  return(history)
}

#-- Test
# Test data
df <- data.frame(x=c(5, 8, 9, 8, 10, 4, 1, 3))

# Run tests
history <- vector('list', 8)
history <- vectorize.me(df, "x", history, max=8)   # first value larger than 'max' is found in a middle row
history <- vectorize.me(df, "x", history, max=4)   # first value in data frame is larger than 'max'
history <- vectorize.me(df, "x", history, max=max(df[,"x"]))      # all values in data frame are <= 'max'
history <- vectorize.me(df, "x", history, max=max(df[,"x"]) + 1)  # 'max' is larger than the maximum value in df[,var]
history <- vector('list', 6)
history <- vectorize.me(df, "x", history, max=max(df[,"x"]))      # 'history' is shorter than the maximum index of df to process

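Notes:

The var parameter specifies the name of the column in the data frame to which the max condition is applied.
The validity of the input arguments is not checked.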
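
Since the answer does not validate its inputs, a few checks along the following lines could be run before calling it. This is only a sketch: the helper name `check_inputs` is hypothetical, and which conditions should count as invalid is an assumption.

```r
check_inputs <- function(df, var, history, max) {
  # stopifnot() evaluates the conditions in order and stops at the first failure
  stopifnot(
    is.data.frame(df),
    is.character(var), length(var) == 1,
    var %in% names(df),
    is.numeric(df[[var]]),
    !anyNA(df[[var]]),
    is.list(history),
    is.numeric(max), length(max) == 1, !is.na(max)
  )
  invisible(TRUE)
}

df <- data.frame(x = c(5, 8, 9, 8, 10, 4, 1, 3))
check_inputs(df, "x", vector("list", 8), max = 8)   # passes silently
```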
