zoo::rollapply window over column values rather than rows

【Question title】: zoo::rollapply window over column values rather than rows
【Posted】: 2021-08-01 15:07:27
【Question description】:

dat = structure(list(index = c(10505L, 10506L, 10511L, 10539L, 10542L, 
10579L, 10642L, 11008L, 11012L, 13011L, 13110L, 13116L, 13118L, 
13156L, 13259L, 13273L, 13313L, 13365L, 13380L, 13382L, 13445L, 
13453L, 13482L, 13483L, 13494L, 13543L, 13550L, 14462L, 14464L, 
14564L, 14599L, 14604L, 14674L, 14719L, 14728L, 14775L, 14860L, 
14874L, 14930L, 14933L, 14975L, 15031L, 15089L, 15117L, 15179L, 
15211L, 15241L, 15245L, 15255L, 15260L, 15418L, 15585L, 15627L, 
15644L, 15774L, 15776L, 15777L, 15790L, 15791L, 15833L, 15849L, 
15850L, 15886L, 16042L, 16127L, 16140L, 16141L, 16142L, 16365L, 
16485L, 16489L, 16515L, 16542L, 16738L, 16834L, 16949L, 17272L, 
17462L, 17569L, 17571L, 17641L, 17654L, 17694L, 17695L, 17709L, 
17748L, 17836L, 17922L, 18643L, 20113L, 20131L, 28914L, 29318L, 
30524L, 30741L, 30912L, 30923L, 30998L, 46650L, 46698L), V2 = c(3L, 
3L, 3L, 2L, 2L, 2L, 2L, 1L, 0L, 3L, 2L, 2L, 2L, 0L, 1L, 1L, 0L, 
0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 
0L, 0L, 1L, 2L, 2L, 2L, 2L, 1L, 0L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 
0L, 0L, 0L, 2L, 3L, 5L, 3L, 0L, 0L, 3L, 1L, 0L, 3L, 0L, 0L, 2L, 
1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 2L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 
1L, 1L, 1L)), row.names = c(NA, -100L), class = "data.frame")

Suppose I want to compute a function across dat in a rolling window.

n_sites = function(x) {
    return(sum(x > 1))
}

zoo::rollapply(dat$V2, FUN=n_sites, width=100)

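However, instead of using the number of rows as the window size, I want to use the actual numeric values in the index column. So I want each window to span approximately 100 units of the index column. Given that rows 1 through 7 span roughly 100 units of index, the first window would include those rows. Is this possible?

Happy to use zoo, data.table, or a similar solution.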

【Question comments】:

Tags: r zoo rolling-computation rollapply

【Solution 1】:

You can also use the runner package, whose idx argument is exactly what you are looking for:

dat$n_sites <- runner::runner(x = dat$V2,
                              idx = dat$index,
                              k = 100,
                              f = n_sites)

head(dat, 10)
   index V2 n_sites
1  10505  3       1
2  10506  3       2
3  10511  3       3
4  10539  2       4
5  10542  2       5
6  10579  2       6
7  10642  2       2
8  11008  1       0
9  11012  0       0
10 13011  3       1

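【Comments】:

【Solution 2】:

The width in rollapply can be a vector, such that its i-th element is the width used for the i-th row. The question admits several interpretations. We could use the largest width spanning no more than 100 index units, the smallest width spanning at least 100 index units, or the width closest to 100 index units. The question seems to ask for the third, but the example width of 7 is inconsistent with that and suggests that the second interpretation may be wanted. We give all three widths at the end. Choose whichever you want. The question also says that the first window has 7 rows, which suggests that left alignment is wanted.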

library(zoo)

w <- w2 # see calcs of w1, w2 and w3 at end.  Use whichever you want.
transform(dat, roll = rollapply(V2, w, n_sites, fill = NA, align = "left"))

If n_sites is just a stand-in for the actual function then we can use the above, but if it is the actual function we can eliminate it and write it like this:

transform(dat, roll = rollapply(V2 > 1, w, sum, fill = NA, align = "left"))
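To make the per-row width idea concrete, here is a minimal base R sketch that emulates a left-aligned, variable-width window on a toy vector. The names x, w and roll_left are made up for illustration and are not part of zoo. The guarded zoo call at the end is only there for comparison and is expected to print the same values.

```r
# Toy data and per-row widths (both hypothetical, not from the question).
x <- 1:5
w <- c(2, 1, 3, 1, 1)

# Left alignment: the window for row i covers rows i, i + 1, ..., i + w[i] - 1.
roll_left <- function(x, w, f) {
  sapply(seq_along(x), function(i) {
    end <- i + w[i] - 1
    if (end > length(x)) NA else f(x[i:end])  # NA when the window runs off the end
  })
}

res <- roll_left(x, w, sum)
print(res)

# Optional comparison with zoo, if it is installed.
if (requireNamespace("zoo", quietly = TRUE)) {
  print(zoo::rollapply(x, w, sum, fill = NA, align = "left"))
}
```

Row 3 has width 3, so its window is rows 3 to 5; rows with width 1 simply return their own value.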

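Widths

Many variations of this are possible; we compute the three mentioned here.

The code below uses findInterval from base R. Recall that findInterval(x, vec), where x and vec are vectors and vec is non-decreasing, returns a vector of the same length as x whose i-th component is sum(x[i] >= vec), only computed more efficiently. That is, if x[i] occurs in vec, it gives the last position in vec that equals x[i]; if x[i] does not occur in vec, it gives the position of the largest value in vec that is less than x[i]. Note that it returns positions, i.e. indexes, not values of vec. For example, findInterval(c(20, 30), c(10, 30, 30, 30, 40)) returns c(1, 4), because 1 is the position of the largest value in vec less than 20, and 4 is the position of the last value in vec equal to 30.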
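The description of findInterval can be checked directly with a small base R sketch. It reuses the example vectors from the paragraph above; the variable names fi and brute are only for illustration.

```r
x   <- c(20, 30)
vec <- c(10, 30, 30, 30, 40)   # must be non-decreasing

fi    <- findInterval(x, vec)                   # fast, vectorised lookup
brute <- sapply(x, function(v) sum(v >= vec))   # slow but explicit definition

print(fi)
print(brute)
```

Both lines should print the same positions, which is the property the width calculations rely on.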

n <- nrow(dat)
index <- dat$index

# i1 is row number of last index no more than current index + 100
i1 <- findInterval(index + 100, index)
w1 <- i1 - 1:n + 1

# i2 is row number of first index at least equal to index + 100
i2 <- pmin(findInterval(index + 100 - 1, index) + 1, n)
w2 <- i2 - 1:n + 1
w2[1]
## [1] 7

# i is row number of index closest to current index + 100
i <- ifelse(index + 100 - index[i1] <= index[i2] - (index + 100), i1, i2)
w3 <- i - 1:n + 1
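As a sanity check on the second interpretation, the sketch below recomputes w2 with an explicit search and compares it with the findInterval version. It assumes only the first ten index values from the question, so the widths near the end of this subset differ from what the full data would give. The name w2_brute is made up for illustration.

```r
# First ten rows of dat$index from the question (a subset of the full data).
index <- c(10505L, 10506L, 10511L, 10539L, 10542L,
           10579L, 10642L, 11008L, 11012L, 13011L)
n <- length(index)

# Vectorised version, as in the answer.
i2 <- pmin(findInterval(index + 100 - 1, index) + 1, n)
w2 <- i2 - 1:n + 1

# Explicit version: first row whose index is at least index[i] + 100,
# falling back to the last row when no such row exists.
w2_brute <- sapply(1:n, function(i) {
  j <- which(index >= index[i] + 100)
  (if (length(j)) j[1] else n) - i + 1
})

print(w2)
print(w2_brute)
```

Note that the `+ 100 - 1` trick relies on index holding whole numbers; with fractional index values the explicit search would be the safer definition.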

【Comments】:

【Solution 3】:

You can use slider::slide_index instead of zoo::rollapply:

library(slider)
dat$n_sites <- slider::slide_index(.x = dat$V2,
                                   .i = dat$index,
                                   .f = n_sites,
                                   .before = 100)

head(dat,10)
   index V2 n_sites
1  10505  3       1
2  10506  3       2
3  10511  3       3
4  10539  2       4
5  10542  2       5
6  10579  2       6
7  10642  2       3
8  11008  1       0
9  11012  0       0
10 13011  3       1

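【Comments】:

Why is the output for row 7 equal to 3, when the window size is 100 and not 101?
@Anilgoyal, .before is how far back from the current index the window reaches, so 10642 - 100 goes down to 10542 and that row is included. Depending on what the OP expects, the argument value could be 99.
OK. Thanks for the clarification. :)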
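The endpoint question raised in the comments can be reproduced without any package. The base R sketch below counts values above 1 in a window that reaches back a given number of index units and includes both ends. It uses the ten rows shown in the output above; count_back is a hypothetical helper, not a slider or runner function.

```r
index <- c(10505L, 10506L, 10511L, 10539L, 10542L,
           10579L, 10642L, 11008L, 11012L, 13011L)
V2    <- c(3L, 3L, 3L, 2L, 2L, 2L, 2L, 1L, 0L, 3L)

# Count values above 1 among rows whose index lies in
# [index[i] - before, index[i]], both ends included.
count_back <- function(before) {
  sapply(seq_along(index), function(i) {
    in_window <- index >= index[i] - before & index <= index[i]
    sum(V2[in_window] > 1)
  })
}

back_100 <- count_back(100)
back_99  <- count_back(99)
print(back_100)
print(back_99)
```

Reaching back 100 units gives 3 for row 7, matching the slider output, because index 10542 sits exactly 100 units behind 10642. Reaching back 99 units gives 2 for that row, which agrees with the runner output in Solution 1.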