【Question title】: Can I filter a column based on the values of another column within the same tibble?
【Posted】: 2020-10-18 23:03:25
【Question description】:

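I'm working with large datasets containing numerous rows, and I'm trying to automate some of my analysis. I mainly use the tidyverse to reduce the need for additional packages, but I'm open to all suggestions. Consider the following tibble: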

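library(dplyr) # provides tibble() and %>%

id <- rep(1:3, each = 48) # 3 individuals
time <- rep(seq(0, 23.5, by = .5), 3) # half-hourly readings over 24 h
count <- runif(48*3) # random counts; call set.seed() first for reproducible values
df <- tibble(id, time, count)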

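I'm trying to filter a 2-hour interval around the time of the maximum count. I can identify the time of the maximum count with either of the following: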

df %>% 
  group_by(id) %>%
  filter(count == max(count))
# OR
df$time[which.max(df$count)] # Only for 1 id, though

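I'm struggling to filter the range around the time of the maximum count. With base R I can correctly identify the times as a vector, but I can't filter the whole rows. I also haven't yet handled potential negative or missing values.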

df$time[(which.max(df$count) - 2):(which.max(df$count) + 2)]

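I'm computing several different variables with mutate(), so I'd like to incorporate this filter() into the pipe. I have tried between(), match(), lead(), and lag(). which.max() has gotten me closest to filtering the correct duration. Below are a dead end and my closest attempt: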

# Listed max(count) in a new column; maybe use for matching?
df %>% 
  group_by(id) %>%
  mutate(peak = max(count))

# Partially selects time around max count, but not accurately.
df %>% 
  group_by(id) %>%
  filter(time == time[(which.max(count) - 1.5):(which.max(count)+1.5)])

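I've been coding for about a year, but I think I'm missing some basic function that I don't know about. Similar questions have been posted for SQL, but I haven't found any for R or the tidyverse. I'd appreciate any help you can provide. Let me know if any clarification is needed.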

【Question discussion】:

@akrun Small mistake, but I fixed it!

Tags: r filter max tidyverse

【Solution 1】:

We can use slice() after the grouping step:

library(dplyr)
df %>%
    group_by(id) %>%
    slice({
        i1 <- which.max(count) # row position of the peak within each id
        (i1 - 2):(i1 + 2)      # two rows either side of the peak
    })
# A tibble: 15 x 3
# Groups:   id [3]
#      id  time  count
#   <int> <dbl>  <dbl>
# 1     1   6.5 0.447 
# 2     1   7   0.785 
# 3     1   7.5 0.984 
# 4     1   8   0.133 
# 5     1   8.5 0.433 
# 6     2  14.5 0.266 
# 7     2  15   0.501 
# 8     2  15.5 0.965 
# 9     2  16   0.214 
#10     2  16.5 0.492 
#11     3  14   0.894 
#12     3  14.5 0.0388
#13     3  15   0.947 
#14     3  15.5 0.776 
#15     3  16   0.293

Or it can be written more compactly:

df %>%
    group_by(id) %>%
    slice(which.max(count) + (-2:2))
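
One caveat with the index arithmetic above: when a group's peak sits in its first two rows, `which.max(count) - 2` becomes zero or negative, and slice() treats negative indices as rows to drop rather than rows to keep. The sketch below is one possible way to guard against that by clamping the window to the group's first and last row. The helper name `peak_window`, its `k` argument, and the `toy` data are hypothetical and used only for illustration; the column names `id` and `count` are assumed to match the question's tibble.

```r
library(dplyr)

# Hypothetical helper: keep rows within `k` positions of each id's peak count,
# clamping the window so it never runs past the first or last row of the group.
peak_window <- function(data, k = 2) {
  data %>%
    group_by(id) %>%
    slice({
      i1 <- which.max(count)            # position of the peak within the group
      max(1, i1 - k):min(n(), i1 + k)   # clamped window around the peak
    }) %>%
    ungroup()
}

# Toy data (an assumption for illustration): the peak of id 1 is its first row
toy <- tibble(
  id    = rep(1:2, each = 6),
  time  = rep(seq(0, 2.5, by = 0.5), 2),
  count = c(9, 1, 2, 3, 4, 5,
            1, 2, 9, 3, 4, 5)
)

peak_window(toy)
```

With this toy data, id 1 keeps only the rows that exist after its peak, so the window is shorter than five rows instead of failing or dropping the wrong rows.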

【Discussion】:

【Solution 2】:

An alternative solution using row_number():

library(dplyr)

df %>%
  group_by(id) %>%
  filter(abs(row_number() - which.max(count)) <= 2)

which gives

# A tibble: 15 x 3
# Groups:   id [3]
      id  time  count
   <int> <dbl>  <dbl>
 1     1   5   0.574 
 2     1   5.5 0.763 
 3     1   6   0.985 
 4     1   6.5 0.701 
 5     1   7   0.281 
 6     2  21   0.0563
 7     2  21.5 0.274 
 8     2  22   0.978 
 9     2  22.5 0.560 
10     2  23   0.726 
11     3  12   0.889 
12     3  12.5 0.767 
13     3  13   0.999 
14     3  13.5 0.157 
15     3  14   0.896
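
Because the question is phrased in hours rather than rows, a variant that compares the `time` column directly may also be worth considering. The sketch below assumes the "2-hour interval" means one hour either side of the peak, which matches the two-row offset used above for half-hourly data; the `toy` tibble and the `window` name are hypothetical. Filtering on `time` also avoids out-of-range indices at the start or end of a group and does not depend on evenly spaced readings.

```r
library(dplyr)

# Toy data (an assumption for illustration): half-hourly readings for two ids
toy <- tibble(
  id    = rep(1:2, each = 6),
  time  = rep(seq(0, 2.5, by = 0.5), 2),
  count = c(9, 1, 2, 3, 4, 5,
            1, 2, 9, 3, 4, 5)
)

window <- toy %>%
  group_by(id) %>%
  # keep rows whose time is within 1 hour of the time of the group's peak count
  filter(abs(time - time[which.max(count)]) <= 1) %>%
  ungroup()

window
```

Note that which.max() skips missing values, so an `NA` in `count` does not by itself break the peak lookup.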

【Discussion】: