Filter out all rows of a df below a certain value of a particular column: answers

【Question title】: Filter out all rows of a df below a certain value of a particular column
【Posted】: 2021-11-15 11:20:21
【Question description】:

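I have a df with two columns, a timestamp (named date in the code below) and val. The df is sorted by time, newest first. I want to drop every row that comes before the maximum value of val (1.29 in this example) and keep the rows from the maximum onward. I have provided an example below: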

library(tidyverse)
library(lubridate)


# This is the entire df
df1 <- tibble::tribble(
  ~date, ~val,
  "2021-09-16 11:02:45", 1.21,
  "2021-09-16 11:02:45", 1.21,
  "2021-09-16 11:02:45", 1.21,
  "2021-09-16 11:02:45", 1.22,
  "2021-09-16 11:02:45", 1.22,
  "2021-09-16 11:02:45", 1.22,
  "2021-09-16 11:02:37", 1.22,
  "2021-09-16 10:59:29", 1.29,
  "2021-09-16 10:59:14", 1.29,
  "2021-09-16 10:59:14", 1.28,
  "2021-09-16 10:59:14", 1.28,
  "2021-09-16 10:58:17", 1.28,
  "2021-09-16 10:58:17", 1.28,
  "2021-09-16 10:58:05", 1.26,
  "2021-09-16 10:58:05", 1.26,
  "2021-09-16 10:58:05", 1.23,
  "2021-09-16 10:57:16", 1.23
  
  ) %>%
  mutate(date = ymd_hms(date))


# This is the outcome I am looking for
tibble::tribble(
  ~date, ~val,
  "2021-09-16 10:59:29", 1.29,
  "2021-09-16 10:59:14", 1.29,
  "2021-09-16 10:59:14", 1.28,
  "2021-09-16 10:59:14", 1.28,
  "2021-09-16 10:58:17", 1.28,
  "2021-09-16 10:58:17", 1.28,
  "2021-09-16 10:58:05", 1.26,
  "2021-09-16 10:58:05", 1.26,
  "2021-09-16 10:58:05", 1.23,
  "2021-09-16 10:57:16", 1.23
  
) %>%
  mutate(date = ymd_hms(date))

Any ideas on how to do this efficiently?

【Question comments】:

Tags: r lubridate dplyr

【Solution 1】:

If I understand correctly, this might solve your problem:

library(dplyr)

df1 %>% 
  filter(date <= first(date[val == max(val)]))

# A tibble: 10 x 2
   date                  val
   <dttm>              <dbl>
 1 2021-09-16 10:59:29  1.29
 2 2021-09-16 10:59:14  1.29
 3 2021-09-16 10:59:14  1.28
 4 2021-09-16 10:59:14  1.28
 5 2021-09-16 10:58:17  1.28
 6 2021-09-16 10:58:17  1.28
 7 2021-09-16 10:58:05  1.26
 8 2021-09-16 10:58:05  1.26
 9 2021-09-16 10:58:05  1.23
10 2021-09-16 10:57:16  1.23
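
One caveat worth checking: the filter above compares timestamps rather than row positions, so a row that shares its timestamp with the first maximum but sits before it would also be kept. That does not happen in df1, but the sketch below shows the difference on a small made-up tibble (toy and its values are hypothetical and not part of the question's data).

```r
library(dplyr)

# Hypothetical data: rows 2 and 3 share a timestamp, and the max (1.29) is row 3
toy <- tibble(
  date = as.POSIXct(
    c("2021-09-16 11:00:00", "2021-09-16 10:59:00",
      "2021-09-16 10:59:00", "2021-09-16 10:58:00"),
    tz = "UTC"
  ),
  val = c(1.22, 1.25, 1.29, 1.23)
)

# Timestamp-based: keeps every row at or before the time of the first max,
# which includes the 1.25 row that precedes the max in row order
by_date <- toy %>%
  filter(date <= first(date[val == max(val)]))

# Position-based: keeps rows strictly from the first max onward
by_position <- toy %>%
  filter(row_number() >= which.max(val))

nrow(by_date)      # 3
nrow(by_position)  # 2
```

If timestamps can repeat around the maximum, a position-based filter is probably the safer choice.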

【Comments】:

That did it! Which library does first come from?
@cephalopod dplyr

【Solution 2】:

Here are a couple of other dplyr options using match.

Using slice:

library(dplyr)
df1 %>% slice(match(max(val), val):n())

#   date                  val
#   <dttm>              <dbl>
# 1 2021-09-16 10:59:29  1.29
# 2 2021-09-16 10:59:14  1.29
# 3 2021-09-16 10:59:14  1.28
# 4 2021-09-16 10:59:14  1.28
# 5 2021-09-16 10:58:17  1.28
# 6 2021-09-16 10:58:17  1.28
# 7 2021-09-16 10:58:05  1.26
# 8 2021-09-16 10:58:05  1.26
# 9 2021-09-16 10:58:05  1.23
#10 2021-09-16 10:57:16  1.23

Using filter:

df1 %>% filter(row_number() >= match(max(val), val))

You can also do the same thing in base R.

df1[match(max(df1$val), df1$val):nrow(df1), ]

【Comments】:

【Solution 3】:

We can use which.max:

library(dplyr)
df1 %>% 
    filter(row_number() >= which.max(val))

Output:

# A tibble: 10 x 2
   date                  val
   <dttm>              <dbl>
 1 2021-09-16 10:59:29  1.29
 2 2021-09-16 10:59:14  1.29
 3 2021-09-16 10:59:14  1.28
 4 2021-09-16 10:59:14  1.28
 5 2021-09-16 10:58:17  1.28
 6 2021-09-16 10:58:17  1.28
 7 2021-09-16 10:58:05  1.26
 8 2021-09-16 10:58:05  1.26
 9 2021-09-16 10:58:05  1.23
10 2021-09-16 10:57:16  1.23
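
For comparison with the match-based options: which.max(val) and match(max(val), val) both return the position of the first maximum, so they agree on data like df1. They can diverge when val contains NA, as this minimal base R sketch suggests (val and val_na are made-up vectors, not the question's data).

```r
# No missing values: both approaches point at the first 1.29
val <- c(1.21, 1.29, 1.29, 1.23)
idx_match <- match(max(val), val)   # 2
idx_which <- which.max(val)         # 2

# With a missing value the two differ
val_na <- c(1.21, NA, 1.29, 1.23)
na_which <- which.max(val_na)             # 3: which.max discards NA
na_match <- match(max(val_na), val_na)    # 2: max() is NA, and match() finds the NA

# Passing na.rm = TRUE to max() brings match() back in line
na_match_rm <- match(max(val_na, na.rm = TRUE), val_na)  # 3
```

So if val might contain missing values, which.max, or max with na.rm = TRUE, looks like the safer way to locate the starting row.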

【Comments】:

【Solution 4】:

library(dplyr)

# Keep each row once the maximum has been seen at least once
df1 %>%
  filter(cumsum(val == max(val)) >= 1)

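Here we keep the rows where the cumulative number of times the maximum has been reached is at least 1.

I am assuming here that the data is already sorted by date.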
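
To make the cumsum step concrete, here is a small base R walk-through of the intermediate vectors. The val vector is a shortened, made-up stand-in for df1$val.

```r
# Illustrative values only, in the same newest-first order as df1
val <- c(1.21, 1.22, 1.29, 1.29, 1.28, 1.23)

is_max  <- val == max(val)   # FALSE FALSE  TRUE  TRUE FALSE FALSE
running <- cumsum(is_max)    # 0 0 1 2 2 2: how many maxima seen so far
keep    <- running >= 1      # FALSE FALSE  TRUE  TRUE  TRUE  TRUE

kept <- val[keep]            # 1.29 1.29 1.28 1.23
```

The running count is 0 until the first maximum appears and never drops back afterwards, which is why the condition selects everything from the first maximum to the end.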

【Comments】: