How do I group by then filter without losing rows from data in R? (Answers)

【Question title】: How do I group by then filter without losing rows from data in R?
【Posted】: 2021-04-08 18:54:01
【Question】:

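I have a sample tbl_df that I'm trying to find a solution for. At a high level, I want to compare each student's 2021 result, taken for the type in which they have the highest count, with their most recent result for that same type from a year before 2021. I'd like to use dplyr::filter, but I can't figure out how to filter correctly while keeping the rows of the tbl_df that I need for my output.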

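In short:

Group by full_name, then select the 2021 row whose type has the max value in the count column
Select the next most recent year for that same type

As you can see, because Eric Collins has no rows for 2020, his most recent earlier year is 2019, whereas the others have values for 2020.

Example: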

sample_df <- tibble::tribble(
                  ~year,           ~full_name,        ~type, ~count, ~avg_score, ~max,
                  2021L,       "Jason Valdez",   "Sciences",   "33",         98,   99,
                  2021L,       "Jason Valdez", "Humanities",   "59",         97,   99,
                  2020L,       "Jason Valdez",   "Sciences",  "164",         97,   99,
                  2020L,       "Jason Valdez", "Humanities",  "231",         96,   98,
                  2019L,       "Jason Valdez",   "Sciences",  "933",         96,   99,
                  2019L,       "Jason Valdez", "Humanities",  "853",         95,   99,
                  2021L,       "Eric Collins",   "Sciences",   "21",         92,   93,
                  2019L,       "Eric Collins",   "Sciences",  "831",         94,   97,
                  2019L,       "Eric Collins", "Humanities",   "10",         94,   97,
                  2021L, "Sebastian Goldberg",   "Sciences",   "41",         93,   96,
                  2020L, "Sebastian Goldberg",   "Sciences",  "476",         94,   98,
                  2020L, "Sebastian Goldberg", "Humanities",   "81",         93,   96,
                  2019L, "Sebastian Goldberg",   "Sciences", "1418",         95,   98
                  )

output_df <- tibble::tribble(
  ~year,           ~full_name,        ~type, ~count, ~avg_score, ~max,
  2021L,       "Jason Valdez", "Humanities",    59L,        97L,  99L,
  2020L,       "Jason Valdez", "Humanities",   231L,        96L,  98L,
  2021L,       "Eric Collins",   "Sciences",    21L,        92L,  93L,
  2019L,       "Eric Collins",   "Sciences",   831L,        94L,  97L,
  2021L, "Sebastian Goldberg",   "Sciences",    41L,        93L,  96L,
  2020L, "Sebastian Goldberg",   "Sciences",   476L,        94L,  98L
  )

【Comments】:

Tags: r dplyr

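【Solution 1】:

Group by 'full_name', filter 'type' down to the 'type' that corresponds to the max 'count' value among the rows where 'year' is 2021, then slice the top 2 rows by 'year'.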

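library(dplyr)

sample_df %>%
   group_by(full_name) %>%
   # keep the type with the highest 2021 count; which.max() coerces the character count column to numeric
   filter(type %in% type[year == 2021][which.max(count[year == 2021])]) %>%
   # for that type, keep the two most recent years (2021 and the latest earlier year)
   slice_max(order_by = year, n = 2) %>%
   ungroup() %>%
   # restore the order in which the names first appear in sample_df
   arrange(factor(full_name, levels = unique(sample_df$full_name)))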

Output:

# A tibble: 6 x 6
#   year full_name          type       count avg_score   max
#  <int> <chr>              <chr>      <chr>     <dbl> <dbl>
#1  2021 Jason Valdez       Humanities 59           97    99
#2  2020 Jason Valdez       Humanities 231          96    98
#3  2021 Eric Collins       Sciences   21           92    93
#4  2019 Eric Collins       Sciences   831          94    97
#5  2021 Sebastian Goldberg Sciences   41           93    96
#6  2020 Sebastian Goldberg Sciences   476          94    98
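
In sample_df the count column is character ("33", "59", ...), so picking the maximum depends on implicit coercion. A slightly more defensive sketch is shown below. It assumes that every count parses as an integer and that every student has at least one 2021 row. The tibble sample_small and the helper column top_type_2021 are names introduced here for illustration only; they are not part of the original answer.

```r
library(dplyr)

# Hypothetical stand-in for sample_df: same key columns, fewer rows.
# count is character here too, and the 2021 rows are deliberately not first.
sample_small <- tibble::tribble(
  ~year, ~full_name,     ~type,        ~count,
  2020L, "Jason Valdez", "Sciences",   "164",
  2020L, "Jason Valdez", "Humanities", "231",
  2021L, "Jason Valdez", "Sciences",   "33",
  2021L, "Jason Valdez", "Humanities", "59",
  2019L, "Eric Collins", "Sciences",   "831",
  2019L, "Eric Collins", "Humanities", "10",
  2021L, "Eric Collins", "Sciences",   "21"
)

result <- sample_small %>%
  mutate(count = as.integer(count)) %>%   # make the comparison explicitly numeric
  group_by(full_name) %>%
  # type with the highest 2021 count, looked up among the 2021 rows only
  mutate(top_type_2021 = type[year == 2021][which.max(count[year == 2021])]) %>%
  filter(type == top_type_2021) %>%
  slice_max(order_by = year, n = 2) %>%   # 2021 plus the latest earlier year
  ungroup() %>%
  select(-top_type_2021)

result
```

Storing the chosen type in a temporary column makes the intermediate result easy to inspect before the filter is applied.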

【Discussion】:

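Perfect. So does the [ ] inside filter let you build a filtered subset without dropping rows the way a conventional filter would?
Perfect. Thanks akrun!
@Jazzmatazz It is just to make sure that we don't lose whole rows. That is, if we did filter(year == 2021) first, we might then need to merge the subset data back in later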
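
To illustrate the point made in the comment above: a plain filter(year == 2021) keeps only the 2021 rows, so the earlier years have to be brought back with a join, whereas subsetting vectors with [ ] inside filter() only computes one value per group and leaves the rows from other years available. The following is a minimal sketch on made-up data. toy_df and the other object names are hypothetical, and semi_join stands in for the merge mentioned in the comment.

```r
library(dplyr)

# Hypothetical toy data: two students, numeric counts.
toy_df <- tibble::tribble(
  ~year, ~full_name, ~type,        ~count,
  2021L, "A",        "Sciences",   5L,
  2021L, "A",        "Humanities", 9L,
  2020L, "A",        "Humanities", 7L,
  2021L, "B",        "Sciences",   3L,
  2019L, "B",        "Sciences",   8L
)

# Filtering on year first discards every pre-2021 row.
only_2021 <- toy_df %>%
  filter(year == 2021)

# Route 1: find the top 2021 type per student, then join back to recover other years.
top_types <- only_2021 %>%
  group_by(full_name) %>%
  slice_max(order_by = count, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(full_name, type)

via_join <- toy_df %>%
  semi_join(top_types, by = c("full_name", "type"))

# Route 2: compute the same type inside filter(); no join is needed.
via_filter <- toy_df %>%
  group_by(full_name) %>%
  filter(type == type[year == 2021][which.max(count[year == 2021])]) %>%
  ungroup()
```

Both routes keep the same four rows here; the second does it in a single grouped step.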
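Yes. That's what I was asking. Thanks! I didn't know you could do that inside filter. What is the arrange/factor at the bottom for?
@Jazzmatazz It just arranges the output in the original order. It can be removed if you instead use group_by(full_name = factor(full_name, levels = unique(full_name))); otherwise, the rows are sorted alphabetically rather than in the order in which the unique names first occur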
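
The alternative mentioned in the last comment can be sketched as follows. Turning full_name into a factor whose levels follow first appearance makes the grouped result come back in that order, so the trailing arrange() is no longer needed. The tibble toy_df is made up for illustration, and converting the factor back to character at the end is an optional extra step that is not in the original answer.

```r
library(dplyr)

# Hypothetical toy data in which appearance order differs from alphabetical order.
toy_df <- tibble::tribble(
  ~year, ~full_name, ~type,      ~count,
  2021L, "Zoe",      "Sciences", 4L,
  2020L, "Zoe",      "Sciences", 6L,
  2021L, "Adam",     "Sciences", 2L,
  2019L, "Adam",     "Sciences", 9L
)

# Grouping on the character column: slice_max() returns groups in sorted order.
alphabetical <- toy_df %>%
  group_by(full_name) %>%
  slice_max(order_by = year, n = 2) %>%
  ungroup()

# Grouping on a factor with levels in order of first appearance keeps "Zoe" first.
by_appearance <- toy_df %>%
  group_by(full_name = factor(full_name, levels = unique(full_name))) %>%
  slice_max(order_by = year, n = 2) %>%
  ungroup() %>%
  mutate(full_name = as.character(full_name))  # optional: back to character

unique(alphabetical$full_name)
unique(by_appearance$full_name)
```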