Filter rows based on the max and unique value in one column: answers

【Question title】: Filter row based on max & unique value in one column
【Posted】: 2021-04-11 09:24:25
【Question】:

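This is a bit tricky to explain, but I have tried my best; the query is below. I have a df as shown below. I need to filter rows by group, keeping the row with the max pop in the country column, but only for a country that has not already occurred in the groups above. (In the output picture, A does not appear in Group 2 because it has already occurred in Group 1.)

In simple words, I need unique values in the country column and, at the same time, the max value of pop (at the group level). I hope the picture conveys what I could not. (A tidyverse solution is preferred.)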

[Image: expected output]

df <- structure(list(
  Group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
  country = c("A", "B", "C", "A", "E", "F", "A", "E", "G"),
  pop = c(200L, 100L, 50L, 200L, 150L, 120L, 200L, 150L, 100L)
), class = "data.frame", row.names = c(NA, -9L))

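【Comments】:

If G also appeared in group A with a value of 150, would the end result be the same? In other words, if a country "loses" in one group, can it "win" another group with a lower value?
I think the OP only wants to eliminate the country from the following iterations. Its pop value does not matter! Let Vaibhav clarify.

Tags: r dplyr tidyverse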

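【Solution 1】:

I think this will do it. Explanation of the syntax:

Split the data into a list with one element per group.
Leave out the first group (it is used as .init in the next step, after being filtered to its maximum pop value).
Use purrr::reduce here, which reduces the list of tibbles to a single tibble.
Iteration used in reduce:
- .init is the filtered first group
- after that, countries from the previous groups are removed through anti_join
- this data is again filtered to the maximum pop
- the previously filtered countries are added back with bind_rows()
So in the end we get the desired tibble.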

library(dplyr)
library(purrr)

groups <- df %>% group_split(Group)  # one tibble per group

groups[-1] %>%
  reduce(
    .init = groups[[1]] %>% filter(pop == max(pop)),  # winner of the first group
    ~ .y %>%
      anti_join(.x, by = "country") %>%  # drop countries that already won
      filter(pop == max(pop)) %>%        # winner of the current group
      bind_rows(.x) %>%
      arrange(Group)
  )

# A tibble: 3 x 3
  Group country   pop
  <int> <chr>   <int>
1     1 A         200
2     2 E         150
3     3 G         100
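
The reduce call above can also be read as a plain loop. The following is only a minimal sketch, not the answerer's code: `pick_winners` is a hypothetical helper name, and it assumes the rule is "for each group in order, keep the max-pop row among countries that have not already won an earlier group".

```r
library(dplyr)

df <- data.frame(
  Group   = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
  country = c("A", "B", "C", "A", "E", "F", "A", "E", "G"),
  pop     = c(200L, 100L, 50L, 200L, 150L, 120L, 200L, 150L, 100L)
)

# Hypothetical helper (an assumption, not part of the answer): walk the groups
# in order and keep the max-pop row among countries that have not yet won.
pick_winners <- function(data) {
  winners <- data[0, ]                      # empty frame with the same columns
  for (g in sort(unique(data$Group))) {
    candidates <- data %>%
      filter(Group == g, !country %in% winners$country)
    if (nrow(candidates) == 0) next         # every country here already won
    winners <- bind_rows(winners, filter(candidates, pop == max(pop)))
  }
  winners
}

result <- pick_winners(df)
```

Like the reduce version, this keeps all tied rows when several countries share a group's maximum, so a tie-breaking rule would have to be added separately if exactly one row per group is required.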

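【Comments】:

Edited (corrected).
This also works when a country has a higher value in a group where it is not the highest.

【Solution 2】:

You can create a helper function that writes each group's maximum pop value into a vector and use it to filter the data frame.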

library(tidyverse)
max_values <- c()

helper <- function(dat, ...){
  dat <- dat[!(dat %in% max_values)] # exclude maximum values from previous groups
  max_value <- max(dat) # get current max. value
  max_values <<- c(max_values, max_value) # append 
  return(max_value)
}

df %>% 
  group_by(Group) %>% 
  filter(pop == helper(pop))

This gives:

# A tibble: 3 x 3
# Groups:   Group [3]
  Group country   pop
  <int> <chr>   <int>
1     1 A         200
2     2 E         150
3     3 H         120

Data used:

> df
   Group country pop
1      1       A 200
2      1       B 100
3      1       C  50
4      2       A 200
5      2       E 150
6      2       F 120
7      3       A 200
8      3       E 150
9      3       G 100
10     3       H 120

【Comments】:

This works even if a country has a higher value in a group where it is not the highest.
This requires the use of a global variable.
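
One way to address the global-variable concern is to keep the running vector inside a closure. This is a sketch under assumptions, not the answerer's code: `make_helper` is a hypothetical name, and, like the original helper, it relies on dplyr evaluating the filter expression once per group in group order and it excludes earlier winners by their pop value rather than by country.

```r
library(dplyr)

# Data used in the answer above (10 rows, including H)
df <- data.frame(
  Group   = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
  country = c("A", "B", "C", "A", "E", "F", "A", "E", "G", "H"),
  pop     = c(200L, 100L, 50L, 200L, 150L, 120L, 200L, 150L, 100L, 120L)
)

# Hypothetical factory: the vector of earlier maxima lives in the closure's
# enclosing environment instead of the global environment.
make_helper <- function() {
  max_values <- c()
  function(dat) {
    dat <- dat[!(dat %in% max_values)]       # exclude maxima of earlier groups
    max_value <- max(dat)                    # current group's maximum
    max_values <<- c(max_values, max_value)  # updates the enclosed vector
    max_value
  }
}

helper <- make_helper()   # fresh state; call make_helper() again to reset

result <- df %>%
  group_by(Group) %>%
  filter(pop == helper(pop)) %>%
  ungroup()
```

Because the state belongs to `helper`, running the pipeline a second time needs a new `helper <- make_helper()`; otherwise the maxima from the first run would still be excluded.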

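【Solution 3】:

Here is another possibility, but it is oversimplified, because it does not account for the possibility that a country has a higher population in a group where it does not win.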

library(dplyr)
df <- structure(list(
  Group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
  country = c("A", "B", "C", "A", "E", "F", "A", "E", "G"),
  pop = c(200L, 100L, 50L, 200L, 150L, 120L, 200L, 150L, 100L)
), class = "data.frame", row.names = c(NA, -9L))

df %>% 
  group_by(country) %>% 
  summarize(popmax = max(pop))  %>% 
  inner_join(df, by = c("popmax" = 'pop')) %>% 
  rename(country = country.y) %>% 
  select(-country.x) %>% 
  group_by(country) %>% 
  arrange(Group) %>% 
  slice(1) %>% 
  ungroup() %>% 
  group_by(Group) %>% 
  arrange(country) %>% 
  slice(1) %>%  
  select(Group, country, popmax) %>% 
  rename(pop = popmax)

My answer fails (while the others do not) on this dataset:

df <- tribble(
  ~Group, ~ country, ~pop,
  1     ,         'A',    200,
  1     ,         'B',    100,
  1     ,         'C',     50,
  1     ,         'G',    150,
  2     ,         'A',    200,
  2     ,         'E',    150,
  2     ,         'F',    120,
  3     ,         'A',    200,
  3     ,         'E',    150,
  3     ,         'G',    100
)

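【Comments】:

【Solution 4】:

Update: @Crestor claims my answer is incorrect.

My answer is correct, because my code produces the desired output that the OP asked for.
Your objection that my code does not work in other scenarios may be right, but it is irrelevant here, because my answer was only meant to solve the task at hand.
Here is an answer for the scenario you proposed, using this dataset: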

df1 <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), 
    country = c("A", "B", "C", "A", "E", "F", "A", "E", "G"), 
    pop = c(200L, 100L, 250L, 220L, 150L, 120L, 200L, 150L, 100L
    )), row.names = c(NA, -9L), class = c("tbl_df", "tbl", "data.frame"
))

Crestor's expected output:

# A tibble: 3 x 3
  Group country   pop
  <int> <chr>   <int>
1     1 C         250
2     2 A         220
3     3 E         150

My code for @crestor's scenario:

library(dplyr)

df1 %>% 
  group_by(country) %>% 
  arrange(Group) %>% 
  filter(pop == max(pop)) %>% 
  group_by(Group) %>% 
  filter(pop == max(pop))

Output:

  Group country   pop
  <int> <chr>   <int>
1     1 C         250
2     2 A         220
3     3 E         150

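Original answer to the OP's question

For simplicity: first use arrange to put your dataset in order. Then group_by country and keep the first row of each group with slice. Then group_by Group and filter for the max pop.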

library(dplyr)
df %>% 
  arrange(country, pop) %>% 
  group_by(country) %>% 
  slice(1) %>% 
  group_by(Group) %>% 
  filter(pop==max(pop))

Output:

  Group country   pop
  <int> <chr>   <int>
1     1 A         200
2     2 E         150
3     3 G         100

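【Comments】:

One question: why does your code put A in Group 1? It seems to have nothing to do with the fact that A is in Group 1 before it is in Group 2. What if ties are broken by group order rather than by country order?
Thanks for your valuable comment. A is part of Group 1. In any case, where the subsequent As are placed does not matter here. Occurrences of a country after its first maximum are irrelevant, because the premise is: "the max pop in the country column that has not yet occurred in the groups above". The logic requires some out-of-the-box thinking, and unless other evidence is provided, this approach seems to work.
This is not correct. For example, in this case: df
The answer is specific to the question. For your dataset, change group_by(country) to group_by(country, pop). Let me know what you think? Cheers.
Even after changing group_by(country) to group_by(country, pop), it does not give the correct answer.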
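
The last claim can be checked against `df1` and the expected output listed in the answer. This is a sketch under an assumption: the comments do not say which pipeline the suggested change applies to, so it is applied here to the answer's original arrange/slice pipeline; the names `variant` and `expected_country` are introduced only for illustration.

```r
library(dplyr)

# df1 as given in the answer
df1 <- tibble(
  Group   = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
  country = c("A", "B", "C", "A", "E", "F", "A", "E", "G"),
  pop     = c(200L, 100L, 250L, 220L, 150L, 120L, 200L, 150L, 100L)
)

# Assumed reading of the suggestion: the original pipeline with
# group_by(country) replaced by group_by(country, pop)
variant <- df1 %>%
  arrange(country, pop) %>%
  group_by(country, pop) %>%
  slice(1) %>%                  # first row of each (country, pop) pair
  group_by(Group) %>%
  filter(pop == max(pop)) %>%
  ungroup() %>%
  arrange(Group)

# Countries from the expected output listed in the answer
expected_country <- c("C", "A", "E")

identical(variant$country, expected_country)  # FALSE: Group 3 yields G, not E
```

Under this reading, E's 150 row in Group 3 is dropped by slice(1) because the identical (E, 150) pair already occurs in Group 2, which leaves G as the only candidate for Group 3.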