Filtering out groups that only have one type of value in R [duplicate]: answers

【Question title】: Filtering out groups that only have one type of value in R [duplicate]
【Posted】: 2021-11-14 06:34:20
【Question description】:

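I am trying to filter out groups in a data frame that have only one type of value associated with them. I imagine this is fairly simple. Here is my data frame: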

example<-structure(list(UserID = c("AAA", "AAA", "AAA", "AAA", "AAA", 
                                   "AAA", "AAA", "AAA", "AAA", "BBB", "BBB", "BBB", "BBB", "BBB", 
                                   "BBB", "BBB", "BBB", "CCC", "CCC", "CCC", "CCC", "CCC", "CCC", 
                                   "CCC", "DDD", "DDD", "DDD", "DDD", "DDD", "DDD", "DDD", "DDD", 
                                   "DDD", "DDD", "DDD"), Status = c("Cluster 1", "Cluster 1", "Cluster 1", 
                                                                    "NotActive", "NotActive", "Cluster 1", "Cluster 2", "Cluster 2", 
                                                                    "Cluster 2", "NotActive", "NotActive", "NotActive", "NotActive", 
                                                                    "NotActive", "NotActive", "NotActive", "NotActive", "NotActive", 
                                                                    "NotActive", "NotActive", "NotActive", "Cluster 1", "Cluster 1", 
                                                                    "NotActive", "NotActive", "NotActive", "NotActive", "NotActive", 
                                                                    "NotActive", "NotActive", "NotActive", "NotActive", "NotActive", 
                                                                    "NotActive", "NotActive")), row.names = c(NA, -35L), class = c("tbl_df", 
                                                                                                                                   "tbl", "data.frame"))

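Basically, I want to filter out groups whose only status is "NotActive". Some users have a mix of "Cluster _" and "NotActive", and I want to keep those. I have a large dataset with thousands of groups that need filtering, so it is not as simple as dropping UserID BBB and DDD as in the example; I need something that can be applied at a larger scale. UserIDs such as AAA and CCC should be kept with their mixed values, including "NotActive"; only users whose status is solely "NotActive" should be removed.

Any pointers would be great :)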

【Question comments】:

Does this answer your question? group by and filter data management using dplyr

Tags: r filter dplyr

【Solution 1】:

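Here is one way. Group by UserID and keep a group if it has more than one distinct status or if any of its statuses differs from "NotActive".

library(dplyr)

example %>% 
  group_by(UserID) %>%
  # n_distinct(Status) counts distinct statuses; a plain row count, n() > 1,
  # would also keep groups whose rows are all "NotActive"
  filter(n_distinct(Status) > 1 | any(Status != "NotActive"))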
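
Before filtering, it can help to summarise each group and look at what a group-level condition would see. The following is a minimal sketch, assuming dplyr is installed; the `toy` tibble and the column names `n_rows`, `n_status` and `only_not_active` are made up for illustration and are not part of the question's data.

```r
library(dplyr)

# Toy stand-in for the question's data (illustrative values only)
toy <- tibble(
  UserID = c("AAA", "AAA", "BBB", "BBB", "CCC"),
  Status = c("Cluster 1", "NotActive", "NotActive", "NotActive", "Cluster 2")
)

# One row per UserID: row count, number of distinct statuses,
# and whether every status in the group is "NotActive"
per_group <- toy %>%
  group_by(UserID) %>%
  summarise(
    n_rows = n(),
    n_status = n_distinct(Status),
    only_not_active = all(Status == "NotActive")
  )

per_group
```

In this toy data BBB has two rows but a single status, so the row count alone does not tell a mixed group from a "NotActive"-only one; `only_not_active` is the column that flags the groups to drop.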

【Comments】:

【Solution 2】:

You can use any to keep only the groups that have at least one value that is not 'NotActive'.

In dplyr, you can use:

library(dplyr)
example %>%  group_by(UserID) %>% filter(any(Status != 'NotActive'))

#   UserID Status   
#   <chr>  <chr>    
# 1 AAA    Cluster 1
# 2 AAA    Cluster 1
# 3 AAA    Cluster 1
# 4 AAA    NotActive
# 5 AAA    NotActive
# 6 AAA    Cluster 1
# 7 AAA    Cluster 2
# 8 AAA    Cluster 2
# 9 AAA    Cluster 2
#10 CCC    NotActive
#11 CCC    NotActive
#12 CCC    NotActive
#13 CCC    NotActive
#14 CCC    Cluster 1
#15 CCC    Cluster 1
#16 CCC    NotActive

The same logic in base R and data.table:

#Base R
subset(example, ave(Status != 'NotActive', UserID, FUN = any))


#data.table
library(data.table)
setDT(example)[, .SD[any(Status != 'NotActive')], UserID]
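
Another base R route, which may suit a large table, is to first collect the IDs that have at least one row with a status other than "NotActive" and then subset on those IDs. This is a minimal sketch and not part of the answer above; the `toy` data frame and the names `keep_ids` and `kept` are illustrative.

```r
# Toy stand-in for the question's data (illustrative values only)
toy <- data.frame(
  UserID = c("AAA", "AAA", "BBB", "BBB", "CCC", "CCC"),
  Status = c("Cluster 1", "NotActive", "NotActive", "NotActive",
             "NotActive", "Cluster 1"),
  stringsAsFactors = FALSE
)

# IDs with at least one status other than "NotActive"
keep_ids <- unique(toy$UserID[toy$Status != "NotActive"])

# Keep every row belonging to those IDs, including their "NotActive" rows
kept <- toy[toy$UserID %in% keep_ids, ]
kept
```

If Status can contain NA, wrapping the comparison in which() keeps the NA rows from leaking into `keep_ids`.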

【Comments】:

【Solution 3】:

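I don't know of a pure dplyr solution, but if the tibble is not too large to apply table to, this should do the trick. Note that this solution does not test for one particular status (e.g. "NotActive"); it keeps the UserIDs that have more than one distinct status, whatever those statuses are:

Get the data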

example<-structure(list(UserID = c("AAA", "AAA", "AAA", "AAA", "AAA", 
                                   "AAA", "AAA", "AAA", "AAA", "BBB", "BBB", "BBB", "BBB", "BBB", 
                                   "BBB", "BBB", "BBB", "CCC", "CCC", "CCC", "CCC", "CCC", "CCC", 
                                   "CCC", "DDD", "DDD", "DDD", "DDD", "DDD", "DDD", "DDD", "DDD", 
                                   "DDD", "DDD", "DDD"), Status = c("Cluster 1", "Cluster 1", "Cluster 1", 
                                                                    "NotActive", "NotActive", "Cluster 1", "Cluster 2", "Cluster 2", 
                                                                    "Cluster 2", "NotActive", "NotActive", "NotActive", "NotActive", 
                                                                    "NotActive", "NotActive", "NotActive", "NotActive", "NotActive", 
                                                                    "NotActive", "NotActive", "NotActive", "Cluster 1", "Cluster 1", 
                                                                    "NotActive", "NotActive", "NotActive", "NotActive", "NotActive", 
                                                                    "NotActive", "NotActive", "NotActive", "NotActive", "NotActive", 
                                                                    "NotActive", "NotActive")), row.names = c(NA, -35L), class = c("tbl_df", 
                                                                                                                                   "tbl", "data.frame"))


example
#> # A tibble: 35 × 2
#>    UserID Status   
#>    <chr>  <chr>    
#>  1 AAA    Cluster 1
#>  2 AAA    Cluster 1
#>  3 AAA    Cluster 1
#>  4 AAA    NotActive
#>  5 AAA    NotActive
#>  6 AAA    Cluster 1
#>  7 AAA    Cluster 2
#>  8 AAA    Cluster 2
#>  9 AAA    Cluster 2
#> 10 BBB    NotActive
#> # … with 25 more rows

Filter the rows

library(dplyr)

num_status <- apply(table(example$UserID, example$Status), 1, function(x) length(x[x>0]))

example %>% filter(UserID %in% names(which(num_status > 1)))

#> # A tibble: 16 × 2
#>    UserID Status   
#>    <chr>  <chr>    
#>  1 AAA    Cluster 1
#>  2 AAA    Cluster 1
#>  3 AAA    Cluster 1
#>  4 AAA    NotActive
#>  5 AAA    NotActive
#>  6 AAA    Cluster 1
#>  7 AAA    Cluster 2
#>  8 AAA    Cluster 2
#>  9 AAA    Cluster 2
#> 10 CCC    NotActive
#> 11 CCC    NotActive
#> 12 CCC    NotActive
#> 13 CCC    NotActive
#> 14 CCC    Cluster 1
#> 15 CCC    Cluster 1
#> 16 CCC    NotActive

Created on 2021-09-20 by the reprex package (v2.0.1)

【Comments】:
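
The "more than one distinct status" rule can also be written with dplyr's n_distinct(), and it is worth seeing where it differs from the any() test. The following is a minimal sketch, assuming dplyr is installed; the `toy` tibble is made up so that one group (CCC) has a single status that is not "NotActive".

```r
library(dplyr)

# Toy data (illustrative values only): CCC has one status, but it is not "NotActive"
toy <- tibble(
  UserID = c("AAA", "AAA", "BBB", "BBB", "CCC", "CCC"),
  Status = c("Cluster 1", "NotActive", "NotActive", "NotActive",
             "Cluster 1", "Cluster 1")
)

# Rule A: keep groups with more than one distinct status
by_distinct <- toy %>%
  group_by(UserID) %>%
  filter(n_distinct(Status) > 1) %>%
  ungroup()

# Rule B: keep groups with at least one status other than "NotActive"
by_any <- toy %>%
  group_by(UserID) %>%
  filter(any(Status != "NotActive")) %>%
  ungroup()

unique(by_distinct$UserID)  # "AAA"
unique(by_any$UserID)       # "AAA" "CCC"
```

Both rules give the same result on the question's example data, where every non-"NotActive" group is mixed. They diverge only for groups like CCC here, so which one to use depends on whether such groups should be kept.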