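【Question title】: Filtering out groups that only have one type of value in R [duplicate]
【Posted】: 2021-11-14 06:34:20
【Question】:

I am trying to filter out the groups in a data frame that have only one type of value associated with them. I imagine this is quite simple. Here is my data frame: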

example<-structure(list(UserID = c("AAA", "AAA", "AAA", "AAA", "AAA", 
                                   "AAA", "AAA", "AAA", "AAA", "BBB", "BBB", "BBB", "BBB", "BBB", 
                                   "BBB", "BBB", "BBB", "CCC", "CCC", "CCC", "CCC", "CCC", "CCC", 
                                   "CCC", "DDD", "DDD", "DDD", "DDD", "DDD", "DDD", "DDD", "DDD", 
                                   "DDD", "DDD", "DDD"), Status = c("Cluster 1", "Cluster 1", "Cluster 1", 
                                                                    "NotActive", "NotActive", "Cluster 1", "Cluster 2", "Cluster 2", 
                                                                    "Cluster 2", "NotActive", "NotActive", "NotActive", "NotActive", 
                                                                    "NotActive", "NotActive", "NotActive", "NotActive", "NotActive", 
                                                                    "NotActive", "NotActive", "NotActive", "Cluster 1", "Cluster 1", 
                                                                    "NotActive", "NotActive", "NotActive", "NotActive", "NotActive", 
                                                                    "NotActive", "NotActive", "NotActive", "NotActive", "NotActive", 
                                                                    "NotActive", "NotActive")), row.names = c(NA, -35L), class = c("tbl_df", 
                                                                                                                                   "tbl", "data.frame"))


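Basically, I want to filter out the groups whose only status is "NotActive". Some users have a mix of "Cluster _" and "NotActive", and I want to keep those. I have a large dataset with thousands of groups to filter, so it is not as simple as dropping UserID BBB and DDD as in the example; I need something that can be applied at a larger scale. User IDs like AAA and CCC should be kept with their mix of values, including "NotActive"; only users whose sole status is "NotActive" should be removed.

Any pointers would be great :)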

Tags: r, filter, dplyr


【Solution 1】:

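Here is one way. Group by UserID and keep a group if it has more than one distinct status or if any of its statuses differs from "NotActive".

library(dplyr)

example %>% 
  group_by(UserID) %>%
  # n_distinct(Status) counts distinct statuses per group; n() would count rows
  filter(n_distinct(Status) > 1 | any(Status != "NotActive"))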
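
Before filtering thousands of groups, it can help to look at a per-group summary and see which groups a condition like this would select. The following is only a sketch: the data is a compact rebuild of the question's 35 rows as a plain data frame, and the names group_check, n_rows, n_status and keep are made up for illustration.

```r
library(dplyr)

# Compact rebuild of the question's data (same 35 rows, plain data frame).
example <- data.frame(
  UserID = rep(c("AAA", "BBB", "CCC", "DDD"), times = c(9, 8, 7, 11)),
  Status = c(
    rep(c("Cluster 1", "NotActive", "Cluster 1", "Cluster 2"), times = c(3, 2, 1, 3)), # AAA
    rep("NotActive", 8),                                                               # BBB
    rep(c("NotActive", "Cluster 1", "NotActive"), times = c(4, 2, 1)),                 # CCC
    rep("NotActive", 11)                                                               # DDD
  ),
  stringsAsFactors = FALSE
)

# One row per UserID: row count, number of distinct statuses,
# and whether the group has any status other than "NotActive".
group_check <- example %>%
  group_by(UserID) %>%
  summarise(
    n_rows   = n(),
    n_status = n_distinct(Status),
    keep     = any(Status != "NotActive")
  )

group_check
```

In this summary every group has more than one row, so a row count alone cannot separate the groups; BBB and DDD are the ones with a single distinct status and keep equal to FALSE.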

【Solution 2】:

You can use any to keep only the groups that have at least one value that is not 'NotActive'.

With dplyr, you can use:

    library(dplyr)
    example %>%  group_by(UserID) %>% filter(any(Status != 'NotActive'))
    
    #   UserID Status   
    #   <chr>  <chr>    
    # 1 AAA    Cluster 1
    # 2 AAA    Cluster 1
    # 3 AAA    Cluster 1
    # 4 AAA    NotActive
    # 5 AAA    NotActive
    # 6 AAA    Cluster 1
    # 7 AAA    Cluster 2
    # 8 AAA    Cluster 2
    # 9 AAA    Cluster 2
    #10 CCC    NotActive
    #11 CCC    NotActive
    #12 CCC    NotActive
    #13 CCC    NotActive
    #14 CCC    Cluster 1
    #15 CCC    Cluster 1
    #16 CCC    NotActive
    

The same in base R and data.table:

    #Base R
    subset(example, ave(Status != 'NotActive', UserID, FUN = any))
    
    
    #data.table
    library(data.table)
    setDT(example)[, .SD[any(Status != 'NotActive')], UserID]
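
If the list of retained UserIDs is useful in its own right, for example to inspect it before touching a large table, the same test can be split into two steps in base R. This is a sketch under assumptions rather than part of the answer above: the data is a compact rebuild of the question's rows, and has_other, keep_ids and result are illustrative names.

```r
# Compact rebuild of the question's data (same 35 rows, plain data frame).
example <- data.frame(
  UserID = rep(c("AAA", "BBB", "CCC", "DDD"), times = c(9, 8, 7, 11)),
  Status = c(
    rep(c("Cluster 1", "NotActive", "Cluster 1", "Cluster 2"), times = c(3, 2, 1, 3)), # AAA
    rep("NotActive", 8),                                                               # BBB
    rep(c("NotActive", "Cluster 1", "NotActive"), times = c(4, 2, 1)),                 # CCC
    rep("NotActive", 11)                                                               # DDD
  ),
  stringsAsFactors = FALSE
)

# Step 1: one logical flag per UserID, TRUE if the group has
# at least one status other than "NotActive".
has_other <- tapply(example$Status != "NotActive", example$UserID, any)

# Step 2: keep only the rows whose UserID was flagged.
keep_ids <- names(has_other)[as.vector(has_other)]
result   <- example[example$UserID %in% keep_ids, ]

keep_ids
nrow(result)
```

Here keep_ids holds "AAA" and "CCC", and result has the same 16 rows as the dplyr output shown above.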
    

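【Solution 3】:

I don't know of a pure dplyr solution, but as long as the tibble is not too large to apply table to, this should do the trick. Note that this solution does not test for one specific status (e.g. "NotActive"): it keeps the UserIDs that have more than one distinct status, whatever those statuses are. A UserID whose only status is, say, "Cluster 1" would therefore be dropped as well.

Get the data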

      example<-structure(list(UserID = c("AAA", "AAA", "AAA", "AAA", "AAA", 
                                         "AAA", "AAA", "AAA", "AAA", "BBB", "BBB", "BBB", "BBB", "BBB", 
                                         "BBB", "BBB", "BBB", "CCC", "CCC", "CCC", "CCC", "CCC", "CCC", 
                                         "CCC", "DDD", "DDD", "DDD", "DDD", "DDD", "DDD", "DDD", "DDD", 
                                         "DDD", "DDD", "DDD"), Status = c("Cluster 1", "Cluster 1", "Cluster 1", 
                                                                          "NotActive", "NotActive", "Cluster 1", "Cluster 2", "Cluster 2", 
                                                                          "Cluster 2", "NotActive", "NotActive", "NotActive", "NotActive", 
                                                                          "NotActive", "NotActive", "NotActive", "NotActive", "NotActive", 
                                                                          "NotActive", "NotActive", "NotActive", "Cluster 1", "Cluster 1", 
                                                                          "NotActive", "NotActive", "NotActive", "NotActive", "NotActive", 
                                                                          "NotActive", "NotActive", "NotActive", "NotActive", "NotActive", 
                                                                          "NotActive", "NotActive")), row.names = c(NA, -35L), class = c("tbl_df", 
                                                                                                                                         "tbl", "data.frame"))
      
      
      example
      #> # A tibble: 35 × 2
      #>    UserID Status   
      #>    <chr>  <chr>    
      #>  1 AAA    Cluster 1
      #>  2 AAA    Cluster 1
      #>  3 AAA    Cluster 1
      #>  4 AAA    NotActive
      #>  5 AAA    NotActive
      #>  6 AAA    Cluster 1
      #>  7 AAA    Cluster 2
      #>  8 AAA    Cluster 2
      #>  9 AAA    Cluster 2
      #> 10 BBB    NotActive
      #> # … with 25 more rows
      
      

Filter the rows

      library(dplyr)
      
      num_status <- apply(table(example$UserID, example$Status), 1, function(x) length(x[x>0]))
      
      example %>% filter(UserID %in% names(which(num_status > 1)))
      
      #> # A tibble: 16 × 2
      #>    UserID Status   
      #>    <chr>  <chr>    
      #>  1 AAA    Cluster 1
      #>  2 AAA    Cluster 1
      #>  3 AAA    Cluster 1
      #>  4 AAA    NotActive
      #>  5 AAA    NotActive
      #>  6 AAA    Cluster 1
      #>  7 AAA    Cluster 2
      #>  8 AAA    Cluster 2
      #>  9 AAA    Cluster 2
      #> 10 CCC    NotActive
      #> 11 CCC    NotActive
      #> 12 CCC    NotActive
      #> 13 CCC    NotActive
      #> 14 CCC    Cluster 1
      #> 15 CCC    Cluster 1
      #> 16 CCC    NotActive
      

Created on 2021-09-20 by the reprex package (v2.0.1)
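
For comparison, the "more than one status" rule can probably be written in dplyr alone with n_distinct, which also makes the difference from the any(Status != "NotActive") rule easy to see. The sketch below rests on assumptions: the data is a compact rebuild of the question's rows, and the user "EEE" with only "Cluster 1" is hypothetical, added purely to show where the two rules disagree.

```r
library(dplyr)

# Compact rebuild of the question's data (same 35 rows, plain data frame).
example <- data.frame(
  UserID = rep(c("AAA", "BBB", "CCC", "DDD"), times = c(9, 8, 7, 11)),
  Status = c(
    rep(c("Cluster 1", "NotActive", "Cluster 1", "Cluster 2"), times = c(3, 2, 1, 3)), # AAA
    rep("NotActive", 8),                                                               # BBB
    rep(c("NotActive", "Cluster 1", "NotActive"), times = c(4, 2, 1)),                 # CCC
    rep("NotActive", 11)                                                               # DDD
  ),
  stringsAsFactors = FALSE
)

# Hypothetical extra user whose only status is "Cluster 1" (not in the question).
with_eee <- rbind(
  example,
  data.frame(UserID = "EEE", Status = rep("Cluster 1", 3), stringsAsFactors = FALSE)
)

# Rule A: keep groups with more than one distinct status
# (the same rule the table() approach applies).
multi_status <- with_eee %>%
  group_by(UserID) %>%
  filter(n_distinct(Status) > 1) %>%
  ungroup()

# Rule B: keep groups with at least one status other than "NotActive".
not_only_inactive <- with_eee %>%
  group_by(UserID) %>%
  filter(any(Status != "NotActive")) %>%
  ungroup()

unique(multi_status$UserID)       # "AAA" "CCC"
unique(not_only_inactive$UserID)  # "AAA" "CCC" "EEE"
```

On the question's own data both rules return the same 16 rows; they only differ for groups with a single status that is not "NotActive", so which one fits depends on whether such users should be kept.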
