【Question title】: How to select top N values and group the rest of the remaining ones
【Posted】: 2022-01-15 04:09:00
【Question】:

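How can I pick the 4 groups of a data frame with the highest values in a count column, and create a 5th group that sums up the remaining groups and their values?

What I have done so far: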

library(dplyr)

dummy_dataframe <- data.frame(group = c("A", "B", "A", "A", "C", "C", "D", "E", "F", "D", "G"))

# Count the rows in each group
df_aggregate <- aggregate(cbind(count = group) ~ group,
                          data = dummy_dataframe,
                          FUN = function(x) NROW(x))

# Keep the four groups with the highest counts
df_sliced <- df_aggregate %>%
  arrange(desc(count)) %>%
  slice(1:4)

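With the code above I get a data frame containing the 4 groups with the highest counts, but how can I add a fifth group that sums up the values of the groups left out (E, F and G)? Something like this: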

   group     count
1     A        3
2     B        1
3     C        2
4     D        2
5   others     3

【Comments】:

    Tags: r dataframe


    【Solution 1】:

    Short and sweet:

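    # Sort once by count, keep the top 4 rows, and append one summary row for the rest.
    # Building the extra row as a data frame keeps the count column numeric.
    sorted <- df_aggregate[order(df_aggregate$count, decreasing = TRUE), ]
    result <- rbind(sorted[1:4, ],
                    data.frame(group = "rest", count = sum(sorted[5:nrow(sorted), "count"])))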
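
The same idea can be wrapped in a small reusable function. This is only a sketch in base R: the name `top_n_with_rest` and its parameters `n` and `rest_label` are made up for illustration, and it assumes the input has the columns `group` and `count` as in the question.

```r
# Hypothetical helper (not from the original answers): keep the n rows with the
# highest count and sum all remaining rows into a single row.
top_n_with_rest <- function(counts, n = 4, rest_label = "others") {
  # order() leaves ties in their original order
  sorted <- counts[order(-counts$count), ]
  if (nrow(sorted) <= n) {
    return(sorted)  # nothing left to collapse
  }
  top  <- sorted[seq_len(n), ]
  rest <- data.frame(group = rest_label,
                     count = sum(sorted$count[-seq_len(n)]))
  out <- rbind(top, rest)
  rownames(out) <- NULL
  out
}

dummy_dataframe <- data.frame(group = c("A", "B", "A", "A", "C", "C", "D", "E", "F", "D", "G"))

# Per-group counts as a data frame with columns group (character) and count (integer)
counts <- as.data.frame(table(group = dummy_dataframe$group),
                        responseName = "count", stringsAsFactors = FALSE)

result <- top_n_with_rest(counts, n = 4)
```

For the data in the question, `result` holds the groups A, C, D and B followed by an `others` row with count 3, and `count` stays numeric. The early return covers inputs with `n` groups or fewer, where there is nothing to collapse.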
    

    【Comments】:

    • Nice, thank you!
    【Solution 2】:

    You can run a few tidyverse operations directly on the original data frame:

    library(tidyverse)
    dummy_dataframe %>%
      count(group, sort = TRUE) %>%                          # sort by count so rows 1-4 are the top 4
      mutate(id = if_else(row_number() < 5, 1L, 2L)) %>%     # 1 = top 4, 2 = everything else
      group_by(id) %>%
      arrange(id, -n) %>%
      mutate(group = if_else(id == 2L, "others", group),
             n = if_else(group == "others", sum(n), n)) %>%  # sum the counts of the remaining groups
      ungroup() %>%
      distinct() %>%
      select(-id)
    

    This gives:

    # A tibble: 5 x 2
      group      n
      <chr>  <int>
    1 A          3
    2 C          2
    3 D          2
    4 B          1
    5 others     3
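
Picking the top groups by row position only works if the rows are in descending count order when `row_number()` is evaluated. A minimal sketch of that point, using made-up data in which the most frequent groups come last alphabetically (the names `df` and `lumped` are illustrative, and character columns as created by `data.frame()` in R 4.0 or later are assumed):

```r
library(dplyr)

# Hypothetical data: "E" and "F" are the most frequent groups but sort last by name
df <- data.frame(group = c("A", "B", "C", "D", "E", "E", "E", "F", "F"))

lumped <- df %>%
  count(group, sort = TRUE) %>%                                   # highest counts first
  mutate(group = if_else(row_number() <= 4, group, "others")) %>% # relabel everything after row 4
  # a factor with levels in order of appearance keeps the sorted row order
  group_by(group = factor(group, levels = unique(group))) %>%
  summarise(n = sum(n))

lumped
```

Here the top four are E, F, A and B, and C and D end up in `others`. Without `sort = TRUE`, the first four rows would be A to D and the two largest groups would be lumped together instead.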
    


      【Solution 3】:

      I would use the dplyr package and its capabilities throughout:

      library(dplyr)

      dummy_dataframe <- data.frame(group = c("A", "B", "A", "A", "C", "C", "D", "E", "F", "D", "G"))

      # Count rows per group, largest first
      df_aggregate <- dummy_dataframe %>%
        group_by(group) %>%
        summarise(count = n()) %>%
        arrange(desc(count))

      df_top_4_groups <- df_aggregate %>%
        slice(1:4)

      # Everything outside the top 4 becomes "others"; sum the counts
      # (n() would only count the leftover rows)
      df_others <- df_aggregate %>%
        anti_join(df_top_4_groups, by = "group") %>%
        mutate(group = "others") %>%
        group_by(group) %>%
        summarise(count = sum(count))

      df_finale <- df_top_4_groups %>%
        bind_rows(df_others)

      df_finale
      # A tibble: 5 x 2
        group  count
        <chr>  <int>
      1 A          3
      2 C          2
      3 D          2
      4 B          1
      5 others     3
      

      There is nothing wrong with your use of aggregate (very cool ;)), but I think using pipes from top to bottom makes the code more readable.
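
One detail is worth checking when collapsing the leftover rows of an already aggregated table: `n()` counts the rows, while `sum(count)` adds up the underlying counts. The two agree for the example data only because E, F and G each occur exactly once. A small sketch with made-up leftover counts (the names `leftover`, `rows_counted` and `counts_summed` are illustrative):

```r
library(dplyr)

# Hypothetical leftover rows of an aggregated table
leftover <- data.frame(group = c("E", "F", "G"), count = c(5L, 2L, 1L))

# n() counts the leftover rows
rows_counted <- leftover %>%
  summarise(group = "others", count = n())

# sum(count) adds up the observations behind those rows
counts_summed <- leftover %>%
  summarise(group = "others", count = sum(count))

rows_counted$count   # 3
counts_summed$count  # 8
```

For an "others" row that should represent all remaining observations, the summed version is the one that matches the totals of the original data.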
