【Question title】: Sample_n with if_else after group_by in dataframe
【Posted】: 2020-11-30 12:23:24
【Question】:

Here is a test DF:

test_df <- structure(list(plant_sp = c("plant_1", "plant_1", "plant_2", "plant_2", "plant_3",
                                       "plant_3", "plant_3", "plant_3", "plant_3", "plant_4", 
                                       "plant_4", "plant_4", "plant_4", "plant_4", "plant_4",
                                       "plant_5", "plant_5", "plant_5", "plant_5", "plant_5"), 
                          site = c("a", "a", "a", "a", "a",  
                                   "b", "b", "b", "b", "b",  
                                   "a", "a", "a", "a", "a",
                                   "b", "b", "b", "b", "b"),
                          sp_rich = c(5, 3, 5, 3, 5, 
                                      7, 8, 8, 8, 10,
                                      1, 4, 5, 6, 3, 
                                      7, 3, 12, 12,11)), 
                     row.names = c(NA, -20L), class = "data.frame", 
                     .Names = c("plant_sp", "site", "sp_rich"))

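I want to group_by plant_sp and draw 3 random rows from a group if that group has more than 3 rows.

In other words: take each group, and if the group has more than 3 rows, randomly keep only 3 rows of that group.

I am trying to use if_else, but I can't get it to work: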

test_df <- test_df %>% group_by(plant_sp) %>%
if_else(length(plant_sp) > 3, sample_n(size =3))

I guess I am not using the length() function correctly.

Can you help me?

Thanks, Ido

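【Comments】:

    Tags: r dplyr tidyr


    【Solution 1】:

    If you are using dplyr 1.0.0 or later, you can use slice_sample. With n = 3 it keeps 3 randomly chosen rows in each group. If a group has fewer than 3 rows, it keeps all of them.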

    library(dplyr)
    test_df %>% group_by(plant_sp) %>% slice_sample(n = 3)
    
    #  plant_sp site  sp_rich
    #   <chr>    <chr>   <dbl>
    # 1 plant_1  a           3
    # 2 plant_1  a           5
    # 3 plant_2  a           5
    # 4 plant_2  a           3
    # 5 plant_3  b           8
    # 6 plant_3  b           8
    # 7 plant_3  b           7
    # 8 plant_4  b          10
    # 9 plant_4  a           5
    #10 plant_4  a           4
    #11 plant_5  b           7
    #12 plant_5  b          12
    #13 plant_5  b           3
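
For dplyr versions older than 1.0.0, where slice_sample is not available, the same per-group cap can probably be reproduced with slice() and sample(). The following is only a minimal sketch, not part of the original answer; small_df, grp and val are hypothetical names made up for illustration, and set.seed is there only to make the draw repeatable.

```r
library(dplyr)

# Hypothetical toy data: group "x" has 2 rows, group "y" has 5 rows.
small_df <- data.frame(
  grp = c("x", "x", "y", "y", "y", "y", "y"),
  val = 1:7
)

set.seed(42)  # assumption: fixed seed, only for a repeatable draw

capped <- small_df %>%
  group_by(grp) %>%
  # seq_len(n()) lists the row positions of the current group;
  # min(3, n()) keeps the whole group when it has 3 rows or fewer.
  slice(sample(seq_len(n()), min(3, n()))) %>%
  ungroup()

# Number of rows kept per group
count(capped, grp)
```

Capping the sample size with min(3, n()) is what stops sample() from failing on groups smaller than 3, which is the case the question was trying to handle with if_else.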
    

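    【Comments】:

      【Solution 2】:

      Does this help? It may not be the most elegant version, but it should do the trick.

      Here is the answer, edited in response to the comment: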

      test_df <- structure(list(plant_sp = c("plant_1", "plant_1", "plant_2", "plant_2", "plant_3",
                                             "plant_3", "plant_3", "plant_3", "plant_3", "plant_4", 
                                             "plant_4", "plant_4", "plant_4", "plant_4", "plant_4",
                                             "plant_5", "plant_5", "plant_5", "plant_5", "plant_5"), 
                                site = c("a", "a", "a", "a", "a",  
                                         "b", "b", "b", "b", "b",  
                                         "a", "a", "a", "a", "a",
                                         "b", "b", "b", "b", "b"),
                                sp_rich = c(5, 3, 5, 3, 5, 
                                            7, 8, 8, 8, 10,
                                            1, 4, 5, 6, 3, 
                                            7, 3, 12, 12,11)), 
                           row.names = c(NA, -20L), class = "data.frame", 
                           .Names = c("plant_sp", "site", "sp_rich"))
      
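      library(tidyverse)
      # Record each row's position within its group and the group size (row_max)
      df_group <- test_df %>% 
        group_by(plant_sp) %>% 
        mutate(row_number=row_number()) %>% 
        mutate(row_max=max(row_number)) %>% 
        ungroup()
      
      # Groups with more than 3 rows: draw 3 random rows from each
      df_3 <- df_group %>% 
        group_by(plant_sp) %>% 
        filter(row_max>3) %>% 
        slice_sample(n = 3)
      
      # Groups with 3 rows or fewer: keep every row
      df_small <- df_group %>% 
        filter(row_max<4)
      
      # Recombine both parts and sort by plant
      df_test <- bind_rows(df_3, df_small) %>% 
        arrange(plant_sp)
      df_test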
      #> # A tibble: 13 x 5
      #> # Groups:   plant_sp [5]
      #>    plant_sp site  sp_rich row_number row_max
      #>    <chr>    <chr>   <dbl>      <int>   <int>
      #>  1 plant_1  a           5          1       2
      #>  2 plant_1  a           3          2       2
      #>  3 plant_2  a           5          1       2
      #>  4 plant_2  a           3          2       2
      #>  5 plant_3  b           8          4       5
      #>  6 plant_3  a           5          1       5
      #>  7 plant_3  b           7          2       5
      #>  8 plant_4  a           3          6       6
      #>  9 plant_4  b          10          1       6
      #> 10 plant_4  a           5          4       6
      #> 11 plant_5  b           7          1       5
      #> 12 plant_5  b          12          4       5
      #> 13 plant_5  b          12          3       5
      

      Created on 2020-11-30 by the reprex package (v0.3.0)
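
One way to confirm that small groups such as plant_1 and plant_2 survive the sampling is to compare the group sizes before and after. The sketch below is an illustration under stated assumptions, not the answerer's code: check_df is a compact stand-in that only reproduces the group sizes of test_df (2, 2, 5, 6, 5), and row_id, n_before, n_after and expected are hypothetical column names.

```r
library(dplyr)

# Stand-in for test_df: same plant_sp group sizes, hypothetical row_id column
sizes <- c(plant_1 = 2, plant_2 = 2, plant_3 = 5, plant_4 = 6, plant_5 = 5)
check_df <- data.frame(
  plant_sp = rep(names(sizes), times = sizes),
  row_id = seq_len(sum(sizes))
)

# Sample at most 3 rows per group in a single step
sampled <- check_df %>%
  group_by(plant_sp) %>%
  slice_sample(n = 3) %>%
  ungroup()

# Group sizes before and after sampling
before <- count(check_df, plant_sp, name = "n_before")
after <- count(sampled, plant_sp, name = "n_after")

# Each group should end up with min(original size, 3) rows
check <- left_join(before, after, by = "plant_sp") %>%
  mutate(expected = pmin(n_before, 3L))
check
```

If n_after matches expected in every row, no group was dropped and no group kept more than 3 rows.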

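      【Comments】:

      • Thanks a lot, but is there any way to keep plant_1 and plant_2 in the DF? I need them.
      • @Ido See above. I edited the answer. I simply split the data frame into two groups.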