【Question title】: R: subset data.frame based on column value using dplyr
【Posted】: 2017-12-20 01:25:22
【Question description】:
library(dplyr)
mydat1 <- data.frame(ID = c(1, 1, 2, 2),
                    Gender = c("Male", "Female", "Male", "Male"),
                    Score = c(30, 40, 20, 60))
mydat1 %>%
  group_by(ID, Gender) %>%
  slice(which.min(Score))

# A tibble: 3 x 3
# Groups:   ID, Gender [3]
     ID Gender Score
  <dbl> <fctr> <dbl>
1     1 Female    40
2     1   Male    30
3     2   Male    20

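I am trying to group rows by ID and Gender, and then keep only the row with the lowest Score in each group. The code above works as intended: for ID == 2, only the entry with the lower score is kept.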

mydat2 <- data.frame(ID = c(1, 1, 2, 2),
                    Gender = c("Male", "Female", "Male", "Male"),
                    Score = c(NA, NA, 20, 60))

mydat2 %>%
  group_by(ID, Gender) %>%
  slice(which.min(Score))

# A tibble: 1 x 3
# Groups:   ID, Gender [1]
     ID Gender Score
  <dbl> <fctr> <dbl>
1     2   Male    20

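However, when a group contains only NAs, which.min does not behave the way I want, because it does not return a valid index. As a result, all of my ID == 1 entries are dropped. In this case, my desired output is: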

# A tibble: 3 x 3
# Groups:   ID, Gender [3]
     ID Gender Score
  <dbl> <fctr> <dbl>
1     1 Female    NA
2     1   Male    NA
3     2   Male    20

How can I modify my code to handle this?

Edit:

df2 <- structure(list(pubmed_id = c(23091106L, 23091106L),
                      Gender = structure(c(4L, 4L),
                                         .Label = c("", "Both", "female", "Female", "Male"),
                                         class = "factor"),
                      Total_Carrier = c(NA, 1107)),
                 class = c("grouped_df", "tbl_df", "tbl", "data.frame"),
                 row.names = c(NA, -2L), vars = "pubmed_id", drop = TRUE,
                 indices = list(0:1), group_sizes = 2L, biggest_group_size = 2L,
                 labels = structure(list(pubmed_id = 23091106L), class = "data.frame",
                                    row.names = c(NA, -1L), vars = "pubmed_id",
                                    drop = TRUE, .Names = "pubmed_id"),
                 .Names = c("pubmed_id", "Gender", "Total_Carrier"))

> df2
# A tibble: 2 x 3
# Groups:   pubmed_id [1]
  pubmed_id Gender Total_Carrier
      <int> <fctr>         <dbl>
1  23091106 Female            NA
2  23091106 Female          1107

In this example, I want the output to contain only row 2 (the row whose carrier sample size is 1107). However, I get the following result:

> df2 %>%
   group_by(pubmed_id, Gender) %>%
   slice(which.min(Total_Carrier) || 1)

# A tibble: 1 x 3
# Groups:   pubmed_id, Gender [1]
  pubmed_id Gender Total_Carrier
      <int> <fctr>         <dbl>
1  23091106 Female            NA

【Comments】:

    Tags: r dataframe dplyr


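    【Solution 1】:

    which.min ignores missing values and returns integer(0) when the input vector contains only NAs. You can add a conditional check inside slice so that the first row is selected whenever every Score in a group is NA: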

    mydat2 %>%
         group_by(ID, Gender) %>%
         slice({idx <- which.min(Score); if(length(idx) > 0) idx else 1})
    
    # A tibble: 3 x 3
    # Groups:   ID, Gender [3]
    #     ID Gender Score
    #  <dbl> <fctr> <dbl>
    #1     1 Female    NA
    #2     1   Male    NA
    #3     2   Male    20
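
The behaviour this check guards against can be reproduced in base R, without dplyr. The sketch below wraps the braced expression above in a hypothetical helper, `first_min_or_first()`; that name is made up for illustration and is not part of any package.

```r
# which.min() skips NA values, so an all-NA vector yields a zero-length index.
which.min(c(NA, 1107))      # 2
which.min(c(NA_real_, NA))  # integer(0)

# Hypothetical helper (not from dplyr): index of the minimum,
# or 1 when every value is NA.
first_min_or_first <- function(x) {
  idx <- which.min(x)
  if (length(idx) > 0) idx else 1L
}

first_min_or_first(c(NA, 1107))      # 2
first_min_or_first(c(NA_real_, NA))  # 1
first_min_or_first(c(30, 20, 60))    # 2
```

Inside a grouped pipeline, `slice(first_min_or_first(Score))` should then behave like the expression above. For df2 from the question the index is 2, so the 1107 row is kept rather than the NA row. By contrast, `which.min(x) || 1` evaluates to a single logical value rather than an index, because `||` is a logical operator.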
    

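    【Comments】:

      【Solution 2】:

      You can also use arrange to sort the scores within each group and then use slice to select the first row of each group. Because arrange places NA last, a group that contains only NAs still yields its first row: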

      mydat2 %>%
      group_by(ID, Gender) %>%
      arrange(ID,Gender,Score) %>%
      slice(1)
           ID Gender Score
        <dbl> <fctr> <dbl>
      1     1 Female    NA
      2     1   Male    NA
      3     2   Male    20
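
The sort-then-take-first idea can also be checked in base R, since order() sorts NA last by default (na.last = TRUE), just as arrange() does. This is only a sketch: the data frame `d` and the names `sorted` and `first_rows` are made up for illustration, and `d` adds one extra row to mydat2 so that one group mixes NA with real values.

```r
# Same shape as mydat2, plus a group that mixes NA and non-missing scores.
d <- data.frame(ID     = c(1, 1, 2, 2, 2),
                Gender = c("Male", "Female", "Male", "Male", "Male"),
                Score  = c(NA, NA, 60, NA, 20))

# order() puts NA last within ties on ID and Gender, like arrange().
sorted <- d[order(d$ID, d$Gender, d$Score), ]

# Keep the first row of each (ID, Gender) combination.
first_rows <- sorted[!duplicated(sorted[c("ID", "Gender")]), ]
first_rows$Score  # NA NA 20
```

The all-NA groups keep a row with Score NA, while the mixed group keeps its smallest non-missing score.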
      

      【Comments】:

        【Solution 3】:

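        Here is another option using which and pmax. The index of the first minimum is NA when a group has no non-missing Score, and pmax(..., na.rm = TRUE) then falls back to 1. Note that min(Score, na.rm = TRUE) emits a warning for all-NA groups.

        mydat2 %>%
           group_by(ID, Gender) %>% 
           slice(pmax(1, which(Score == min(Score, na.rm = TRUE))[1], na.rm = TRUE))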
        # A tibble: 3 x 3
        # Groups:   ID, Gender [3]
        #      ID Gender Score
        #   <dbl> <fctr> <dbl>
        #1     1 Female    NA
        #2     1   Male    NA
        #3     2   Male    20
        

        【Comments】:

          【Solution 4】:

          A solution using data.table:

          library(data.table)
          setDT(mydat2)
          mydat2[, .(Score = sort(Score)[1]), by = .(ID, Gender)]
          #    ID Gender Score
          # 1:  1   Male    NA
          # 2:  1 Female    NA
          # 3:  2   Male    20
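
This expression relies on sort() dropping missing values by default (na.last = NA), so sort(Score)[1] is the smallest non-missing score, or NA when nothing is left to index. A short base R check of that behaviour, using made-up input vectors:

```r
# sort() removes NA by default, so the first element is the smallest real value.
sort(c(NA, 1107))[1]    # 1107
sort(c(60, NA, 20))[1]  # 20

# For an all-NA group nothing survives; indexing past the end yields NA.
sort(c(NA_real_, NA_real_))     # numeric(0)
sort(c(NA_real_, NA_real_))[1]  # NA
```

Note that this approach returns only the grouping columns and Score. Any other columns of the original row are not carried along, unlike the slice-based answers.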
          

          【Comments】:
