【Question Title】: Remove duplicates based on conditions in rows in a dataframe
【Posted】: 2021-08-09 14:58:28
【Question Description】:

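I have a dataframe with many duplicated names; a reproducible example is below.
I am trying to clean the dataset by removing rows that have a duplicated name and the least information.
I added a column that calculates the % of NA cells in each row; in my example it is called %_Scoring.

Among rows with the same name, I want to keep the one with the lowest %_Scoring (% of NA).
N.B. If the %_Scoring values are equal, it does not matter which row is kept, but one of the two should still be removed.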
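
For readers who want to reproduce such a scoring column, the share of NA cells per row can be computed with rowMeans(is.na(...)). This is only a sketch: the data frame `people` and the column name `na_share` are made up for illustration, and the question does not say how %_Scoring was actually calculated.

```r
# Hypothetical stand-in data; this is not the asker's CSV.
people <- data.frame(
  Name = c("John Doe", "Margarita Pan", "John Doe"),
  Information = c("This is an information",
                  "This is an information as well",
                  "This is an information"),
  Height = c(1.88, 1.47, NA)
)

# is.na() gives a logical matrix; rowMeans() turns each row into the
# fraction of its cells that are NA (0 = complete row, 1 = all NA).
people$na_share <- rowMeans(is.na(people))
```

The right-hand side is evaluated before the new column is attached, so the score does not count itself.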

data_people <- "https://raw.githubusercontent.com/max9nc9/Temp/main/data_people.csv"
data_people <- read.csv(data_people, sep = ",")

In the data example above, I would keep only 2 rows:

  • The first would be Margarita Pan
  • The second would be John Doe, the row where %_Scoring = 0.56
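
The answers refer to the column as X._Scoring rather than %_Scoring. By default read.csv() passes the header through make.names(), which rewrites names that are not syntactically valid in R. A small sketch with inline CSV text follows; the three rows are copied from the outputs shown in the answers, and the real file may well contain more rows.

```r
# Stand-in for the remote CSV, built from rows visible in the answers' output.
csv_text <- "Name,Information,Height,%_Scoring
John Doe,This is an information,1.88,0.89
Margarita Pan,This is an information as well,1.47,0.78
John Doe,This is an information,,0.56"

sample_people <- read.csv(text = csv_text)
names(sample_people)   # the last name becomes "X._Scoring"

# check.names = FALSE keeps the header as written; the column then has to be
# referenced with backticks or [[ ]].
raw_names <- read.csv(text = csv_text, check.names = FALSE)
raw_names[["%_Scoring"]]
```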

【Comments】:

  • OK, I edited my post, thanks!

Tags: r dataframe duplicates data-wrangling


【Solution 1】:

Group by 'Name', then use slice_max to keep the row with the highest X._Scoring in each group

library(dplyr)
data_people %>% 
    group_by(Name) %>%
    slice_max(n = 1, order_by = X._Scoring) %>%
    ungroup

-output

# A tibble: 2 x 4
  Name          Information                    Height X._Scoring
  <chr>         <chr>                           <dbl>      <dbl>
1 John Doe      This is an information           1.88       0.89
2 Margarita Pan This is an information as well   1.47       0.78

Or, if we want to keep the minimum instead (the lowest % of NA, as asked), use slice_min

data_people %>% 
    group_by(Name) %>%
    # with_ties = FALSE returns exactly one row per name even when scores tie
    slice_min(n = 1, order_by = X._Scoring, with_ties = FALSE) %>%
    ungroup
# A tibble: 2 x 4
  Name          Information                    Height X._Scoring
  <chr>         <chr>                           <dbl>      <dbl>
1 John Doe      This is an information          NA          0.56
2 Margarita Pan This is an information as well   1.47       0.78
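
The question's N.B. asks that one row be dropped even when two scores are equal. slice_min() and slice_max() keep tied rows by default, so the `with_ties` argument matters in that case. A minimal sketch on made-up data (`tied_people` is hypothetical and is not part of the asker's CSV):

```r
library(dplyr)

# Hypothetical data with a tie: both "John Doe" rows share the lowest score.
tied_people <- data.frame(
  Name = c("John Doe", "John Doe", "Margarita Pan"),
  X._Scoring = c(0.56, 0.56, 0.78)
)

# Default behaviour (with_ties = TRUE): both tied rows survive.
with_default <- tied_people %>%
  group_by(Name) %>%
  slice_min(order_by = X._Scoring, n = 1) %>%
  ungroup()

# with_ties = FALSE: exactly one row per name.
no_ties <- tied_people %>%
  group_by(Name) %>%
  slice_min(order_by = X._Scoring, n = 1, with_ties = FALSE) %>%
  ungroup()

nrow(with_default)  # 3
nrow(no_ties)       # 2
```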

【Comments】:

    【Solution 2】:
    library(dplyr)
    data_people %>% 
        group_by(Name) %>% 
        arrange(X._Scoring) %>%          # sort so the lowest % of NA comes first
        filter(!duplicated(Name)) %>%    # keep the first (lowest-scoring) row of each name
        ungroup()
    

    Output

      Name          Information                    Height X._Scoring
      <chr>         <chr>                           <dbl>      <dbl>
    1 John Doe      This is an information          NA          0.56
    2 Margarita Pan This is an information as well   1.47       0.78
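
The same "sort, then keep the first row per name" idea can also be written with dplyr's distinct() and `.keep_all = TRUE`, which avoids grouping altogether. This is an alternative sketch rather than the answerer's code, and `people` is a small hypothetical stand-in for data_people:

```r
library(dplyr)

# Hypothetical stand-in using values visible in the outputs above.
people <- data.frame(
  Name = c("John Doe", "Margarita Pan", "John Doe"),
  Height = c(1.88, 1.47, NA),
  X._Scoring = c(0.89, 0.78, 0.56)
)

deduped <- people %>%
  arrange(X._Scoring) %>%            # lowest % of NA first
  distinct(Name, .keep_all = TRUE)   # first row per Name, all columns kept

deduped
```

Because distinct() keeps the first occurrence, ties on X._Scoring still leave exactly one row per name.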
    

    【Comments】:

      【Solution 3】:

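      A base R option with ave + which.max. Note that it ranks rows by their count of non-NA cells, computed from the data, rather than by X._Scoring

      subset(
        data_people,
        # keep, per name, the first row with the most non-NA cells
        ave(rowSums(!is.na(data_people)), Name, FUN = function(x) seq_along(x) == which.max(x)) == 1
      )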
      

      gives

                 Name                    Information Height X._Scoring
      1      John Doe         This is an information   1.88       0.89
      2 Margarita Pan This is an information as well   1.47       0.78
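
If the goal is instead to rank by the existing X._Scoring column, as the question describes, one base R sketch is to sort with order() and then drop later duplicates with duplicated(). The data frame `people` is a hypothetical stand-in built from values shown in the outputs above, not the asker's full CSV:

```r
# Hypothetical stand-in for data_people.
people <- data.frame(
  Name = c("John Doe", "Margarita Pan", "John Doe"),
  Height = c(1.88, 1.47, NA),
  X._Scoring = c(0.89, 0.78, 0.56)
)

# Sort so the lowest % of NA comes first, then keep the first row per name.
ordered_people <- people[order(people$X._Scoring), ]
deduped <- ordered_people[!duplicated(ordered_people$Name), ]

deduped
```

duplicated() flags every occurrence after the first, so tied scores still leave a single row per name.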
      

      【Comments】:
