Remove duplicates based on conditions in rows in a dataframe: answers

【Question title】: Remove duplicates based on conditions in rows in a dataframe
【Posted】: 2021-08-09 14:58:28
【Question description】:

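I have a dataframe with many duplicated names; a reproducible example is given below.
I am trying to clean the dataset: among rows that share a name, I want to remove the ones that carry the least information.
I added a column that holds the percentage of NA cells in each row; in my example it is called %_Scoring.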
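
For readers who want to rebuild such a score column, the share of NA cells per row can be computed in base R with rowMeans(is.na(...)). This is only a sketch on a hypothetical data frame (the names and values below are invented, and the question does not show how %_Scoring was actually computed):

```r
# Hypothetical data; not the contents of the linked CSV.
people <- data.frame(
  Name        = c("John Doe", "John Doe", "Margarita Pan"),
  Information = c("a", NA, "b"),
  Height      = c(1.88, NA, 1.47)
)

# is.na() on a data frame returns a logical matrix;
# rowMeans() then gives the fraction of NA cells in each row.
na_share <- rowMeans(is.na(people))
print(na_share)
```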

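Among rows with a duplicated name, I want to keep the one with the lowest %_Scoring (% of NA).
N.B. If the %_Scoring values are equal, it does not matter which row is kept, but one of the two rows should still be removed.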

data_people <- "https://raw.githubusercontent.com/max9nc9/Temp/main/data_people.csv"
data_people <- read.csv(data_people, sep = ",")
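
A detail that matters for the answers: with its default check.names = TRUE, read.csv() passes the header through make.names(), so a column written as %_Scoring in the file is loaded as X._Scoring. A minimal sketch, using a small inline CSV whose values are invented for illustration (it is not the linked file):

```r
# Hypothetical inline CSV; only the shape of the header matters here.
csv_text <- "Name,Height,%_Scoring
John Doe,1.88,0.89
John Doe,,0.56"

# Default behaviour: header sanitised into syntactically valid R names.
default_names <- names(read.csv(text = csv_text))

# check.names = FALSE keeps the header exactly as written in the file.
raw_names <- names(read.csv(text = csv_text, check.names = FALSE))

print(default_names)
print(raw_names)
```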

In the data example above, I would keep only 2 rows:

The first row would be Margarita Pan
The second row would be John Doe, the one where %_Scoring = 0.56

【Question comments】:

OK, I edited my post, thanks!

Tags: r dataframe duplicates data-wrangling

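【Solution 1】:

After grouping by Name, slice_max keeps the row with the highest X._Scoring in each group (with_ties = FALSE guarantees a single row per name even when scores are equal)

library(dplyr)
data_people %>% 
    group_by(Name) %>%
    slice_max(n = 1, order_by = X._Scoring, with_ties = FALSE) %>%
    ungroup()

-output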

# A tibble: 2 x 4
  Name          Information                    Height X._Scoring
  <chr>         <chr>                           <dbl>      <dbl>
1 John Doe      This is an information           1.88       0.89
2 Margarita Pan This is an information as well   1.47       0.78

Or, if we want to keep the minimum instead, which is what the question asks for, use slice_min

data_people %>% 
    group_by(Name) %>%
    slice_min(n = 1, order_by = X._Scoring, with_ties = FALSE) %>%
    ungroup()
# A tibble: 2 x 4
  Name          Information                    Height X._Scoring
  <chr>         <chr>                           <dbl>      <dbl>
1 John Doe      This is an information          NA          0.56
2 Margarita Pan This is an information as well   1.47       0.78
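
The question asks that one row still be dropped when two scores are equal. slice_min() and slice_max() keep tied rows by default (with_ties = TRUE), so the tie case deserves a check. The sketch below uses a hypothetical tibble with a deliberate tie; the values are invented and are not taken from the linked CSV:

```r
library(dplyr)

# Hypothetical data: two "John Doe" rows share the lowest score.
people <- tibble(
  Name       = c("John Doe", "John Doe", "John Doe", "Margarita Pan"),
  Height     = c(1.88, NA, 1.80, 1.47),
  X._Scoring = c(0.89, 0.56, 0.56, 0.78)
)

# Default with_ties = TRUE: both tied rows survive.
ties_kept <- people %>%
  group_by(Name) %>%
  slice_min(order_by = X._Scoring, n = 1) %>%
  ungroup()

# with_ties = FALSE: exactly one row per name.
one_per_name <- people %>%
  group_by(Name) %>%
  slice_min(order_by = X._Scoring, n = 1, with_ties = FALSE) %>%
  ungroup()

print(nrow(ties_kept))
print(nrow(one_per_name))
```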

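【Comments】:

【Solution 2】:

Sort by X._Scoring so the lowest score comes first, then keep only the first row seen for each Name

library(dplyr)
data_people %>% 
    arrange(X._Scoring) %>%          # lowest score first
    filter(!duplicated(Name))        # drop every later row with the same name

Output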

  Name          Information                    Height X._Scoring
  <chr>         <chr>                           <dbl>      <dbl>
1 John Doe      This is an information          NA          0.56
2 Margarita Pan This is an information as well   1.47       0.78
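
The same "sort, then keep the first row per name" idea can also be written with distinct() and .keep_all = TRUE. This is a sketch on a hypothetical tibble (invented values, same column names as in the answers), not a replacement endorsed by the original answerer:

```r
library(dplyr)

# Hypothetical data mirroring the column names used above.
people <- tibble(
  Name       = c("John Doe", "John Doe", "Margarita Pan"),
  Height     = c(1.88, NA, 1.47),
  X._Scoring = c(0.89, 0.56, 0.78)
)

deduped <- people %>%
  arrange(X._Scoring) %>%            # lowest score first
  distinct(Name, .keep_all = TRUE)   # first row per Name, all columns kept

print(deduped)
```

Because distinct() keeps the first occurrence of each Name, the prior arrange() is what decides which duplicate survives.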

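【Comments】:

【Solution 3】:

A base R option with duplicated + ave, which counts the non-NA cells in each row directly instead of relying on X._Scoring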

subset(
  data_people,
  !duplicated(Name) & ave(rowSums(!is.na(data_people)), Name, FUN = function(x) x == max(x))
)

which gives

           Name                    Information Height X._Scoring
1      John Doe         This is an information   1.88       0.89
2 Margarita Pan This is an information as well   1.47       0.78
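
The output above keeps the John Doe row with the most non-NA cells, which is not the row with the lowest X._Scoring that the question asked for. If the score column should drive the choice, one possible base R sketch is to order by the score and then drop later duplicates. The data frame below is hypothetical (invented values, same column names):

```r
# Hypothetical data mirroring the column names used above.
people <- data.frame(
  Name       = c("John Doe", "John Doe", "Margarita Pan"),
  Height     = c(1.88, NA, 1.47),
  X._Scoring = c(0.89, 0.56, 0.78)
)

# Sort so the lowest score comes first for every name ...
sorted <- people[order(people$X._Scoring), ]

# ... then keep only the first occurrence of each name.
deduped <- sorted[!duplicated(sorted$Name), ]

print(deduped)
```

Since duplicated() flags every occurrence after the first, ties are resolved automatically: only one row per name can remain.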

【Comments】: