【Question Title】: Remove duplicate rows based on conditional matching in another column
【Posted】: 2021-05-20 20:56:20
【Question】:

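In the data below, I am trying to remove duplicate rows based on the mid column. Where a mid value is duplicated, I want to keep only the row whose kpi is B. This should be done within each county group.

I am only showing the duplicates here; the dput data contains more than just the duplicates.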

# A tibble: 34 x 3
   county mid        kpi  
   <chr>  <chr>      <chr>
 1 Athens 1          A    
 2 Athens 1          B    
 3 Athens 2.13       A    
 4 Athens 2.13       B    
 5 Athens 2.3        A    
 6 Athens 2.3        B    
 7 Athens 2.4        A    
 8 Athens 2.4        B    
 9 Athens 3.3        A    
10 Athens 3.3        B    

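From the table above, I want to keep all the B rows among the duplicates. I cannot simply use filter(kpi %in% "B"), because the data below also contains A and B values that are not duplicated, and I want to keep those too.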

structure(list(county = c("Athens", "Athens", "Athens", "Athens", 
"Athens", "Athens", "Athens", "Athens", "Athens", "Athens", "Athens", 
"Athens", "Athens", "Athens", "Athens", "Athens", "Athens", "Athens", 
"Athens", "Athens", "Athens", "Athens", "Athens", "Athens", "Athens", 
"Athens", "Athens", "Athens", "Athens", "Athens", "Athens", "Athens", 
"Athens", "Athens"), measure_id = c("1", "1", "2.13", "2.13", 
"2.3", "2.3", "2.4", "2.4", "3.3", "3.3", "2.12.1", "2.12.1", 
"2.14.3", "2.14.3", "2.3.1", "2.3.1", "2.3.2", "2.3.2", "2.5.1", 
"2.5.1", "2.5.4", "2.5.4", "2.5.5", "2.5.5", "2.6.4", "2.6.4", 
"2.7.4", "2.7.4", "2.8.1", "2.8.1", "2.8.2", "2.8.2", "2.9.1", 
"2.9.1"), kpi = c("A", "B", "A", "B", "A", "B", "A", "B", "A", 
"B", "A", "B", "A", "B", "A", "B", "A", "B", "A", "B", "A", "B", 
"A", "B", "A", "B", "A", "B", "A", "B", "A", "B", "A", "B")), spec = structure(list(
    cols = list(county = structure(list(), class = c("collector_character", 
    "collector")), mid = structure(list(), class = c("collector_character", 
    "collector")), kpi = structure(list(), class = c("collector_character", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), delim = "\t"), class = "col_spec"), problems = <pointer: 0x0000015517989d70>, row.names = c(NA, 
-34L), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
))  
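To make the problem easy to reproduce without the full dput, here is a minimal sketch on a small stand-in data set. The values in `df` are hypothetical and chosen only to cover the three cases (a duplicated mid, a mid that appears only with A, and a mid that appears only with B); the column is called `mid` as in the printed table. It shows why a plain filter on kpi is not enough.

```r
library(dplyr)

# Hypothetical stand-in data: mids "1" and "2.3" are duplicated,
# "1.1" appears only with kpi A, and "2.6.7" appears only with kpi B.
df <- tibble(
  county = "Athens",
  mid    = c("1", "1", "1.1", "2.3", "2.3", "2.6.7"),
  kpi    = c("A", "B", "A",   "A",   "B",   "B")
)

# Naive approach: keep only the B rows.
# This also discards "1.1", whose single row has kpi A and should be kept.
naive <- df %>% filter(kpi == "B")

nrow(naive)           # number of rows that survive the naive filter
"1.1" %in% naive$mid  # whether the unduplicated A row is still there
```

The desired result for this stand-in data would have four rows: the B rows for "1" and "2.3", plus the single rows for "1.1" and "2.6.7".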

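【Comments】:

  • Updated the question; does that help?
  • Please see my solution below.
  • I think that in the new dataset, mid is measure_id.
  • There are no duplicates in the reproducible example you show.

Tags: r dplyr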


【Solution 1】:

We can use anti_join once the duplicates have been identified.

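library(dplyr)

# Second and later occurrences of each mid, relabelled so they match the "A" rows
df1 <- df %>% 
  filter(duplicated(mid)) %>% 
  mutate(kpi = replace(kpi, kpi == "B", "A")) 

# Drop the "A" row of every duplicated mid; all other rows are kept
anti_join(df, df1, by = c("county", "mid", "kpi"))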

Output:

   county mid     kpi  
   <chr>  <chr>   <chr>
 1 Athens 1.1     A    
 2 Athens 1.2     A    
 3 Athens 1.3     A    
 4 Athens 1.4     A    
 5 Athens 1.5     A    
 6 Athens 1.6     A    
 7 Athens 2.1.1   A    
 8 Athens 2.1.2   A    
 9 Athens 2.1.3   A    
10 Athens 2.1.4   A    
11 Athens 2.2.1   A    
12 Athens 2.2.2   A    
13 Athens 2.2.3   A    
14 Athens 2.2.4   A    
15 Athens 2.3.1   B    
16 Athens 2.3.2   B    
17 Athens 2.3.3   A    
18 Athens 2.3.4   A    
19 Athens 2.3.5   A    
20 Athens 2.3.6   A    
21 Athens 2.11    A    
22 Athens 2.16    A    
23 Athens 2.3     B    
24 Athens 2.4     B    
25 Athens 2.5.2   A    
26 Athens 2.5.3   A    
27 Athens 2.5.3.A A    
28 Athens 2.5.3.B A    
29 Athens 2.5.5   B    
30 Athens 2.6.1   A    
31 Athens 2.6.2   A    
32 Athens 2.6.3   A    
33 Athens 2.6.4   B    
34 Athens 2.6.5   A    
35 Athens 2.6.6   A    
36 Athens 2.6.7   B    
37 Athens 2.7.2   A    
38 Athens 2.7.3   A    
39 Athens 2.7.3.A A    
40 Athens 2.7.3.B A    
41 Athens 2.7.4   B    
42 Athens 2.7.5   A    
43 Athens 2.7.6   A    
44 Athens 2.9.1   B    
45 Athens 2.9.2   A    
46 Athens 2.12.1  B    
47 Athens 2.12.2  A    
48 Athens 2.15.1  A    
49 Athens 2.15.2  A    
50 Athens 2.15.3  A    
51 Athens 2.19    A    
52 Athens 3.8     A    
53 Athens 1       B    
54 Athens 2.1     A    
55 Athens 2.2     A    
56 Athens 2.5.1   B    
57 Athens 2.5.4   B    
58 Athens 2.7.1   A    
59 Athens 2.8.1   B    
60 Athens 2.8.2   B    
61 Athens 2.13    B    
62 Athens 2.13.A  A    
63 Athens 2.13.B  A    
64 Athens 2.13.C  A    
65 Athens 2.13.D  A    
66 Athens 2.14.3  B    
67 Athens 2.17    A    
68 Athens 2.18    A    
69 Athens 3.1     A    
70 Athens 3.2     A    
71 Athens 3.3     B  
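One thing to watch with this approach: duplicated(mid) looks at mid alone, so a mid that appears once in each of two counties is also flagged as a duplicate. Since the question asks for this to work per county, a possible variant is sketched below. It is only a sketch on hypothetical data (the second county, "Meigs", and all values are made up for illustration), and `to_drop` is a name introduced here.

```r
library(dplyr)

# Hypothetical data: mid "1" is duplicated within Athens,
# while mid "1.1" appears once in Athens and once in Meigs.
df <- tibble(
  county = c("Athens", "Athens", "Athens", "Meigs"),
  mid    = c("1",      "1",      "1.1",    "1.1"),
  kpi    = c("A",      "B",      "A",      "A")
)

# Rows to drop: the A row of every (county, mid) pair that occurs more than once
to_drop <- df %>%
  group_by(county, mid) %>%
  filter(n() > 1, kpi == "A") %>%
  ungroup()

# Remove exactly those rows; unduplicated rows in every county are untouched
result <- anti_join(df, to_drop, by = c("county", "mid", "kpi"))
result
```

Here only the Athens / 1 / A row is removed. Both 1.1 rows stay, because neither is duplicated within its own county.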

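【Comments】:

  • The example table has duplicates. The full dput, however, contains the complete data. I only showed an example of the rows that need to be kept.

【Solution 2】:

I think the following solution will help you: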

library(dplyr)

df %>% 
  group_by(county, mid) %>%
  mutate(duplicate = n() > 1) %>% 
  filter(!duplicate | (duplicate & kpi == "B")) %>% 
  select(-duplicate)


# A tibble: 71 x 3
# Groups:   county, mid [71]
   county mid   kpi  
   <chr>  <chr> <chr>
 1 Athens 1.1   A    
 2 Athens 1.2   A    
 3 Athens 1.3   A    
 4 Athens 1.4   A    
 5 Athens 1.5   A    
 6 Athens 1.6   A    
 7 Athens 2.1.1 A    
 8 Athens 2.1.2 A    
 9 Athens 2.1.3 A    
10 Athens 2.1.4 A    
# ... with 61 more rows
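The same grouping idea can probably be written without the helper column, since `!duplicate | (duplicate & kpi == "B")` reduces to "the group has one row, or kpi is B". The sketch below uses hypothetical stand-in data and adds ungroup() so the result is a plain tibble; treat it as an illustration under those assumptions, not as a tested replacement for the real data.

```r
library(dplyr)

# Hypothetical stand-in data covering duplicated and unduplicated mids
df <- tibble(
  county = "Athens",
  mid    = c("1", "1", "1.1", "2.3", "2.3", "2.6.7"),
  kpi    = c("A", "B", "A",   "A",   "B",   "B")
)

result <- df %>%
  group_by(county, mid) %>%
  # Keep a row if it is the only one in its group, or if its kpi is B
  filter(n() == 1 | kpi == "B") %>%
  ungroup()

result
```

Note that a duplicated group with no B row at all would be dropped entirely by this condition, which is also how the version with the duplicate column behaves.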

【Comments】:

【Solution 3】:

We can use add_count:

library(dplyr)
df %>% 
  add_count(county, measure_id) %>%
  filter(n < 2 | (n > 1 & kpi == 'B')) %>%
  select(-n)
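If dplyr is not available, the same count-then-filter logic can be sketched in base R with ave(). This swaps in a different technique from the answer above, and the data frame here is a hypothetical stand-in that uses the measure_id column name from the dput.

```r
# Hypothetical stand-in data using the measure_id column name
df <- data.frame(
  county     = "Athens",
  measure_id = c("1", "1", "1.1", "2.3", "2.3", "2.6.7"),
  kpi        = c("A", "B", "A",   "A",   "B",   "B"),
  stringsAsFactors = FALSE
)

# Number of rows in each (county, measure_id) group, repeated for every row
n <- ave(seq_len(nrow(df)), df$county, df$measure_id, FUN = length)

# Keep rows that are unique in their group, or whose kpi is B
result <- df[n == 1 | df$kpi == "B", ]
result
```

The vector `n` plays the same role as the column that add_count() creates, so the filter condition reads the same way.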
    

【Comments】:

  • @user5249203 If your data were more representative, I could test this with more compact code. Thanks.