【Question title】: In R, looking to remove duplicates from certain rows, and combine rows on others
【Posted】: 2021-03-16 22:19:16
【Question】:

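I have a data frame with 5 columns. I want to remove duplicates based on the "Opp_id" column, but I would like to combine the records in the last two columns, "Sales" and "Marketing". The last two columns also contain NAs. I have tried a few methods, but none of them had the desired effect.

Here is the initial table: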

|  Name       |  Company   |  Opp_id  |  Sales  | Marketing
|  John S.    |  Amazon    |  12354   |  Yes    |  NA 
|  Bill W.    |  Google    |  15566   |  NA     |  Yes
|  Darryl W.  |  Facebook  |  98456   |  NA     |  Yes
|  Darryl W.  |  Facebook  |  98456   |  Yes    |  NA
|  Tom S.     |  Zillow    |  87423   |  NA     |  Yes
|  Tom S.     |  Zillow    |  87423   |  Yes    |  NA
|  Tom S.     |  Zillow    |  87423   |  Yes    |  NA

Here is the desired result table:

|  Name      |  Company    |  Opp_id   |  Sales |  Marketing
|  John S.   |  Amazon     |  12354    |  Yes   |  NA
|  Bill W.   |  Google     |  15566    |  NA    |  Yes
|  Darryl W. |  Facebook   |  98456    |  Yes   |  Yes
|  Tom S.    |  Zillow     |  87423    |  Yes   |  NA

【Comments】:

    Tags: r dplyr data-wrangling


    【Solution 1】:

    If I understood your question correctly, this might be a solution with dplyr:

    # read in the data you supplied (I removed the leading |)
    library(data.table)
    df <- data.table::fread(" Name       |  Company   |  Opp_id  |  Sales  | Marketing
    John S.    |  Amazon    |  12354   |  Yes    |  NA 
    Bill W.    |  Google    |  15566   |  NA     |  Yes
    Darryl W.  |  Facebook  |  98456   |  NA     |  Yes
    Darryl W.  |  Facebook  |  98456   |  Yes    |  NA
    Tom S.     |  Zillow    |  87423   |  NA     |  Yes
    Tom S.     |  Zillow    |  87423   |  Yes    |  NA
    Tom S.     |  Zillow    |  87423   |  Yes    |  NA")
    
    # calculations
    library(dplyr)
    df %>% 
      # convert both flag columns to logical: NA becomes FALSE, anything else becomes TRUE
      dplyr::mutate(across(c(Sales, Marketing), ~ !is.na(.x))) %>% 
      # build the grouping (I am assuming these columns always match within an Opp_id; otherwise group by Opp_id only)
      dplyr::group_by(Name, Company, Opp_id) %>% 
      # summarise collapses each group to one row; any() returns TRUE if at least one TRUE is found in the group
      dplyr::summarise(across(c(Sales, Marketing), ~ any(.x))) %>%
      # always safer to remove the grouping unless you specifically need it
      dplyr::ungroup() 
    
    # be aware that the output was reordered
      Name      Company  Opp_id Sales Marketing
      <chr>     <chr>     <int> <lgl> <lgl>    
    1 Bill W.   Google    15566 FALSE TRUE     
    2 Darryl W. Facebook  98456 TRUE  TRUE     
    3 John S.   Amazon    12354 TRUE  FALSE    
    4 Tom S.    Zillow    87423 TRUE  TRUE  
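
The output above differs from the requested table in two ways: the flags are logical rather than "Yes"/NA, and the rows are sorted by the grouping columns. A minimal base R sketch of one way to map it back is shown here. It is not part of the original answer; `res` is rebuilt by hand from the printed output, and `original_ids` and `flag_cols` are names introduced only for this illustration.

```r
# Summarised result as printed above, rebuilt by hand so the sketch is self-contained
res <- data.frame(
  Name      = c("Bill W.", "Darryl W.", "John S.", "Tom S."),
  Company   = c("Google", "Facebook", "Amazon", "Zillow"),
  Opp_id    = c(15566L, 98456L, 12354L, 87423L),
  Sales     = c(FALSE, TRUE, TRUE, TRUE),
  Marketing = c(TRUE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Opp_id column of the original input, in its original row order
original_ids <- c(12354L, 15566L, 98456L, 98456L, 87423L, 87423L, 87423L)

# Turn TRUE back into "Yes" and FALSE back into NA
flag_cols <- c("Sales", "Marketing")
res[flag_cols] <- lapply(res[flag_cols], function(x) ifelse(x, "Yes", NA_character_))

# Restore the order in which each Opp_id first appeared in the input
res <- res[order(match(res$Opp_id, unique(original_ids))), ]
rownames(res) <- NULL

print(res)
```

This assumes every non-NA flag in the input was "Yes"; if other values can occur, the logical conversion loses them and a different summary function is needed.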
    

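    【Comments】:

    • Thanks, although I get a "lifecycle" package error when I run the code. I have loaded the packages/libraries, but for some reason I still get this error. Any ideas? Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) : there is no package called 'lifecycle'
    • My best guess would be to reinstall the tidyverse and/or dplyr packages as well as the lifecycle package (maybe it already ships with the tidyverse)
    【Solution 2】: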
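
If the dplyr installation cannot be repaired right away, the same result can be sketched in base R, which needs no extra packages. This is an illustrative alternative, not the answerer's method. It assumes that Opp_id alone identifies Name and Company and that every non-NA flag is "Yes"; `has_yes`, `flag_cols`, and `out` are names made up for this example.

```r
# Input data from the question
df <- data.frame(
  Name      = c("John S.", "Bill W.", "Darryl W.", "Darryl W.", "Tom S.", "Tom S.", "Tom S."),
  Company   = c("Amazon", "Google", "Facebook", "Facebook", "Zillow", "Zillow", "Zillow"),
  Opp_id    = c(12354L, 15566L, 98456L, 98456L, 87423L, 87423L, 87423L),
  Sales     = c("Yes", NA, NA, "Yes", NA, "Yes", "Yes"),
  Marketing = c(NA, "Yes", "Yes", NA, "Yes", NA, NA),
  stringsAsFactors = FALSE
)

# "Yes" if the group has at least one non-NA value, otherwise NA
has_yes <- function(x) if (any(!is.na(x))) "Yes" else NA_character_

# Keep the first row of each Opp_id for the identifying columns
out <- df[!duplicated(df$Opp_id), c("Name", "Company", "Opp_id")]
rownames(out) <- NULL

# Collapse each flag column per Opp_id and look the result up by id
flag_cols <- c("Sales", "Marketing")
for (col in flag_cols) {
  per_id <- vapply(split(df[[col]], df$Opp_id), has_yes, character(1))
  out[[col]] <- unname(per_id[as.character(out$Opp_id)])
}

print(out)
```

Because the rows come from `!duplicated()`, the result keeps the order in which each Opp_id first appears in the input.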

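    We can group by 'Name', 'Company', 'Opp_id' and summarise across the 'Sales' and 'Marketing' columns by selecting the first element after ordering on the logical vector is.na(.), which moves the non-NA values to the front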

    library(dplyr)
    df1 %>%
        group_by(Name, Company, Opp_id) %>% 
        summarise(across(c(Sales, Marketing),
           ~ first(.[order(is.na(.))])), .groups = 'drop')
    

    -output

    # A tibble: 4 x 5
    #  Name      Company  Opp_id Sales Marketing
    #* <chr>     <chr>     <int> <chr> <chr>    
    #1 Bill W.   Google    15566 <NA>  Yes      
    #2 Darryl W. Facebook  98456 Yes   Yes      
    #3 John S.   Amazon    12354 Yes   <NA>     
    #4 Tom S.    Zillow    87423 Yes   Yes   
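
To see why ordering on `is.na()` picks a non-NA value when one exists, the idiom can be tried on a single vector. The sketch below is an illustration only and uses base R indexing `[1]` in place of `dplyr::first()`; the variable names are made up for the example.

```r
# One group's Marketing values: the non-NA entries are not in first position
x <- c(NA, "Yes", "Yes")

# FALSE sorts before TRUE, so non-NA positions come first; ties keep input order
ord <- order(is.na(x))
print(ord)

# The first element after reordering is a non-NA value whenever one exists
first_non_na <- x[ord][1]
print(first_non_na)

# If the whole group is NA, the result is still NA
all_na <- c(NA_character_, NA_character_)
first_all_na <- all_na[order(is.na(all_na))][1]
print(first_all_na)
```

Unlike the logical approach in Solution 1, this keeps whatever non-NA value is stored in the column, so it also works when the flags hold values other than "Yes".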
    

    data

    df1 <- structure(list(Name = c("John S.", "Bill W.", "Darryl W.", "Darryl W.", 
    "Tom S.", "Tom S.", "Tom S."), Company = c("Amazon", "Google", 
    "Facebook", "Facebook", "Zillow", "Zillow", "Zillow"), Opp_id = c(12354L, 
    15566L, 98456L, 98456L, 87423L, 87423L, 87423L), Sales = c("Yes", 
    NA, NA, "Yes", NA, "Yes", "Yes"), Marketing = c(NA, "Yes", "Yes", 
    NA, "Yes", NA, NA)), class = "data.frame", row.names = c(NA, 
    -7L))
    
