In R, looking to remove duplicates from certain rows, and combine rows on others (answers)

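【Question title】: In R, looking to remove duplicates from certain rows, and combine rows on others
【Posted】: 2021-03-16 22:19:16
【Question】:

I have a data frame with 5 columns. I want to remove duplicates based on the "Opp_id" column, but merge the records in the last two columns, "Sales" and "Marketing". Those last two columns also contain NAs. I have tried several approaches, but none of them gave the desired result.

This is the initial table: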

|  Name       |  Company   |  Opp_id  |  Sales  | Marketing
|  John S.    |  Amazon    |  12354   |  Yes    |  NA 
|  Bill W.    |  Google    |  15566   |  NA     |  Yes
|  Darryl W.  |  Facebook  |  98456   |  NA     |  Yes
|  Darryl W.  |  Facebook  |  98456   |  Yes    |  NA
|  Tom S.     |  Zillow    |  87423   |  NA     |  Yes
|  Tom S.     |  Zillow    |  87423   |  Yes    |  NA
|  Tom S.     |  Zillow    |  87423   |  Yes    |  NA

This is the desired result table:

|  Name      |  Company    |  Opp_ID   |  Sales |  Marketing
|  John S.   |  Amazon     |  12354    |  Yes   |  NA
|  Bill W.   |  Google     |  15566    |  NA    |  Yes
|  Darryl W. |  Facebook   |  98456    |  Yes   |  Yes
|  Tom S.    |  Zillow     |  87423    |  Yes   |  NA

【Discussion】:

Tags: r dplyr data-wrangling

【Solution 1】:

If I understand your question correctly, this could be a dplyr solution:

# reading in the data you supplied ( I removed the leading | )
library(data.table)
df <- data.table::fread(" Name       |  Company   |  Opp_id  |  Sales  | Marketing
John S.    |  Amazon    |  12354   |  Yes    |  NA 
Bill W.    |  Google    |  15566   |  NA     |  Yes
Darryl W.  |  Facebook  |  98456   |  NA     |  Yes
Darryl W.  |  Facebook  |  98456   |  Yes    |  NA
Tom S.     |  Zillow    |  87423   |  NA     |  Yes
Tom S.     |  Zillow    |  87423   |  Yes    |  NA
Tom S.     |  Zillow    |  87423   |  Yes    |  NA")

# calculations
library(dplyr)
df %>% 
  # Convert both columns to logical: FALSE where the value is NA, TRUE otherwise (the same function applies to both)
  dplyr::mutate(across(c(Sales, Marketing), ~ !is.na(.x))) %>% 
  # Build the grouping (I am assuming the three columns match 100%; otherwise keep only Opp_id)
  dplyr::group_by(Name, Company, Opp_id) %>% 
  # summarise() collapses each group to one row; any() returns TRUE if at least one TRUE is found in the group
  dplyr::summarise(across(c(Sales, Marketing), ~ any(.x))) %>%
  # always safer to remove the grouping unless you specifically need it
  dplyr::ungroup() 

 # be aware that the output was reordered
  Name      Company  Opp_id Sales Marketing
  <chr>     <chr>     <int> <lgl> <lgl>    
1 Bill W.   Google    15566 FALSE TRUE     
2 Darryl W. Facebook  98456 TRUE  TRUE     
3 John S.   Amazon    12354 TRUE  FALSE    
4 Tom S.    Zillow    87423 TRUE  TRUE
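The result above holds TRUE/FALSE and is sorted by the grouping columns, whereas the table in the question uses "Yes"/NA and keeps the original row order. A minimal sketch of one way to get back to that shape, assuming dplyr 1.0 or later is installed and that the data frame is rebuilt by hand here rather than read with fread (the final `mutate()` and `arrange()` steps are additions, not part of the answer above):

```r
library(dplyr)

# Sample data copied from the question
df <- data.frame(
  Name      = c("John S.", "Bill W.", "Darryl W.", "Darryl W.", "Tom S.", "Tom S.", "Tom S."),
  Company   = c("Amazon", "Google", "Facebook", "Facebook", "Zillow", "Zillow", "Zillow"),
  Opp_id    = c(12354L, 15566L, 98456L, 98456L, 87423L, 87423L, 87423L),
  Sales     = c("Yes", NA, NA, "Yes", NA, "Yes", "Yes"),
  Marketing = c(NA, "Yes", "Yes", NA, "Yes", NA, NA),
  stringsAsFactors = FALSE
)

result <- df %>%
  # TRUE where a value is present, FALSE where it is NA
  mutate(across(c(Sales, Marketing), ~ !is.na(.x))) %>%
  group_by(Name, Company, Opp_id) %>%
  # One row per group; TRUE if any row in the group had a value
  summarise(across(c(Sales, Marketing), any), .groups = "drop") %>%
  # Map the logicals back to the "Yes"/NA coding used in the question
  mutate(across(c(Sales, Marketing), ~ ifelse(.x, "Yes", NA_character_))) %>%
  # Restore the order in which each Opp_id first appeared in the input
  arrange(match(Opp_id, unique(df$Opp_id)))

print(result)
```

With `.groups = "drop"` the separate `ungroup()` call is no longer needed.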

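【Discussion】:

Thanks, but when I run the code I get an error about the "lifecycle" package. I have loaded the packages/libraries, but for some reason I still get this error. Any ideas? Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) : there is no package called 'lifecycle'
My best guess would be to reinstall the tidyverse and/or dplyr packages as well as the lifecycle package (maybe it already comes with tidyverse).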
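The error says that the lifecycle package, which dplyr imports, is not installed; calling `library()` cannot load a package that is absent from the library path. A small sketch for checking this before reinstalling anything (the package list and the variable names `pkgs` and `missing_pkgs` are chosen here for illustration, and the install call is left commented out on purpose):

```r
# Packages the dplyr answer relies on; "lifecycle" is the one named in the error
pkgs <- c("dplyr", "lifecycle")

# requireNamespace() returns FALSE instead of raising an error when a package is absent
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to install them from your configured CRAN mirror:
  # install.packages(missing_pkgs)
} else {
  message("All required packages are available.")
}
```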

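【Solution 2】:

We can group by 'Name', 'Company' and 'Opp_id', then summarise across the 'Sales' and 'Marketing' columns by taking the first element after ordering each column on the logical vector is.na(.), so that non-NA values come first.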

library(dplyr)
df1 %>%
    group_by(Name, Company, Opp_id) %>% 
    summarise(across(c(Sales, Marketing),
       ~ first(.[order(is.na(.))])), .groups = 'drop')

-output

# A tibble: 4 x 5
#  Name      Company  Opp_id Sales Marketing
#* <chr>     <chr>     <int> <chr> <chr>    
#1 Bill W.   Google    15566 <NA>  Yes      
#2 Darryl W. Facebook  98456 Yes   Yes      
#3 John S.   Amazon    12354 Yes   <NA>     
#4 Tom S.    Zillow    87423 Yes   Yes

data

df1 <- structure(list(Name = c("John S.", "Bill W.", "Darryl W.", "Darryl W.", 
"Tom S.", "Tom S.", "Tom S."), Company = c("Amazon", "Google", 
"Facebook", "Facebook", "Zillow", "Zillow", "Zillow"), Opp_id = c(12354L, 
15566L, 98456L, 98456L, 87423L, 87423L, 87423L), Sales = c("Yes", 
NA, NA, "Yes", NA, "Yes", "Yes"), Marketing = c(NA, "Yes", "Yes", 
NA, "Yes", NA, NA)), class = "data.frame", row.names = c(NA, 
-7L))
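To see why `first(.[order(is.na(.))])` picks a non-NA value when one exists, it can help to run the same expression on a single vector. The sketch below uses base R only (`[1]` in place of dplyr's `first()`), and `first_non_na` is a hypothetical helper name introduced for illustration:

```r
# Hypothetical helper: first non-NA value of x, or NA if every value is NA
first_non_na <- function(x) {
  # is.na(x) is FALSE for present values and TRUE for NA;
  # order() sorts FALSE before TRUE, so present values move to the front
  x[order(is.na(x))][1]
}

sales_darryl <- c(NA, "Yes")        # Sales values in the Darryl W. group
idx <- order(is.na(sales_darryl))   # positions of non-NA values come first

print(idx)                          # 2 1
print(first_non_na(sales_darryl))   # "Yes"
print(first_non_na(c(NA_character_, NA_character_)))  # NA
```

A group that contains only NA still returns NA, which is why Bill W. keeps `<NA>` for Sales in the output above.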

【Discussion】: