Finding common entries in two data frames: answers

【Question title】: Look for common entries in two data frames
【Posted】: 2020-03-28 11:12:59
【Question】:

df1:
 state group species
1 CA 2 cat, dog, chicken, mouse
2 CA 1 cat
3 NV 1 dog, chicken
4 NV 2 chicken
5 WA 1 chicken, rat, mouse, lion
6 WA 2 dog, cat
7 WA 3 dog, chicken
8 WA 4 cat, chicken

df2:
 state special_species
1 CA cat
2 CA chicken
3 CA mouse
4 WA cat
5 WA chicken
6 NV dog

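I want to determine which of the special species in df2 appear in df1, matching on state. The result should be a new data frame containing state, group, and special_species.

Expected output: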

state group special_species
CA 2 cat, chicken, mouse
CA 1 cat
NV 1 dog
NV 2 NA
WA 1 chicken
WA 2 cat
WA 3 chicken
WA 4 cat, chicken

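【Comments】:

Can you show us what your expected output is? That would help close the loop and confirm our understanding of the question.
Just updated the post with my expected output. Thanks!

Tags: r dplyr tidyr summarize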

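【Solution 1】:

This was harder than I expected. I think the following approach works, but I hope someone can come up with something prettier.

First, we create some data to work with (please do this yourself in future questions). I include it here in case anyone else wants to try: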

library(tidyverse)

df1 <- tribble(
  ~state, ~group, ~species,
  "CA", 2, "cat, dog, chicken, mouse",
  "CA", 1, "cat",
  "NV", 1, "dog, chicken",
  "NV", 2, "chicken",
  "WA", 1, "chicken, rat, mouse, lion",
  "WA", 2, "dog, cat",
  "WA", 3, "dog, chicken",
  "WA", 4, "cat, chicken")

df2 <- tribble(
  ~state, ~special_species,
  "CA", "cat",
  "CA", "chicken",
  "CA", "mouse",
  "WA", "cat",
  "WA", "chicken",
  "NV", "dog")

The solution is then:

df1 %>% 
  separate_rows(species) %>% 
  full_join(df2, by = "state") %>%  # the dplyr join argument is `by`, not `on`
  filter(species == special_species) %>%
  group_by(state, group) %>%
  summarise(species = paste(special_species, collapse = ", ")) %>%
  full_join(df1, by = c("state" = "state", "group" = "group")) %>%
  select(state, group, special_species = species.x) %>%
  arrange(state)
#> # A tibble: 8 x 3
#> # Groups:   state [3]
#>   state group special_species    
#>   <chr> <dbl> <chr>              
#> 1 CA        1 cat                
#> 2 CA        2 cat, chicken, mouse
#> 3 NV        1 dog                
#> 4 NV        2 <NA>               
#> 5 WA        1 chicken            
#> 6 WA        2 cat                
#> 7 WA        3 chicken            
#> 8 WA        4 cat, chicken
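
The first step of the pipeline is what makes the later comparison possible: separate_rows turns each comma-separated string into one row per species, so every species can be compared against special_species on its own. A minimal sketch of that step on two rows of df1 (the name df1_small and the explicit sep pattern are my own choices for illustration, not part of the answer):

```r
library(tibble)
library(tidyr)

# Two rows taken from df1 (hypothetical subset, for illustration only)
df1_small <- tribble(
  ~state, ~group, ~species,
  "NV", 1, "dog, chicken",
  "NV", 2, "chicken")

# Split on a comma followed by optional whitespace: one row per species
long <- separate_rows(df1_small, species, sep = ",\\s*")

# long now has three rows: (NV, 1, dog), (NV, 1, chicken), (NV, 2, chicken)
long
```

After this step, joining on state and keeping the rows where species equals special_species leaves only the special species for each state and group.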

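If you can accept a slightly different format for the desired output, the code can be simplified considerably. For example, the following is correct except that it drops the row whose result would be NA: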

df1 %>% 
  separate_rows(species) %>% 
  full_join(df2, by = "state") %>%  # the dplyr join argument is `by`, not `on`
  filter(species == special_species) %>%
  group_by(state, group) %>%
  summarise(species = paste(special_species, collapse = ", "))
#> # A tibble: 7 x 3
#> # Groups:   state [3]
#>   state group species            
#>   <chr> <dbl> <chr>              
#> 1 CA        1 cat                
#> 2 CA        2 cat, chicken, mouse
#> 3 NV        1 dog                
#> 4 WA        1 chicken            
#> 5 WA        2 cat                
#> 6 WA        3 chicken            
#> 7 WA        4 cat, chicken

Created on 2019-12-03 by the reprex package (v0.3.0)

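【Comments】:

【Solution 2】:

Here is a data.table implementation that defines a function to look for matches row by row. There are probably more efficient solutions to this problem, but this is one possibility: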

# Import the data.table package
library(data.table)

df1 <- data.frame(state = c("CA", "CA", "NV", "NV", "WA", "WA", "WA", "WA"), group = c(2, 1, 1, 2, 1, 2, 3, 4), species = c("cat, dog, chicken, mouse", "cat", "dog, chicken", "chicken", "chicken, rat, mouse, lion", "dog, cat", "dog, chicken", "cat, chicken"))
df2 <- data.frame(state = c("CA", "CA", "CA", "WA", "WA", "NV"), special_species = c("cat", "chicken", "mouse", "cat", "chicken", "dog"))

# Convert to data.table
df1 <- as.data.table(df1)
df2 <- as.data.table(df2)

# Create a function to find matches and return the relevant species
# Steps through df1 row by row
fn_find_matches <- function(sel_row){
  # Get the relevant row information
  comp_row <- df1[sel_row]
  species <- trimws(unlist(strsplit(as.vector(comp_row$species), ",")))

  # Retrieve the relevant df2 information for the state
  comp_tbl <- df2[state == comp_row$state]
  species <- species[species %in% comp_tbl$special_species]

  # If there are no matching species, return NA
  if(length(species) > 0){
    comp_row$species <- paste(species, collapse = ", ")
  } else {
    comp_row$species <- NA_character_
  }
  return(comp_row)

}
# Create a resulting table
result_table <- rbindlist(lapply(seq_len(nrow(df1)), fn_find_matches))
# Convert back to data frame if desired
setDF(result_table)
setDF(df1)
setDF(df2)
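
The same per-row lookup can probably also be written in base R without a helper that indexes into df1. The following is only a sketch under my own assumptions: the names special, species_list, matches, and result are not from the answer, and stringsAsFactors = FALSE is set explicitly so the columns are character on any R version.

```r
df1 <- data.frame(
  state = c("CA", "CA", "NV", "NV", "WA", "WA", "WA", "WA"),
  group = c(2, 1, 1, 2, 1, 2, 3, 4),
  species = c("cat, dog, chicken, mouse", "cat", "dog, chicken", "chicken",
              "chicken, rat, mouse, lion", "dog, cat", "dog, chicken",
              "cat, chicken"),
  stringsAsFactors = FALSE)
df2 <- data.frame(
  state = c("CA", "CA", "CA", "WA", "WA", "NV"),
  special_species = c("cat", "chicken", "mouse", "cat", "chicken", "dog"),
  stringsAsFactors = FALSE)

# Special species per state, as a named list (one element per state)
special <- split(df2$special_species, df2$state)

# Split each comma-separated string into a character vector
species_list <- strsplit(df1$species, ",\\s*")

# For each row, keep the species that are special in that row's state
matches <- Map(function(sp, st) intersect(sp, special[[st]]),
               species_list, df1$state)

# Collapse the matches back into one string per row, NA when nothing matched
result <- data.frame(
  state = df1$state,
  group = df1$group,
  special_species = vapply(
    matches,
    function(m) if (length(m) > 0) paste(m, collapse = ", ") else NA_character_,
    character(1)),
  stringsAsFactors = FALSE)
```

Because intersect keeps the order of its first argument, the species should appear in the same order as in df1, and the rows stay in df1's original order.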

【Comments】:

Glad to see I'm not the only one who found this non-trivial.
There may be a way to do this with merge, but I'm just not sure off the top of my head. Good luck!
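
On the merge idea raised in the comment above, one possible shape for it is sketched here. This is not something proposed in the thread: the intermediate names long, hits, collapsed, and result are hypothetical, and the approach assumes the species strings are separated by a comma and optional whitespace.

```r
df1 <- data.frame(
  state = c("CA", "CA", "NV", "NV", "WA", "WA", "WA", "WA"),
  group = c(2, 1, 1, 2, 1, 2, 3, 4),
  species = c("cat, dog, chicken, mouse", "cat", "dog, chicken", "chicken",
              "chicken, rat, mouse, lion", "dog, cat", "dog, chicken",
              "cat, chicken"),
  stringsAsFactors = FALSE)
df2 <- data.frame(
  state = c("CA", "CA", "CA", "WA", "WA", "NV"),
  special_species = c("cat", "chicken", "mouse", "cat", "chicken", "dog"),
  stringsAsFactors = FALSE)

# Long format: one row per (state, group, species)
parts <- strsplit(df1$species, ",\\s*")
long <- data.frame(
  state = rep(df1$state, lengths(parts)),
  group = rep(df1$group, lengths(parts)),
  species = unlist(parts),
  stringsAsFactors = FALSE)

# Inner merge keeps only the (state, species) pairs that are listed in df2
hits <- merge(long, df2,
              by.x = c("state", "species"),
              by.y = c("state", "special_species"))

# Collapse the surviving species into one string per state and group
collapsed <- aggregate(species ~ state + group, data = hits, FUN = toString)
names(collapsed)[names(collapsed) == "species"] <- "special_species"

# Left merge back onto df1 so groups without any match show up as NA
result <- merge(df1[c("state", "group")], collapsed,
                by = c("state", "group"), all.x = TRUE)
result <- result[order(result$state, result$group), ]
```

Note that merge sorts on the key columns, so within each group the species come out in alphabetical order rather than in the order they had in df1. For this data the two orders happen to coincide.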

【Solution 3】:

This builds on MSR's answer. We can use semi_join to simplify the code.

library(tidyverse)

df3 <- df1 %>%
  separate_rows(species) %>%
  semi_join(df2, by = c("state", "species" = "special_species")) %>%
  group_by(state, group) %>%
  summarize(species = toString(species)) %>%
  ungroup() %>%
  complete(state, group = full_seq(group, period = 1)) %>%
  semi_join(df1, by = c("state", "group"))
df3
# # A tibble: 8 x 3
#   state group species            
#   <chr> <dbl> <chr>              
# 1 CA        1 cat                
# 2 CA        2 cat, chicken, mouse
# 3 NV        1 dog                
# 4 NV        2 NA                 
# 5 WA        1 chicken            
# 6 WA        2 cat                
# 7 WA        3 chicken            
# 8 WA        4 cat, chicken
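
One assumption in the pipeline above is that group is numeric and evenly spaced, because full_seq needs that to generate the missing combinations. A sketch of a variant that avoids this by left-joining the matches back onto df1 follows. The names matched and result are mine, not from the answer, and `.groups = "drop"` assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)

df1 <- tribble(
  ~state, ~group, ~species,
  "CA", 2, "cat, dog, chicken, mouse",
  "CA", 1, "cat",
  "NV", 1, "dog, chicken",
  "NV", 2, "chicken",
  "WA", 1, "chicken, rat, mouse, lion",
  "WA", 2, "dog, cat",
  "WA", 3, "dog, chicken",
  "WA", 4, "cat, chicken")

df2 <- tribble(
  ~state, ~special_species,
  "CA", "cat",
  "CA", "chicken",
  "CA", "mouse",
  "WA", "cat",
  "WA", "chicken",
  "NV", "dog")

# Keep only species that are special in their state, then collapse per group
matched <- df1 %>%
  separate_rows(species, sep = ",\\s*") %>%
  semi_join(df2, by = c("state", "species" = "special_species")) %>%
  group_by(state, group) %>%
  summarise(special_species = toString(species), .groups = "drop")

# Left join onto df1 so every original (state, group) row is kept;
# groups with no matching species get NA
result <- df1 %>%
  select(state, group) %>%
  left_join(matched, by = c("state", "group"))
```

With this variant the rows keep df1's original order (CA 2 before CA 1), so an arrange(state, group) can be appended if sorted output is wanted.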

Data

library(tidyverse)

df1 <- tribble(
  ~state, ~group, ~species,
  "CA", 2, "cat, dog, chicken, mouse",
  "CA", 1, "cat",
  "NV", 1, "dog, chicken",
  "NV", 2, "chicken",
  "WA", 1, "chicken, rat, mouse, lion",
  "WA", 2, "dog, cat",
  "WA", 3, "dog, chicken",
  "WA", 4, "cat, chicken")

df2 <- tribble(
  ~state, ~special_species,
  "CA", "cat",
  "CA", "chicken",
  "CA", "mouse",
  "WA", "cat",
  "WA", "chicken",
  "NV", "dog")

【Comments】: