【Question title】: Look for common entries in two data frames
【Posted】: 2020-03-28 11:12:59
【Question】:
df1:
 state group species
1 CA 2 cat, dog, chicken, mouse
2 CA 1 cat
3 NV 1 dog, chicken
4 NV 2 chicken
5 WA 1 chicken, rat, mouse, lion
6 WA 2 dog, cat
7 WA 3 dog, chicken
8 WA 4 cat, chicken

df2:
 state special_species
1 CA cat
2 CA chicken
3 CA mouse
4 WA cat
5 WA chicken
6 NV dog

I am interested in determining which special species from df2 appear in df1. I would like a new data frame containing state, group and special_species.

Expected output:

state group special_species
CA 2 cat, chicken, mouse
CA 1 cat
NV 1 dog
NV 2 NA
WA 1 chicken
WA 2 cat
WA 3 chicken
WA 4 cat, chicken

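【Question discussion】:

  • Could you tell us what your expected output is? That would help close the loop and confirm our understanding of the question you are asking.
  • Just updated the post with my expected output. Thanks!

Tags: r dplyr tidyr summarize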


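【Solution 1】:

This was harder than I expected. I think the approach below works, but I hope someone can come up with something prettier.

First, we create some data to work with (please do this yourself in future). I include it here in case anyone else wants to have a go: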

library(tidyverse)

df1 <- tribble(
  ~state, ~group, ~species,
  "CA", 2, "cat, dog, chicken, mouse",
  "CA", 1, "cat",
  "NV", 1, "dog, chicken",
  "NV", 2, "chicken",
  "WA", 1, "chicken, rat, mouse, lion",
  "WA", 2, "dog, cat",
  "WA", 3, "dog, chicken",
  "WA", 4, "cat, chicken")

df2 <- tribble(
  ~state, ~special_species,
  "CA", "cat",
  "CA", "chicken",
  "CA", "mouse",
  "WA", "cat",
  "WA", "chicken",
  "NV", "dog")

The solution is then:

df1 %>% 
  separate_rows(species) %>% 
  full_join(df2, by = "state") %>%
  filter(species == special_species) %>%
  group_by(state, group) %>%
  summarise(species = paste(special_species, collapse = ", ")) %>%
  full_join(df1, by = c("state" = "state", "group" = "group")) %>%
  select(state, group, special_species = species.x) %>%
  arrange(state)
#> # A tibble: 8 x 3
#> # Groups:   state [3]
#>   state group special_species    
#>   <chr> <dbl> <chr>              
#> 1 CA        1 cat                
#> 2 CA        2 cat, chicken, mouse
#> 3 NV        1 dog                
#> 4 NV        2 <NA>               
#> 5 WA        1 chicken            
#> 6 WA        2 cat                
#> 7 WA        3 chicken            
#> 8 WA        4 cat, chicken

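If you can accept the desired output in a slightly different format, the code can be simplified considerably. For example, the following is correct except that it drops the row whose result is NA: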

df1 %>% 
  separate_rows(species) %>% 
  full_join(df2, by = "state") %>%
  filter(species == special_species) %>%
  group_by(state, group) %>%
  summarise(species = paste(special_species, collapse = ", "))
#> # A tibble: 7 x 3
#> # Groups:   state [3]
#>   state group species            
#>   <chr> <dbl> <chr>              
#> 1 CA        1 cat                
#> 2 CA        2 cat, chicken, mouse
#> 3 NV        1 dog                
#> 4 WA        1 chicken            
#> 5 WA        2 cat                
#> 6 WA        3 chicken            
#> 7 WA        4 cat, chicken

Created on 2019-12-03 by the reprex package (v0.3.0)
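The NA row disappears in the shorter version because `filter()` discards every group that has no match. As a minimal sketch of one way to keep that row without joining back to df1, the special species can be flagged with a `left_join()` and collapsed afterwards. The helper `collapse_or_na()`, the column `is_special` and the result name `flagged` are names made up for this illustration, and the code assumes dplyr 1.0.0 or later (for the `.groups` argument) together with tidyr.

```r
library(dplyr)
library(tidyr)

# Sample data copied from the question
df1 <- data.frame(
  state = c("CA", "CA", "NV", "NV", "WA", "WA", "WA", "WA"),
  group = c(2, 1, 1, 2, 1, 2, 3, 4),
  species = c("cat, dog, chicken, mouse", "cat", "dog, chicken", "chicken",
              "chicken, rat, mouse, lion", "dog, cat", "dog, chicken",
              "cat, chicken"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  state = c("CA", "CA", "CA", "WA", "WA", "NV"),
  special_species = c("cat", "chicken", "mouse", "cat", "chicken", "dog"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collapse matches, or return NA when there are none
collapse_or_na <- function(x) {
  if (length(x) > 0) toString(x) else NA_character_
}

flagged <- df1 %>%
  separate_rows(species, sep = ",\\s*") %>%
  # Mark each (state, species) pair that is listed in df2
  left_join(mutate(df2, is_special = TRUE),
            by = c("state", "species" = "special_species")) %>%
  group_by(state, group) %>%
  # Every (state, group) survives, so unmatched groups end up as NA
  summarise(special_species = collapse_or_na(species[!is.na(is_special)]),
            .groups = "drop")

flagged
```

Because no group is filtered out before `summarise()`, the result should have one row per row of df1, with NA for NV group 2.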

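【Discussion】:

    【Solution 2】:

    Here is a data.table implementation that defines a function to look for matches row by row. There are probably more efficient solutions to this problem, but this is one possibility: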

    # Import the data.table package
    library(data.table)
    
    df1 <- data.frame(state = c("CA", "CA", "NV", "NV", "WA", "WA", "WA", "WA"), group = c(2, 1, 1, 2, 1, 2, 3, 4), species = c("cat, dog, chicken, mouse", "cat", "dog, chicken", "chicken", "chicken, rat, mouse, lion", "dog, cat", "dog, chicken", "cat, chicken"))
    df2 <- data.frame(state = c("CA", "CA", "CA", "WA", "WA", "NV"), special_species = c("cat", "chicken", "mouse", "cat", "chicken", "dog"))
    
    # Convert to data.table
    df1 <- as.data.table(df1)
    df2 <- as.data.table(df2)
    
    # Create a function to find matches and return the relevant species
    # Steps through df1 row by row
    fn_find_matches <- function(sel_row){
      # Get the relevant row information
      comp_row <- df1[sel_row]
      species <- trimws(unlist(strsplit(as.vector(comp_row$species), ",")))
    
      # Retrieve the relevant df2 information for the state
      comp_tbl <- df2[state == comp_row$state]
      species <- species[species %in% comp_tbl$special_species]
    
      # If there are no matching species, return NA
      if(length(species) > 0){
        comp_row$species <- paste(species, collapse = ", ")
      } else {
        comp_row$species <- NA_character_
      }
      return(comp_row)
    
    }
    # Create a resulting table
    result_table <- rbindlist(lapply(seq_len(nrow(df1)), fn_find_matches))
    # Convert back to data frame if desired
    setDF(result_table)
    setDF(df1)
    setDF(df2)
    

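    【Discussion】:

    • Glad to see I am not the only one who found this non-trivial.
    • There may be a way to do this with merge, but I can't think of it off the top of my head. Good luck!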
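The `merge` idea from the comment above can be sketched in base R as follows. This is only one possible reading of that suggestion, not the commenter's code. The intermediate names `parts`, `long`, `hits`, `collapsed` and `merge_result` are made up for the illustration.

```r
# Sample data copied from the question
df1 <- data.frame(
  state = c("CA", "CA", "NV", "NV", "WA", "WA", "WA", "WA"),
  group = c(2, 1, 1, 2, 1, 2, 3, 4),
  species = c("cat, dog, chicken, mouse", "cat", "dog, chicken", "chicken",
              "chicken, rat, mouse, lion", "dog, cat", "dog, chicken",
              "cat, chicken"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  state = c("CA", "CA", "CA", "WA", "WA", "NV"),
  special_species = c("cat", "chicken", "mouse", "cat", "chicken", "dog"),
  stringsAsFactors = FALSE
)

# 1. Reshape df1 to one row per (state, group, species)
parts <- strsplit(df1$species, ",\\s*")
long <- data.frame(
  state = rep(df1$state, lengths(parts)),
  group = rep(df1$group, lengths(parts)),
  special_species = unlist(parts),
  stringsAsFactors = FALSE
)

# 2. Inner merge on state and species keeps only the special species
hits <- merge(long, df2, by = c("state", "special_species"))

# 3. Collapse back to one comma-separated string per (state, group)
collapsed <- aggregate(special_species ~ state + group, data = hits,
                       FUN = toString)

# 4. Left merge onto the keys of df1 so groups without a hit become NA
merge_result <- merge(df1[c("state", "group")], collapsed,
                      by = c("state", "group"), all.x = TRUE)
merge_result
```

Note that `merge()` sorts on its key columns by default, so the species inside each cell come out in alphabetical order rather than in the order they had in df1. For this data the two orders coincide.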
    【Solution 3】:

    This builds on MSR's answer. We can use semi_join to simplify the code.

    library(tidyverse)
    
    df3 <- df1 %>%
      separate_rows(species) %>%
      semi_join(df2, by = c("state", "species" = "special_species")) %>%
      group_by(state, group) %>%
      summarize(species = toString(species)) %>%
      ungroup() %>%
      complete(state, group = full_seq(group, period = 1)) %>%
      semi_join(df1, by = c("state", "group"))
    df3
    # # A tibble: 8 x 3
    #   state group species            
    #   <chr> <dbl> <chr>              
    # 1 CA        1 cat                
    # 2 CA        2 cat, chicken, mouse
    # 3 NV        1 dog                
    # 4 NV        2 NA                 
    # 5 WA        1 chicken            
    # 6 WA        2 cat                
    # 7 WA        3 chicken            
    # 8 WA        4 cat, chicken  
    
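To make the role of `semi_join()` in the pipeline above easier to see, here is a small sketch on toy data. The data frames `x` and `y`, the column `note` and the result name `kept` are invented for this illustration and are not part of the answer. `semi_join()` keeps the rows of its first argument that have a match in the second, and adds no columns from the second.

```r
library(dplyr)

# Toy data, made up for illustration only
x <- data.frame(
  state = c("CA", "CA", "NV"),
  species = c("cat", "dog", "chicken"),
  stringsAsFactors = FALSE
)
y <- data.frame(
  state = c("CA", "NV"),
  special_species = c("cat", "dog"),
  note = c("listed", "listed"),
  stringsAsFactors = FALSE
)

# Only the (CA, cat) row of x has a partner in y; `note` is not carried over
kept <- semi_join(x, y, by = c("state", "species" = "special_species"))
kept
```

This is why the first `semi_join()` in the answer can replace the `full_join()` plus `filter()` pair from Solution 1: it filters on both keys at once and leaves the columns of df1 unchanged.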

    Data

    library(tidyverse)
    
    df1 <- tribble(
      ~state, ~group, ~species,
      "CA", 2, "cat, dog, chicken, mouse",
      "CA", 1, "cat",
      "NV", 1, "dog, chicken",
      "NV", 2, "chicken",
      "WA", 1, "chicken, rat, mouse, lion",
      "WA", 2, "dog, cat",
      "WA", 3, "dog, chicken",
      "WA", 4, "cat, chicken")
    
    df2 <- tribble(
      ~state, ~special_species,
      "CA", "cat",
      "CA", "chicken",
      "CA", "mouse",
      "WA", "cat",
      "WA", "chicken",
      "NV", "dog")
    

    【Discussion】:
