【Question title】: Merge dataframes by groups common to both
【Posted】: 2017-04-22 16:29:36
【Question】:

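I have two data sets of lobster egg sizes, collected by different samplers, that will be used to assess measurement variability. Each sampler measures about 50 eggs per lobster from many lobsters. However, sometimes a lobster was processed by sampler one but not by sampler two, or vice versa. I would like to combine the data from both samplers into a new data set, but drop all data from lobsters that were processed by only one sampler. I have played around with semi_join and intersect in dplyr, but I need the comparison to run in both directions: data set 1 -> 2 and 2 -> 1.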

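Here is a simplified version of my data, in which several egg-area measurements were taken on each of several lobsters, but the sampling does not always overlap (i.e., for some individuals the eggs were measured by only one sampler and not the other):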

install.packages("dplyr")  # the package name must be a quoted string
library(dplyr)

sampler1 <- data.frame(LobsterID=c("Lobster1","Lobster1","Lobster2",
                                   "Lobster2","Lobster2","Lobster2",
                                   "Lobster2","Lobster3","Lobster3","Lobster3"),
                       Area=c(.4,.35,1.1,1.04,1.14,1.1,1.05,1.7,1.63,1.8),
                       Sampler=c(rep("Sampler1", 10)))
sampler2 <- data.frame(LobsterID=c("Lobster1","Lobster1","Lobster1",
                                   "Lobster1","Lobster1","Lobster2",
                                   "Lobster2","Lobster2","Lobster4","Lobster4"),
                       Area=c(.41,.44,.47,.43,.38,1.14,1.11,1.09,1.41,1.4),
                       Sampler=c(rep("Sampler2", 10)))

combined <- bind_rows(sampler1, sampler2)

desiredresult <- combined[-c(8, 9, 10, 19, 20), ]

The last line of the script is the expected result for the mock data. I would prefer to stick to base R or dplyr.

【Comments】:

    Tags: r merge dplyr


    【Solution 1】:
    sampler1 %>% rbind(sampler2) %>% filter(LobsterID %in% intersect(sampler1$LobsterID, sampler2$LobsterID))
    

    【Comments】:

      【Solution 2】:
      combined <- bind_rows(sampler1, sampler2)
      
      
      Lobsters.2.sample <- as.character(unique(sampler1$LobsterID)[unique(sampler1$LobsterID) %in% unique(sampler2$LobsterID)])
      
      combined <- combined[combined$LobsterID %in% Lobsters.2.sample,]
      

      【Comments】:

        【Solution 3】:

        Using base R

        combined <-rbind(sampler1, sampler2)
        inBoth <- intersect(sampler1[["LobsterID"]], sampler2[["LobsterID"]])
        output <- combined[combined[["LobsterID"]] %in% inBoth, ]
        

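        intersect finds the set intersection of the two vectors, which gives you the lobsters that appear in both samples. All of the functions are vectorized, so it should run quite fast.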
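To make the role of intersect concrete, here is a minimal base R sketch on a reduced version of the question's data. The eight rows and the `stringsAsFactors = FALSE` argument are assumptions made for illustration (the latter keeps the IDs as plain character vectors on older R versions); the three filtering steps are the same as in the answer.

```r
# Reduced mock data (assumed for illustration; not the full data from the question)
sampler1 <- data.frame(
  LobsterID = c("Lobster1", "Lobster1", "Lobster2", "Lobster3"),
  Area      = c(0.40, 0.35, 1.10, 1.70),
  Sampler   = "Sampler1",
  stringsAsFactors = FALSE
)
sampler2 <- data.frame(
  LobsterID = c("Lobster1", "Lobster2", "Lobster2", "Lobster4"),
  Area      = c(0.41, 1.14, 1.11, 1.41),
  Sampler   = "Sampler2",
  stringsAsFactors = FALSE
)

# IDs present in both data sets: the set intersection is symmetric,
# so a single call covers both directions of the comparison
inBoth <- intersect(sampler1$LobsterID, sampler2$LobsterID)

# Stack the two data sets, then keep only rows whose ID is in the intersection
combined <- rbind(sampler1, sampler2)
output   <- combined[combined$LobsterID %in% inBoth, ]

print(inBoth)
print(output)
```

Lobster3 and Lobster4 each appear in only one data set, so their rows are dropped and six rows remain.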

        【Comments】:

          【Solution 4】:

          Bind the rows, group by LobsterID, and filter on the number of distinct samplers in each group:

          sampler1 %>% bind_rows(sampler2) %>% 
              group_by(LobsterID) %>% 
              filter(n_distinct(Sampler) == 2)
          
          ## Source: local data frame [15 x 3]
          ## Groups: LobsterID [2]
          ## 
          ##    LobsterID  Area  Sampler
          ##        <chr> <dbl>    <chr>
          ## 1   Lobster1  0.40 Sampler1
          ## 2   Lobster1  0.35 Sampler1
          ## 3   Lobster2  1.10 Sampler1
          ## 4   Lobster2  1.04 Sampler1
          ## 5   Lobster2  1.14 Sampler1
          ## 6   Lobster2  1.10 Sampler1
          ## 7   Lobster2  1.05 Sampler1
          ## 8   Lobster1  0.41 Sampler2
          ## 9   Lobster1  0.44 Sampler2
          ## 10  Lobster1  0.47 Sampler2
          ## 11  Lobster1  0.43 Sampler2
          ## 12  Lobster1  0.38 Sampler2
          ## 13  Lobster2  1.14 Sampler2
          ## 14  Lobster2  1.11 Sampler2
          ## 15  Lobster2  1.09 Sampler2
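For comparison, the semi_join route mentioned in the question also works if the filtering join is applied once in each direction and the two halves are then stacked. The sketch below uses a reduced data set assumed for illustration, and checks that it keeps the same lobsters as the grouped filter above; it is one possible formulation, not part of the original answer.

```r
library(dplyr)

# Reduced mock data (assumed for illustration; not the full data from the question)
sampler1 <- data.frame(
  LobsterID = c("Lobster1", "Lobster1", "Lobster2", "Lobster3"),
  Area      = c(0.40, 0.35, 1.10, 1.70),
  Sampler   = "Sampler1",
  stringsAsFactors = FALSE
)
sampler2 <- data.frame(
  LobsterID = c("Lobster1", "Lobster2", "Lobster2", "Lobster4"),
  Area      = c(0.41, 1.14, 1.11, 1.41),
  Sampler   = "Sampler2",
  stringsAsFactors = FALSE
)

# semi_join(x, y) keeps the rows of x that have a match in y,
# so it has to be applied in both directions before stacking
both_ways <- bind_rows(
  semi_join(sampler1, sampler2, by = "LobsterID"),
  semi_join(sampler2, sampler1, by = "LobsterID")
)

# Grouped filter from the answer, ungrouped afterwards for comparison
grouped <- bind_rows(sampler1, sampler2) %>%
  group_by(LobsterID) %>%
  filter(n_distinct(Sampler) == 2) %>%
  ungroup()

print(both_ways)
```

The grouped filter needs only one pass over the stacked data and extends naturally to more than two samplers, whereas the semi_join version needs one join per direction.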
          

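          【Comments】:

            【Solution 5】:

            Here is an option using data.table. Bind the data sets with rbindlist, group by "LobsterID", and subset the rows using a logical condition on the number of unique elements in "Sampler" (i.e., equal to 2).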

            library(data.table)
            rbindlist(list(sampler1, sampler2))[, if(uniqueN(Sampler)==2) .SD , by = LobsterID]
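One detail worth knowing about the grouped form: because the subset is returned per group, the rows come back arranged by LobsterID rather than in the original stacked order. A small sketch on reduced data (the eight rows are an assumption for illustration) shows this:

```r
library(data.table)

# Reduced mock data (assumed for illustration; not the full data from the question)
sampler1 <- data.frame(
  LobsterID = c("Lobster1", "Lobster1", "Lobster2", "Lobster3"),
  Area      = c(0.40, 0.35, 1.10, 1.70),
  Sampler   = "Sampler1",
  stringsAsFactors = FALSE
)
sampler2 <- data.frame(
  LobsterID = c("Lobster1", "Lobster2", "Lobster2", "Lobster4"),
  Area      = c(0.41, 1.14, 1.11, 1.41),
  Sampler   = "Sampler2",
  stringsAsFactors = FALSE
)

# Stack both data sets, then return .SD (the group's rows) only for
# groups in which both samplers occur
result <- rbindlist(list(sampler1, sampler2))[
  , if (uniqueN(Sampler) == 2) .SD, by = LobsterID
]

print(result)
```

Here the Lobster1 rows from both samplers come first, followed by the Lobster2 rows. If the original order matters, sort afterwards or use one of the `%in%`-based filters instead.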
            

            【Comments】:
