【Question title】: How to compare two data frames/tables and extract data in R?
【Posted】: 2016-04-13 08:02:35
【Question description】:

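While trying to extract the mismatches between the two data frames below, I have already managed to create a new data frame in which the mismatches are replaced.
What I need now is a list of the mismatches: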

dfA <- structure(list(animal1 = c("AA", "TT", "AG", "CA"), animal2 = c("AA", "TB", "AG", "CA"), animal3 = c("AA", "TT", "AG", "CA")), .Names = c("animal1", "animal2", "animal3"), row.names = c("snp1", "snp2", "snp3", "snp4"), class = "data.frame")
# > dfA
#      animal1 animal2 animal3
# snp1      AA      AA      AA
# snp2      TT      TB      TT
# snp3      AG      AG      AG
# snp4      CA      CA      CA
dfB <- structure(list(animal1 = c("AA", "TT", "AG", "CA"), animal2 = c("AA", "TB", "AG", "DF"), animal3 = c("AA", "TB", "AG", "DF")), .Names = c("animal1", "animal2", "animal3"), row.names = c("snp1", "snp2", "snp3", "snp4"), class = "data.frame")
#> dfB
#     animal1 animal2 animal3
#snp1      AA      AA      AA
#snp2      TT      TB      TB
#snp3      AG      AG      AG
#snp4      CA      DF      DF

To make the mismatches clear, they are marked here as 00:

#      animal1 animal2 animal3
# snp1      AA      AA      AA
# snp2      TT      TB      00
# snp3      AG      AG      AG
# snp4      CA      00      00

I need the following output:

structure(list(snpname = structure(c(1L, 2L, 2L), .Label = c("snp2", "snp4"), class = "factor"), animalname = structure(c(2L, 1L, 2L), .Label = c("animal2", "animal3"), class = "factor"), alleledfA = structure(c(2L, 1L, 1L), .Label = c("CA", "TT"), class = "factor"), alleledfB = structure(c(2L, 1L, 1L), .Label = c("DF", "TB"), class = "factor")), .Names = c("snpname", "animalname", "alleledfA", "alleledfB"), class = "data.frame", row.names = c(NA, -3L))
#  snpname animalname alleledfA alleledfB
#1    snp2    animal3        TT        TB
#2    snp4    animal2        CA        DF
#3    snp4    animal3        CA        DF

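So far I have tried, without success, to extract the additional data from the lapply function that I use to replace the mismatches with zeros. I also tried writing an ifelse function, again without success. I hope you can help me!

Eventually this will run on data sets with dimensions of 100K x 1000, so efficiency matters.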

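【Question comments】:

  • Your clarification can be produced with: ifelse(as.matrix(dfA) == as.matrix(dfB), as.matrix(dfA), "00")
  • Do the row names of dfA always match the row names of dfB?
  • @lukeA Yes, I created two subsets in which the row names and column names will always match.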
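
For readers who want to try the one-liner from the first comment, here is a minimal self-contained sketch. It rebuilds the toy data from the question with `data.frame()`; the names `masked` and `n_mismatch` are only illustrative. It relies on the row and column names of both tables lining up, as confirmed in the last comment.

```r
# Toy data from the question, rebuilt so the block runs on its own
dfA <- data.frame(animal1 = c("AA", "TT", "AG", "CA"),
                  animal2 = c("AA", "TB", "AG", "CA"),
                  animal3 = c("AA", "TT", "AG", "CA"),
                  row.names = c("snp1", "snp2", "snp3", "snp4"),
                  stringsAsFactors = FALSE)
dfB <- data.frame(animal1 = c("AA", "TT", "AG", "CA"),
                  animal2 = c("AA", "TB", "AG", "DF"),
                  animal3 = c("AA", "TB", "AG", "DF"),
                  row.names = c("snp1", "snp2", "snp3", "snp4"),
                  stringsAsFactors = FALSE)

# Convert once to character matrices; elementwise comparison is then vectorised
mA <- as.matrix(dfA)
mB <- as.matrix(dfB)

# Keep the allele where both tables agree, write "00" where they differ
masked <- ifelse(mA == mB, mA, "00")

# Number of mismatching cells (illustrative name)
n_mismatch <- sum(mA != mB)

print(masked)
print(n_mismatch)
```

This only reproduces the masked table; it does not yet list which snp/animal pairs differ, which is what the question asks for.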

Tags: r dataframe compare data.table mismatch


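【Solution 1】:

The question is tagged data.table, so here is my attempt using that package. The first step is to convert the row names to a column, because data.table does not keep row names. Then rbind the two tables with an id per data set, convert to long format, find the places with more than one unique value, and convert back to wide format.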

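library(data.table)
# Convert in place, keeping the row names as a column named `rn`
setDT(dfA, keep.rownames = TRUE)
setDT(dfB, keep.rownames = TRUE)

# Stack both tables with an `.id` column, melt to long format, keep the
# rn/variable pairs with more than one distinct value, then cast back to wide
dcast(melt(rbind(dfA,
                 dfB,
                 idcol = TRUE),
           id.vars = 1:2
           )[,
             if (uniqueN(value) > 1L) .SD,
             by = .(rn, variable)],
      rn + variable ~ .id)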

#      rn variable  1  2
# 1: snp2  animal3 TT TB
# 2: snp4  animal2 CA DF
# 3: snp4  animal3 CA DF
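
The nested call above can be hard to follow, so here is a step-by-step sketch of the same idea. It is not the answerer's exact code: it works on copies made with `as.data.table()` instead of `setDT()`, uses `rbindlist()` with a named list so the id column carries the table names, and the names `stacked`, `long`, `mismatch`, `result`, `source`, `animal` and `allele` are chosen only for illustration.

```r
library(data.table)

# Toy data from the question, rebuilt so the block runs on its own
dfA <- data.frame(animal1 = c("AA", "TT", "AG", "CA"),
                  animal2 = c("AA", "TB", "AG", "CA"),
                  animal3 = c("AA", "TT", "AG", "CA"),
                  row.names = c("snp1", "snp2", "snp3", "snp4"),
                  stringsAsFactors = FALSE)
dfB <- data.frame(animal1 = c("AA", "TT", "AG", "CA"),
                  animal2 = c("AA", "TB", "AG", "DF"),
                  animal3 = c("AA", "TB", "AG", "DF"),
                  row.names = c("snp1", "snp2", "snp3", "snp4"),
                  stringsAsFactors = FALSE)

# Step 1: copies with the row names kept as a column called `rn`
dtA <- as.data.table(dfA, keep.rownames = TRUE)
dtB <- as.data.table(dfB, keep.rownames = TRUE)

# Step 2: stack both tables; the list names become the values of `source`
stacked <- rbindlist(list(dfA = dtA, dfB = dtB), idcol = "source")

# Step 3: long format, one row per (source, snp, animal)
long <- melt(stacked, id.vars = c("source", "rn"),
             variable.name = "animal", value.name = "allele")

# Step 4: keep only the snp/animal pairs with more than one distinct allele
mismatch <- long[, if (uniqueN(allele) > 1L) .SD, by = .(rn, animal)]

# Step 5: back to wide format, one allele column per source table
result <- dcast(mismatch, rn + animal ~ source, value.var = "allele")
print(result)
```

With the toy data this should give the same three mismatches as above, with the allele columns named `dfA` and `dfB` rather than `1` and `2`.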

【Comments】:

    【Solution 2】:

    Here is a solution that uses the array indices of the matrix:

    i.arr <- which(dfA != dfB, arr.ind=TRUE)
    
    data.frame(snp=rownames(dfA)[i.arr[,1]], animal=colnames(dfA)[i.arr[,2]],
               A=dfA[i.arr], B=dfB[i.arr])
    #   snp  animal  A  B
    #1 snp4 animal2 CA DF
    #2 snp2 animal3 TT TB
    #3 snp4 animal3 CA DF
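
As a possible variation on the array-index idea above, the sketch below wraps it in a function and produces the column names and row order requested in the question. `find_mismatches` is a hypothetical helper, not part of any package. It converts each table to a matrix once and assumes both tables have identical row and column names, as the asker confirmed. Cells that are `NA` in either table are not reported, because `which()` drops `NA` comparisons.

```r
# Toy data from the question, rebuilt so the block runs on its own
dfA <- data.frame(animal1 = c("AA", "TT", "AG", "CA"),
                  animal2 = c("AA", "TB", "AG", "CA"),
                  animal3 = c("AA", "TT", "AG", "CA"),
                  row.names = c("snp1", "snp2", "snp3", "snp4"),
                  stringsAsFactors = FALSE)
dfB <- data.frame(animal1 = c("AA", "TT", "AG", "CA"),
                  animal2 = c("AA", "TB", "AG", "DF"),
                  animal3 = c("AA", "TB", "AG", "DF"),
                  row.names = c("snp1", "snp2", "snp3", "snp4"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: list every cell where the two tables disagree
find_mismatches <- function(a, b) {
  ma <- as.matrix(a)
  mb <- as.matrix(b)
  # Assumption: same dimensions and same row/column names in both tables
  stopifnot(identical(dimnames(ma), dimnames(mb)))

  # Two-column matrix of (row, col) positions of the differing cells
  idx <- which(ma != mb, arr.ind = TRUE)
  # Sort by snp first, then by animal, as in the requested output
  idx <- idx[order(idx[, "row"], idx[, "col"]), , drop = FALSE]

  data.frame(snpname    = rownames(ma)[idx[, "row"]],
             animalname = colnames(ma)[idx[, "col"]],
             alleledfA  = ma[idx],   # matrix indexing picks one cell per row of idx
             alleledfB  = mb[idx],
             stringsAsFactors = FALSE)
}

res <- find_mismatches(dfA, dfB)
print(res)
```

Because everything here is a vectorised matrix operation, this style is a reasonable first candidate for the 100K x 1000 case mentioned in the question, though it has not been benchmarked here.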
    

    【Comments】:

      【Solution 3】:

      This can be done with dplyr/tidyr using an approach similar to the one in @David Arenburg's post.

      library(dplyr)
      library(tidyr)
      bind_rows(add_rownames(dfA), add_rownames(dfB)) %>% 
                gather(Var, Val, -rowname) %>%
                group_by(rowname, Var) %>%
                filter(n_distinct(Val)>1) %>% 
                mutate(id = 1:2) %>% 
                spread(id, Val)
      #  rowname     Var     1     2
      #    (chr)   (chr) (chr) (chr)
      #1    snp2 animal3    TT    TB
      #2    snp4 animal2    CA    DF
      #3    snp4 animal3    CA    DF
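
In more recent versions of the tidyverse, `add_rownames()` is deprecated and `gather()`/`spread()` are superseded. A sketch of the same approach with `tibble::rownames_to_column()`, `pivot_longer()` and `pivot_wider()` might look like the following. This is an adaptation, not the answerer's code, and the `source` column name and the `result` variable are illustrative.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy data from the question, rebuilt so the block runs on its own
dfA <- data.frame(animal1 = c("AA", "TT", "AG", "CA"),
                  animal2 = c("AA", "TB", "AG", "CA"),
                  animal3 = c("AA", "TT", "AG", "CA"),
                  row.names = c("snp1", "snp2", "snp3", "snp4"),
                  stringsAsFactors = FALSE)
dfB <- data.frame(animal1 = c("AA", "TT", "AG", "CA"),
                  animal2 = c("AA", "TB", "AG", "DF"),
                  animal3 = c("AA", "TB", "AG", "DF"),
                  row.names = c("snp1", "snp2", "snp3", "snp4"),
                  stringsAsFactors = FALSE)

result <- bind_rows(dfA = rownames_to_column(dfA),   # row names become `rowname`
                    dfB = rownames_to_column(dfB),
                    .id = "source") %>%               # "dfA" / "dfB" label per row
  pivot_longer(cols = -c(source, rowname),
               names_to = "Var", values_to = "Val") %>%
  group_by(rowname, Var) %>%
  filter(n_distinct(Val) > 1) %>%                     # keep disagreeing cells only
  ungroup() %>%
  pivot_wider(names_from = source, values_from = Val) %>%
  arrange(rowname, Var)

print(result)
```

Labelling the rows through `.id` avoids the `mutate(id = 1:2)` step, which depends on each group containing exactly two rows in a fixed order.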
      

      【Comments】:
