【Question title】: Merging replicate scores but mark the differences
【Posted】: 2019-09-18 04:02:05
【Question】:

This is what I have:

df <- structure(list(Sample = structure(c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 
                                    4L), .Label = c("19-0001", "19-0002", "19-0003", "19-0004"), class = "factor"), 
               Replicate = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), X24854000 = structure(c(1L, 
                                                                                      2L, 2L, 1L, 2L, 2L, 1L, 1L), .Label = c("", "CC"), class = "factor"), 
               X24854056 = structure(c(3L, 3L, 2L, 1L, 1L, 1L, 1L, 1L), .Label = c("", 
                                                                                   "AA", "GG"), class = "factor"), X24854764 = structure(c(1L, 
                                                                                                                                           1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "TA", class = "factor"), 
               X24854903 = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L), .Label = c("", 
                                                                                   "CT"), class = "factor"), X24855066 = structure(c(1L, 1L, 
                                                                                                                                     3L, 3L, 2L, 2L, 2L, 2L), .Label = c("", "CA", "CC"), class = "factor"), 
               X24855114 = structure(c(2L, 1L, 3L, 3L, 2L, 2L, 2L, 2L), .Label = c("", 
                                                                                   "GA", "GG"), class = "factor"), X24855316 = structure(c(2L, 
                                                                                                                                           2L, 1L, 1L, 2L, 2L, 2L, 1L), .Label = c("", "TC"), class = "factor"), 
               X24855449 = structure(c(1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("CC", 
                                                                                   "GG"), class = "factor"), X24855925 = structure(c(2L, 1L, 
                                                                                                                                     1L, 3L, 2L, 2L, 1L, 1L), .Label = c("", "GA", "GG"), class = "factor"), 
               X24856070 = structure(c(2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("CC", 
                                                                                   "CT"), class = "factor"), X24856086 = structure(c(2L, 1L, 
                                                                                                                                     2L, 2L, 2L, 2L, 2L, 2L), .Label = c("CC", "CT"), class = "factor"), 
               X24856329 = structure(c(2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("", 
                                                                                   "AG"), class = "factor"), X24856389 = structure(c(2L, 1L, 
                                                                                                                                     1L, 1L, 2L, 2L, 2L, 2L), .Label = c("", "GG"), class = "factor"), 
               X24857235 = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L), .Label = c("", 
                                                                                   "CT"), class = "factor"), X24857350 = structure(c(3L, 3L, 
                                                                                                                                     1L, 1L, 2L, 2L, 1L, 1L), .Label = c("", "GA", "GG"), class = "factor"), 
               X24857404 = structure(c(1L, 3L, 1L, 1L, 2L, 2L, 1L, 1L), .Label = c("", 
                                                                                   "AT", "TT"), class = "factor")), class = "data.frame", row.names = c(NA, 
                                                                                                                                                        -8L))

This produces the following table:

Sample   Replicate  X24854000  X24854056  X24854764  X24854903  X24855066  X24855114  X24855316  X24855449  X24855925  X24856070  X24856086  X24856329  X24856389  X24857235  X24857350  X24857404
19-0001  1                     GG         TA                               GA         TC         CC         GA         CT         CT         AG         GG                    GG
19-0001  2          CC         GG         TA                                          TC         GG                    CC         CC                                          GG         TT
19-0002  1          CC         AA         TA                    CC         GG                    GG                    CC         CT         AG
19-0002  2                                TA                    CC         GG                    GG         GG         CC         CT         AG
19-0003  1          CC                    TA         CT         CA         GA         TC         CC         GA         CC         CT         AG         GG         CT         GA         AT
19-0003  2          CC                    TA         CT         CA         GA         TC         CC         GA         CC         CT         AG         GG         CT         GA         AT
19-0004  1                                TA                    CA         GA         TC         CC                    CC         CT         AG         GG         CT
19-0004  2                                TA                    CA         GA                    CC                    CC         CT         AG         GG

This is what I want:

Sample   Replicate  X24854000  X24854056  X24854764  X24854903  X24855066  X24855114  X24855316  X24855449  X24855925  X24856070  X24856086  X24856329  X24856389  X24857235  X24857350  X24857404
19-0001  1          CC         GG         TA                               GA         TC         99         GA         99         99         AG         GG                    GG         TT
19-0002  1          CC         AA         TA                    CC         GG                    GG         GG         CC         CT         AG
19-0003  1          CC                    TA         CT         CA         GA         TC         CC         GA         CC         CT         AG         GG         CT         GA         AT
19-0004  1                                TA                    CA         GA         TC         CC                    CC         CT         AG         GG         CT

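Replicates 1 and 2 should be merged under the same sample name. A missing score can be filled in from the other replicate, and identical scores collapse to a single value, but any mismatch should be replaced with "99" so that it can be removed later.

I tried: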

data_merge <- data %>%
    group_by(Sample) %>%
    summarise_all(ifelse(statement), (if_true), (if_false))

This is only a subset of the data; the real data has 44 X columns.

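【Comments】:

  • Please provide sample data in a reproducible format, e.g. using dput.
  • I'm not familiar with dput. I tried dput(out, file = "test.txt", control = c("keepNA", "keepInteger")), but the output file does not look the same as the input.
  • How to use dput is explained in a post on how to provide a minimal reproducible example. In short, run dput(df) (where df is your data.frame), then copy and paste the output of dput into your main post (not into a comment).
  • Thanks. Compared with the package's own documentation, that link was really useful. I'll use it the next time I have an R question.
  • Glad it helped, @RSun. Please consider closing the question by setting the green check mark next to the answer. That helps keep SO tidy and makes it easier for future SO readers to identify relevant questions. Thanks.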
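
As a rough illustration of the dput workflow described in these comments, the sketch below serialises a tiny data frame and reads it back. The object name `small` and its values are made up for the example; only `dput()` itself comes from the thread.

```r
# Toy data frame (hypothetical values, not the asker's full data)
small <- data.frame(
  Sample    = c("19-0001", "19-0001"),
  Replicate = 1:2,
  X24854000 = c("", "CC"),
  stringsAsFactors = FALSE
)

# Print a text representation that can be pasted straight into a post
dput(small)

# Round trip: capture that text and evaluate it to rebuild the object
txt <- paste(capture.output(dput(small)), collapse = "\n")
restored <- eval(parse(text = txt))

# Check whether the rebuilt object matches the original
identical(small, restored)
```

Pasting the printed `structure(...)` call into the question lets anyone recreate the object by assigning it to a name, which is what the question body above ends up doing.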

Tags: r merge dplyr


【Answer 1】:

Here is one option:

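library(dplyr)

df %>%
    # compare plain strings rather than factor levels
    mutate_if(is.factor, as.character) %>%
    group_by(Sample) %>%
    summarise_at(
        vars(starts_with("X")),
        # keep the single distinct non-blank call; anything else
        # (a mismatch, or both replicates blank) becomes "99"
        ~if_else(length(unique(.x[.x != ""])) == 1, first(.x[.x != ""]), "99"))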
## A tibble: 4 x 17
#  Sample X24854000 X24854056 X24854764 X24854903 X24855066 X24855114 X24855316
#  <chr>  <chr>     <chr>     <chr>     <chr>     <chr>     <chr>     <chr>
#1 19-00… CC        GG        TA        99        99        GA        TC
#2 19-00… CC        AA        TA        99        CC        GG        99
#3 19-00… CC        99        TA        CT        CA        GA        TC
#4 19-00… 99        99        TA        99        CA        GA        TC
## … with 9 more variables: X24855449 <chr>, X24855925 <chr>, X24856070 <chr>,
##   X24856086 <chr>, X24856329 <chr>, X24856389 <chr>, X24857235 <chr>,
##   X24857350 <chr>, X24857404 <chr>
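
The 99s in columns such as X24854903 can be traced by running the same rule on single vectors. The base R function below is only an approximation of the lambda in the answer (plain `if`/`else` stands in for `if_else()` and `first()`), and the name `rule` is hypothetical.

```r
# Hypothetical stand-in for the answer's lambda, applied to one
# sample's replicate calls at one marker
rule <- function(x) {
  calls <- x[x != ""]                      # drop blank calls
  if (length(unique(calls)) == 1) calls[1] else "99"
}

rule(c("", "CC"))    # one replicate blank  -> "CC"
rule(c("GG", "GG"))  # replicates agree     -> "GG"
rule(c("CT", "CC"))  # replicates disagree  -> "99"
rule(c("", ""))      # both blank           -> "99"
```

The last call shows that a marker missing in both replicates is also coded 99, which differs from the blank cells in the question's desired table.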

Sample data

df <- read.table(text =
    "Sample  Replicate   X24854000   X24854056   X24854764   X24854903   X24855066   X24855114   X24855316   X24855449   X24855925   X24856070   X24856086   X24856329   X24856389   X24857235   X24857350   X24857404
19-0001 1   ''  GG  TA  ''  ''  GA  TC  CC  GA  CT  CT  AG  GG  ''  GG  ''
19-0001 2   CC  GG  TA  ''  ''  ''  TC  GG  ''  CC  CC  ''  ''  ''  GG  TT
19-0002 1   CC  AA  TA  ''  CC  GG  ''  GG  ''  CC  CT  AG  ''  ''  ''  ''
19-0002 2   ''  ''  TA  ''  CC  GG  ''  GG  GG  CC  CT  AG  ''  ''  ''  ''
19-0003 1   CC  ''  TA  CT  CA  GA  TC  CC  GA  CC  CT  AG  GG  CT  GA  AT
19-0003 2   CC  ''  TA  CT  CA  GA  TC  CC  GA  CC  CT  AG  GG  CT  GA  AT
19-0004 1   ''  ''  TA  ''  CA  GA  TC  CC  ''  CC  CT  AG  GG  CT  ''  ''
19-0004 2   ''  ''  TA  ''  CA  GA  ''  CC  ''  CC  CT  AG  GG  ''  ''  ''", header = TRUE)

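【Comments】:

  • Hi Maurits, thank you very much for providing the code and recreating the sample data. Sorry to be a pain, but I get this error message: "Error: No tidyselect variables were registered Call rlang::last_error() to see a backtrace". I used your recreated sample data and still get the same error.
  • @RSun Hmm, it works for me. I've tested this on dplyr_0.8.3. What version of dplyr do you have?
  • Also, for samples where both replicates are missing a score, I would like to leave the cell blank or insert 0. Only replicates with mismatched scores should be replaced with 99.
  • @RSun "Also, for samples where both replicates are missing a score, I would like to leave the cell blank or insert 0." That makes this a more complex problem statement than your original question. Let's take it one step at a time: first confirm that you can reproduce the example I give in my answer.
  • @RSun You still need to update your main post to provide reproducible sample data (i.e. include the output of dput).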
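
For the follow-up request in these comments (leave a marker blank when both replicates are missing, and use 99 only for true mismatches), one possible approach is sketched below in base R. `merge_calls` and the toy data frame `small` are hypothetical names introduced for illustration; they are not part of the original answer.

```r
# Hypothetical helper: collapse the replicate calls of one sample at one marker
merge_calls <- function(x, mismatch = "99", missing = "") {
  x <- as.character(x)                      # also accepts factor columns
  calls <- unique(x[!is.na(x) & x != ""])   # distinct non-missing calls
  if (length(calls) == 0) return(missing)   # every replicate is blank
  if (length(calls) == 1) return(calls)     # agreement, or only one scored
  mismatch                                  # replicates disagree
}

# Toy subset with values taken from the question's table
small <- data.frame(
  Sample    = c("19-0001", "19-0001", "19-0004", "19-0004"),
  Replicate = c(1L, 2L, 1L, 2L),
  X24854000 = c("", "CC", "", ""),
  X24855449 = c("CC", "GG", "CC", "CC"),
  X24855316 = c("TC", "TC", "TC", ""),
  stringsAsFactors = FALSE
)

# Apply the helper to every marker column within each sample
markers <- grep("^X", names(small), value = TRUE)
merged <- aggregate(small[markers], by = list(Sample = small$Sample),
                    FUN = merge_calls)
merged
```

With dplyr 1.0 or later, the same helper could presumably be passed to `summarise(across(starts_with("X"), merge_calls))` after `group_by(Sample)`; that variant is an untested assumption rather than something confirmed in the thread.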