如何将两组数据融合成R中的两列？ [复制]答案

【问题标题】：How to melt two groups of data into two columns in R? [duplicate]如何将两组数据融合成R中的两列？ [复制]
【发布时间】：2019-03-25 06:51:22
【问题描述】：

我在下面有这些数据。我想熔化一列中的所有No_of.reads cloumns 和另一列中的所有_contamination_ 列。所以最终的数据框将有diluted_sample、No_of_reads 和_contamination_ 列。我尝试分两步执行此操作，但这会给我反复观察。正确的做法是什么？

代码：

test.dput.melted <- melt(test.dput, id = 1:3, measure = 4:7)
test.dput.melted <- melt(test.dput.melted, id = c(1,4,5), measure = 2:3)

数据：

 test.dput<- structure(list(diluted_sample = c("100%", "95%", "90%", "85%", 
"80%", "75%"), No_of_reads_from_NA12878 = c("15,000,000", "14,250,000", 
"13,500,000", "12,750,000", "12,000,000", "11,250,000"), No_of_reads_from_NA12877 = c("0", 
"750,000", "1,500,000", "2,250,000", "3,000,000", "3,750,000"
), tEst_contamination_of_NA12878 = c("99.60%", "99.10%", "96.80%", 
"92.60%", "88%", "82.60%"), pair_contamination_of_NA12878 = c("100.00%", 
"94.15%", "88.72%", "83.36%", "78.20%", "73.08%"), tEst_contamination_of_NA12877 = c("0.10%", 
"7%", "13.60%", "20.10%", "26.20%", "32.10%"), pair_contamination_of_NA12877 = c("0.10%", 
"5.21%", "10.50%", "15.85%", "20.92%", "26.04%")), .Names = c("diluted_sample", 
"No_of_reads_from_NA12878", "No_of_reads_from_NA12877", "tEst_contamination_of_NA12878", 
"pair_contamination_of_NA12878", "tEst_contamination_of_NA12877", 
"pair_contamination_of_NA12877"), row.names = c(NA, 6L), class = "data.frame")

【问题讨论】：

您想要所有可能的读数和污染组合吗？
@iod 我不想要组合，只想要带有读取和污染列的长格式数据。
我现在不在我的电脑前，但我的解决方案是创建一个带有熔化读取列的 df，和一个带有熔化污染列的 df，然后执行 left_join。

标签： r data.table melt

【解决方案1】：

由于您标记了data.table 和melt

library(magrittr)
library(data.table)
setDT(test.dput)


n.reads <- 
  test.dput[, grep('diluted|reads', names(test.dput)), with = F] %>% 
    melt(1, variable.name = 'Which_No_of_reads',
            value.name    = 'No_of_reads') %>% 
    .[, Which_No_of_reads := gsub('No_of_reads_from_', '', Which_No_of_reads)]

contam <- 
  test.dput[, grep('diluted|contamination', names(test.dput)), with = F] %>% 
    melt(1, variable.name = 'Which_contamination',
            value.name    = '_contamination_') %>% 
    .[, Which_contamination := gsub('contamination_of_', '', Which_contamination)]

cbind(n.reads, contam) %>% 
  .[, unique(names(.)), with = F]

#    diluted_sample Which_No_of_reads No_of_reads Which_contamination _contamination_
#  1:           100%           NA12878  15,000,000        tEst_NA12878          99.60%
#  2:            95%           NA12878  14,250,000        tEst_NA12878          99.10%
#  3:            90%           NA12878  13,500,000        tEst_NA12878          96.80%
#  4:            85%           NA12878  12,750,000        tEst_NA12878          92.60%
#  5:            80%           NA12878  12,000,000        tEst_NA12878             88%
#  6:            75%           NA12878  11,250,000        tEst_NA12878          82.60%
#  7:           100%           NA12877           0        pair_NA12878         100.00%
#  8:            95%           NA12877     750,000        pair_NA12878          94.15%
#  9:            90%           NA12877   1,500,000        pair_NA12878          88.72%
# 10:            85%           NA12877   2,250,000        pair_NA12878          83.36%
# 11:            80%           NA12877   3,000,000        pair_NA12878          78.20%
# 12:            75%           NA12877   3,750,000        pair_NA12878          73.08%
# 13:           100%           NA12878  15,000,000        tEst_NA12877           0.10%
# 14:            95%           NA12878  14,250,000        tEst_NA12877              7%
# 15:            90%           NA12878  13,500,000        tEst_NA12877          13.60%
# 16:            85%           NA12878  12,750,000        tEst_NA12877          20.10%
# 17:            80%           NA12878  12,000,000        tEst_NA12877          26.20%
# 18:            75%           NA12878  11,250,000        tEst_NA12877          32.10%
# 19:           100%           NA12877           0        pair_NA12877           0.10%
# 20:            95%           NA12877     750,000        pair_NA12877           5.21%
# 21:            90%           NA12877   1,500,000        pair_NA12877          10.50%
# 22:            85%           NA12877   2,250,000        pair_NA12877          15.85%
# 23:            80%           NA12877   3,000,000        pair_NA12877          20.92%
# 24:            75%           NA12877   3,750,000        pair_NA12877          26.04%

【讨论】：

不知道为什么你的解决方案给我的结果与我上面的代码相似，但它不起作用。然而，iod 的解决方案对我有用。还是谢谢！

【解决方案2】：

使用tidyr::gather 和dplyr：

test.melted<-gather(test.dput,key="reads_source",value="reads",starts_with("No_of_reads"))
test.melted.NA12878<-test.melted[test.melted$reads_source=="No_of_reads_from_NA12878",] %>% 
  gather(key="contamination_type",value="contamination",
         contains("contamination_of_NA12878"))
test.melted.NA12877<-test.melted[test.melted$reads_source=="No_of_reads_from_NA12877",] %>% 
  gather(key="contamination_type",value="contamination",
         contains("contamination_of_NA12877"))
test.melted.full<-rbind(test.melted.NA12877[,c(-2:-3)],test.melted.NA12878[,c(-2:-3)])

这个解决方案显然只适用于这个特定的数据集。如果您有更多阅读站点，也可以使用 greping 和可能的 for 循环来管理。

【讨论】：

谢谢，但这仍然给出了过度表示的值。
你能指出这些吗？我不确定你的意思
更新了我的解决方案
完美！现在终于可以使用了。