R - Merging Multiple Large Dataframes (Collating): Answers

【Question title】: R - Merging Multiple Large Dataframes (Collating)
【Posted】: 2016-03-21 22:43:47
【Question】:

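I would prefer to do the following in R, but I am open to other (easy to learn) solutions.

I have multiple (say 99) tab-delimited files (let's call them S1.txt through S99.txt) containing tables, all with exactly the same format. Each table is roughly 2,000,000 rows by 5 columns. Here is a toy example: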

ID    Chr    Position    DP1    DP2
A1    1       123        1.5    2.0
A2    1       124        1.4    0.3

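By definition, the IDs are unique and always in the same order, and Chr and Position are always in the same order as well. The only things that differ between input files are the DP1 and DP2 columns. The output table I want is "collated", if that is the right word. Here is an example of the output if there were only 3 input files.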

ID    Chr    Position    S1.DP1  S1.DP2    S2.DP1    S2.DP2  S3.DP1  S3.DP2
A1    1       123        1.5      2.0       1.2        2.0     1.5     2.1
A2    1       124        1.4      0.3       1.0        0.5     0.5     0.05

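Note that a new column is created for DP1 and for DP2 from each input file. In addition, the column names are informative (they tell me which input file the value came from and which data point, DP, it is).

I found a question that covers the case where the columns differ: R: merging a lot of data.frames. I also know about merge, although I feel you end up with odd column names: How to join (merge) data frames (inner, outer, left, right)?

My other solution is to initialize a data frame, then load each file and add its data points, but that uses loops and is very slow and ugly. So I need a more elegant solution. Thanks for your help.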

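【Comments】:

Tags: r merge

【Solution 1】:

I re-read your question and thought of a better solution.

First, I would not load all the .txt files into R at once. If your .txt files are 2e6x5 and there are 100 of them, you will probably run out of RAM before they are all loaded. I would load them one at a time and merge them iteratively.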

library(readr) # Use this to load your data; it is typically faster than the base functions

# pattern is a regular expression, so match a literal ".txt" suffix
f <- list.files(path = "path/to/file", pattern = "\\.txt$", full.names = TRUE)

d <- read_delim(f[1], delim = "\t")

idx <- c("ID", "Chr", "Position")

# seq_along(f)[-1] is empty when there is only one file, unlike seq(2, length(f))
for (i in seq_along(f)[-1]) {

    d_temp <- read_delim(f[i], delim = "\t")

    d <- merge(d, d_temp, by = idx)

    rm(d_temp) # not necessary, but I like to include it to be explicit
}

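Naming the columns of d

# merge() keeps the value columns in file order (DP1, DP2 of f[1], then of f[2], ...),
# so the DP label must vary fastest; sample names are taken from the file names
n <- expand.grid(c("DP1", "DP2"), sub("\\.txt$", "", basename(f)))
names(d)[!names(d) %in% idx] <- paste(n[ ,2], n[ ,1], sep = ".")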

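Update

Ugh, I missed the obvious: if you really have 100 .txt files of 2e6x5, you may not be able to use R for this task. I doubt it is possible to store a 2e6x500 data frame in R. Even on a server with plenty of RAM, the computation time will be non-trivial. I think the most important question going forward is what you want to do with this data. Once you have answered that, you can work with your data efficiently.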
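A rough back-of-the-envelope size estimate may help when judging whether the merged table fits in memory. The sketch below is only illustrative: it assumes 100 files, 3 shared key columns plus 2 data columns per file, and that every cell is stored as an 8-byte double (character IDs and R's own overhead are ignored, and merge() needs additional working memory on top of this).

```r
# Assumed sizes (taken from the figures discussed above, not measured)
n_rows  <- 2e6                 # rows per file
n_files <- 100                 # number of input files
n_cols  <- 3 + 2 * n_files     # ID, Chr, Position + DP1/DP2 per file

# Assumption: every cell costs 8 bytes (a double); real usage will differ
bytes <- n_rows * n_cols * 8
gib   <- bytes / 1024^3

print(n_cols)
print(round(gib, 2))
```

The estimate covers only the final table; intermediate copies made during reading and merging can multiply the peak requirement.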

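【Comments】:

I am putting all the samples into one file in order to format them for a program, which requires the file to be in the format I described. I was worried this would be too much for R, but I did not know any other way to do it. I have an upstream program that creates one file per sample, and now I need to put them all together for a downstream program.
Are there any other relatively simple options? I don't mind learning something new, but I also need to get this done for the project. Thanks.
@GaiusAugustus The first thing I would do is try one of the suggested methods and see whether it works. But Unix will be your fastest option. To do that, use the cut command to remove the identifier columns (i.e. ID, Chr, Position) from S2.txt through S99.txt. Then paste S1.txt through S99.txt together. Unix is not only faster, it also will not be limited by memory.
I didn't know about the paste command! Wow, this changes my world. Thanks! I will follow this advice.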
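The same "drop the identifier columns, then glue the files side by side" idea can also be sketched in R, for cases where the data does fit in memory. This is only an illustration, not the commenter's method: it relies on the question's guarantee that every file lists the same rows in the same order, and the temporary directory and toy files are created here purely so the example runs on its own.

```r
# Write three tiny tab-delimited files (toy data from the question) to a temp dir
dir <- tempfile("collate_demo")
dir.create(dir)
toy <- list(
  S1 = data.frame(ID = c("A1", "A2"), Chr = 1L, Position = c(123L, 124L),
                  DP1 = c(1.5, 1.4), DP2 = c(2.0, 0.3)),
  S2 = data.frame(ID = c("A1", "A2"), Chr = 1L, Position = c(123L, 124L),
                  DP1 = c(1.2, 1.0), DP2 = c(2.0, 0.5)),
  S3 = data.frame(ID = c("A1", "A2"), Chr = 1L, Position = c(123L, 124L),
                  DP1 = c(1.5, 0.5), DP2 = c(2.1, 0.05))
)
for (s in names(toy)) {
  write.table(toy[[s]], file.path(dir, paste0(s, ".txt")),
              sep = "\t", quote = FALSE, row.names = FALSE)
}

files   <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
samples <- sub("\\.txt$", "", basename(files))
idx     <- c("ID", "Chr", "Position")

# Identifier columns are taken from the first file only
ids <- read.delim(files[1], stringsAsFactors = FALSE)[idx]

# colClasses = "NULL" skips a column while reading (similar to `cut`)
dps <- lapply(seq_along(files), function(i) {
  dp <- read.delim(files[i],
                   colClasses = c("NULL", "NULL", "NULL", "numeric", "numeric"))
  names(dp) <- paste(samples[i], names(dp), sep = ".")
  dp
})

# Column-bind everything (similar to `paste`); this assumes identical row order
out <- do.call(cbind, c(list(ids), dps))
print(out)
```

Because no key matching is done, this avoids the cost of merge(), but it silently produces wrong output if any file has its rows in a different order, so a check such as comparing the ID columns may be worth adding.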

【Solution 2】:

I will assume that all the files are stored in one folder and that you want to load every file in that folder with a .txt extension.

## List all the files in the current directory that end in .txt
files <- list.files(path = ".", pattern = "\\.txt$")

## Load them into a list called datlist and name each element after the file it came from
## (header = TRUE because the first line of each file holds the column names)
datlist <- lapply(files, read.table, sep = "\t", header = TRUE)
names(datlist) <- sub("\\.txt$", "", files)

However, for the sake of a reproducible example, I will create a list of data frames by hand, matching the ones you showed.

S1 <- read.table(text = "ID    Chr    Position    DP1    DP2
A1    1       123        1.5    2.0
A2    1       124        1.4    0.3", header = TRUE)

S2 <- read.table(text = "ID    Chr    Position    DP1    DP2
A1    1       123        1.2    2.0
A2    1       124        1.0    0.5", header = TRUE)

S3 <- read.table(text = "ID    Chr    Position    DP1    DP2
A1    1       123        1.5    2.1
A2    1       124        0.5    0.05", header = TRUE)

datlist <- list(S1 = S1, S2 = S2, S3 = S3)

Now load the packages we are going to use

library("dplyr")
library("tidyr")

Using a mix of dplyr and tidyr functions, we can get the result you want:

## First, combine the list into a single data frame, adding a column to indicate
## which file each row came from
bind_rows(datlist, .id = "file") %>%
  ## Gather this into a longer format with DP1/DP2 as variables
  gather(key = col, value = value, which(!names(.) %in% c("ID", "Chr", "Position", "file"))) %>%
  ## Create a new column that combines the file name and DP1/DP2 -- this will be
  ## the final column names
  unite(newcol, file, col, sep = ".") %>%
  ## Spread the data so that each combination of file and DP1/DP2 is its own
  ## column
  spread(newcol, value)

Final result:

## Source: local data frame [2 x 9]

##       ID   Chr Position S1.DP1 S1.DP2 S2.DP1 S2.DP2 S3.DP1 S3.DP2
##   (fctr) (int)    (int)  (dbl)  (dbl)  (dbl)  (dbl)  (dbl)  (dbl)
## 1     A1     1      123    1.5    2.0    1.2    2.0    1.5   2.10
## 2     A2     1      124    1.4    0.3    1.0    0.5    0.5   0.05
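In current tidyr releases, gather() and spread() are superseded by pivot_longer() and pivot_wider(). The following is a sketch of the same pipeline with the newer functions, assuming tidyr 1.0.0 or later is installed; the toy data frames are rebuilt here so the block runs on its own.

```r
library(dplyr)
library(tidyr)

# Toy data from the question
S1 <- data.frame(ID = c("A1", "A2"), Chr = 1L, Position = c(123L, 124L),
                 DP1 = c(1.5, 1.4), DP2 = c(2.0, 0.3))
S2 <- data.frame(ID = c("A1", "A2"), Chr = 1L, Position = c(123L, 124L),
                 DP1 = c(1.2, 1.0), DP2 = c(2.0, 0.5))
S3 <- data.frame(ID = c("A1", "A2"), Chr = 1L, Position = c(123L, 124L),
                 DP1 = c(1.5, 0.5), DP2 = c(2.1, 0.05))
datlist <- list(S1 = S1, S2 = S2, S3 = S3)

wide <- bind_rows(datlist, .id = "file") %>%
  # Stack DP1/DP2 into one value column, remembering which column each came from
  pivot_longer(cols = c(DP1, DP2), names_to = "col", values_to = "value") %>%
  # Build the final column names, e.g. "S1.DP1"
  unite(newcol, file, col, sep = ".") %>%
  # One column per file/DP combination
  pivot_wider(names_from = newcol, values_from = value)

print(wide)
```

As with the gather()/spread() version, the long intermediate table holds every value as its own row, so memory use on the full-size data is likely to be considerably higher than for the final wide table.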

【Comments】:

【Solution 3】:

A one-liner with base R

l = list(S1=S1, S2=S2, S3=S3)

idx = c("ID","Chr","Position")

d <- Reduce(function(x, y) merge(x, y, by = idx), l)

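Update

I forgot about the variable names. This may be a bit overkill, but it is the best way I could think of to avoid hard-coding the names.

 # The merged columns come out as DP1, DP2 for S1, then for S2, ..., so the DP name must vary fastest
 n <- expand.grid(setdiff(names(S1), idx), names(l))
 names(d)[!names(d)%in%idx] <- paste(n[ ,2], n[ ,1], sep = ".")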
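An alternative that avoids renaming after the fact is to prefix each data frame's value columns before merging. This is a sketch rather than the answer's own method; it also sidesteps the ".x"/".y" suffixes that merge() adds when the value columns share names. The toy data frames are rebuilt here so the block is self-contained.

```r
# Toy data from the question
S1 <- data.frame(ID = c("A1", "A2"), Chr = 1L, Position = c(123L, 124L),
                 DP1 = c(1.5, 1.4), DP2 = c(2.0, 0.3))
S2 <- data.frame(ID = c("A1", "A2"), Chr = 1L, Position = c(123L, 124L),
                 DP1 = c(1.2, 1.0), DP2 = c(2.0, 0.5))
S3 <- data.frame(ID = c("A1", "A2"), Chr = 1L, Position = c(123L, 124L),
                 DP1 = c(1.5, 0.5), DP2 = c(2.1, 0.05))

l   <- list(S1 = S1, S2 = S2, S3 = S3)
idx <- c("ID", "Chr", "Position")

# Prefix every non-key column with the name of the list element it belongs to
renamed <- Map(function(df, s) {
  is_value <- !names(df) %in% idx
  names(df)[is_value] <- paste(s, names(df)[is_value], sep = ".")
  df
}, l, names(l))

# Merge on the key columns; value columns are already unique, so no suffixes appear
d <- Reduce(function(x, y) merge(x, y, by = idx), renamed)
print(d)
```

Since the names travel with the data, the result does not depend on the column order that merge() happens to produce.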

【Comments】:

Why give two separate answers? It would be better to combine them into one, imo.