【Question title】: Merge multiple tables by row and column in R
【Posted】: 2016-08-22 16:16:50
【Question】:

Suppose I have three overlapping tables.

    A   B   C   D
A   12  16  17  14
B   62  66  9   85
C   37  31  59  75
D   74  76  89  25

    A   B   E   F
A   12  16  11  19
B   62  66  57  28
E   24  21  4   51
F   7   1   68  22

    C   D   E   F
C   59  75  77  80
D   89  25  88  30
E   67  87  4   51
F   39  69  68  22

I want to combine them by row and by column, with no duplicated rows or columns, and with the row and column names preserved.

    A   B   C   D   E   F
A   12  16  17  14  11  19
B   62  66  9   85  57  28
C   37  31  59  75  77  80
D   74  76  89  25  88  30
E   24  21  67  87  4   51
F   7   1   39  69  68  22

After three days, I managed to cobble this together (with help from here, here, and here, plus others I have probably forgotten):

#Import tables as dataframes
file.names <- dir(pattern = ".tab")
for(i in 1:length(file.names)){
  nam <- paste("table.", i, sep = "")  #rename the data as table.1 ... table.n
  assign(nam, as.data.frame(as.matrix(read.delim(file.names[i],
         row.names=1, header=TRUE, sep="\t", stringsAsFactors=FALSE))))
}

#Import an empty file (i.e. just column and row names) 
#that you will fill with your smaller data tables
out.file <- as.data.frame(as.matrix(read.delim("Blank_table.csv",
                                               row.names=1, header=TRUE, sep=",")))

#Create a list of the dataframes
file.names = lapply(ls(pattern = "table.[0-9]"), get)

#Add columns that we can use for merging
#because using 'merge' on dataframes destroys row names
out.file$rows <- rownames(out.file)
for(i in 1:length(file.names)){
  rownams <- rownames(file.names[[i]])
  file.names[i] <- lapply(file.names[i], cbind, rows = rownams)
}

#Combine the tables
for(i in 1:length(file.names)){
  file <- file.names[i]
  out.file <- aggregate(. ~ rows, data = merge(out.file, file, all = TRUE),
                        na.action = na.pass, FUN = mean, na.rm = TRUE)
}

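This does what I want, but it takes a very long time when I merge hundreds of tables. I suspect there is a simpler way to do this, but I would rather not spend another three days of trial and error getting there.

What I have in mind is something like this:

  1. Import the empty table n times into a list of data frames
  2. Import the data tables and merge each one into one of the empty data frames in the list
  3. Create a new data frame holding the mean of the corresponding cells across all imported data frames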
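The three steps above could be sketched roughly as follows, using a 3-D array with one NA-filled layer per table in place of a list of empty data frames. The two small tables and the names `tables`, `layers`, and `averaged` are made up for illustration; this is one possible reading of the plan, not a tested solution for the real data.

```r
# Hypothetical small inputs standing in for the imported tables
tables <- list(
  data.frame(A = c(1, 3), B = c(2, 4), row.names = c("A", "B")),
  data.frame(B = c(6, 8), C = c(7, 9), row.names = c("B", "C"))
)
rows <- sort(unique(unlist(lapply(tables, rownames))))
cols <- sort(unique(unlist(lapply(tables, colnames))))

# Steps 1 and 2: one NA-filled layer per table, filled by row/column name
layers <- array(NA_real_,
                dim = c(length(rows), length(cols), length(tables)),
                dimnames = list(rows, cols, NULL))
for (k in seq_along(tables)) {
  tb <- tables[[k]]
  layers[rownames(tb), colnames(tb), k] <- as.matrix(tb)
}

# Step 3: cell-wise mean across layers, ignoring NA
averaged <- rowMeans(layers, na.rm = TRUE, dims = 2)
averaged[is.nan(averaged)] <- NA  # cells that no table covers
averaged
```

With many tables and a large set of row and column names, the array holds rows x columns x tables values, so memory use grows with the number of tables.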

Any suggestions?

Update: here are my sample tables from dput:

table.1 <- structure(list(A = c(12L, 62L, 37L, 74L), B = c(16L, 66L, 31L, 
76L), C = c(17L, 9L, 59L, 89L), D = c(14L, 85L, 75L, 25L)), .Names = c("A", 
"B", "C", "D"), row.names = c("A", "B", "C", "D"), class = "data.frame")

table.2 <- structure(list(A = c(12L, 62L, 24L, 7L), B = c(16L, 66L, 21L, 
1L), E = c(11L, 57L, 4L, 68L), F = c(19L, 28L, 51L, 22L)), .Names = c("A", 
"B", "E", "F"), row.names = c("A", "B", "E", "F"), class = "data.frame")

table.3 <- structure(list(C = c(59L, 89L, 67L, 39L), D = c(75L, 25L, 87L, 
69L), E = c(77L, 88L, 4L, 68L), F = c(80L, 30L, 51L, 24L)), .Names = c("C", 
"D", "E", "F"), row.names = c("C", "D", "E", "F"), class = "data.frame")

out.file <- structure(list(A = c(NA, NA, NA, NA, NA, NA), B = c(NA, NA, NA, 
NA, NA, NA), C = c(NA, NA, NA, NA, NA, NA), D = c(NA, NA, NA, 
NA, NA, NA), E = c(NA, NA, NA, NA, NA, NA), F = c(NA, NA, NA, 
NA, NA, NA)), .Names = c("A", "B", "C", "D", "E", "F"), row.names = c("A", 
"B", "C", "D", "E", "F"), class = "data.frame")

【Comments】:

  • Are you getting any errors?
  • Please share your data using dput.
  • @pableiros No, I am not getting any errors.

Tags: r merge


【Solution 1】:

A subsetting solution that needs no extra packages (using df1, df2, and df3 as defined by @emehex):

# List of data frames to combine
DF <- list(df1, df2, df3)

# Union of all column and row names
COL <- unique(unlist(lapply(DF, colnames)))
ROW <- unique(unlist(lapply(DF, rownames)))
# Empty matrix covering every row/column combination
TOTAL <- matrix(data = NA, nrow = length(ROW), ncol = length(COL), dimnames = list(ROW, COL))
# Fill each table into its named rows and columns
for (df in DF) {
    TOTAL[rownames(df), colnames(df)] <- as.matrix(df)
}

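Subsetting is faster than merging, and it is likely to be more efficient with a large number of data frames (see @aichao's answer here: For each row extract the value in the column name that match another value in the cell). To use it with your code, you only need to replace the DF list with your file.names list.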
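As a rough sketch of that adaptation, the code below reads every `.tab` file in a folder into a list and fills the combined matrix by name. The temporary folder, the file names `one.tab` and `two.tab`, and their contents are made up for illustration; only the `read.delim` call with `row.names = 1` mirrors the import step in the question.

```r
# Hypothetical input: two small tab-delimited tables in a temporary folder
tmp <- tempfile("tables_")
dir.create(tmp)
one <- data.frame(A = c(12, 62), B = c(16, 66), row.names = c("A", "B"))
two <- data.frame(B = c(66, 31), C = c(9, 59), row.names = c("B", "C"))
write.table(one, file.path(tmp, "one.tab"), sep = "\t", quote = FALSE, col.names = NA)
write.table(two, file.path(tmp, "two.tab"), sep = "\t", quote = FALSE, col.names = NA)

# Read every table into a list instead of creating table.1 ... table.n with assign()
paths <- list.files(tmp, pattern = "\\.tab$", full.names = TRUE)
DF <- lapply(paths, read.delim, row.names = 1, check.names = FALSE)

# Same subsetting approach, applied to the list of imported tables
ROW <- unique(unlist(lapply(DF, rownames)))
COL <- unique(unlist(lapply(DF, colnames)))
TOTAL <- matrix(NA, nrow = length(ROW), ncol = length(COL), dimnames = list(ROW, COL))
for (df in DF) {
  TOTAL[rownames(df), colnames(df)] <- as.matrix(df)
}
TOTAL
```

Where two tables disagree on an overlapping cell, the table processed last wins. That differs from the question's code, which averages overlapping values.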

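【Comments】:

  • I have run it on the sample data and it looks great. I will let you know how it handles my hundreds of real data tables.
  • With the method I posted earlier, merging 1275 ten-by-ten tables took about half an hour. Your subsetting solution finished in about 1 minute. Awesome!
  • Great! Glad I could help.
【Solution 2】:

I don't know what your .csvs look like, so this is the best I can do (using the three sample tables above)...

Data import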

df1 <- read.table(header = TRUE, text = 
"A   B   C   D
A   12  16  17  14
B   62  66  9   85
C   37  31  59  75
D   74  76  89  25")

df2 <- read.table(header = TRUE, text = 
"A   B   E   F
A   12  16  11  19
B   62  66  57  28
E   24  21  4   51
F   7   1   68  22")

df3 <- read.table(header = TRUE, text = 
"C   D   E   F
C   59  75  77  80
D   89  25  88  30
E   67  87  4   51
F   39  69  68  22")

Solution with dplyr, tibble, and tidyr

library(dplyr)
library(tibble)
library(tidyr)

# intermediate tables for rownames and gathering
df1_c <- df1 %>% 
    rownames_to_column("Name") %>% 
    gather(key, value, -Name)

df2_c <- df2 %>% 
    rownames_to_column("Name") %>% 
    gather(key, value, -Name)

df3_c <- df3 %>% 
    rownames_to_column("Name") %>% 
    gather(key, value, -Name)

# formatted dataframe from spread
df <- bind_rows(df1_c, df2_c, df3_c) %>% 
    group_by(Name, key) %>% 
    distinct(.keep_all = TRUE) %>% 
    spread(key, value)

Output

df
#    Name     A     B     C     D     E     F
# * <chr> <int> <int> <int> <int> <int> <int>
# 1     A    12    16    17    14    11    19
# 2     B    62    66     9    85    57    28
# 3     C    37    31    59    75    77    80
# 4     D    74    76    89    25    88    30
# 5     E    24    21    67    87     4    51
# 6     F     7     1    39    69    68    22
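For readers who want the same long-then-wide idea without extra packages, and for a list of many tables rather than three named ones, a base R sketch might look like this. The helper `to_long`, the names `long` and `wide`, and the two small tables are hypothetical and only illustrate the reshaping steps; this is not the answer's original code.

```r
# Hypothetical small inputs standing in for df1, df2, df3, ...
tables <- list(
  data.frame(A = c(12, 62), B = c(16, 66), row.names = c("A", "B")),
  data.frame(B = c(66, 31), C = c(9, 59), row.names = c("B", "C"))
)

# Long format: one row per (Name, key, value) cell.
# unlist() walks the data frame column by column, so row names repeat
# once per column and each column name repeats once per row.
to_long <- function(tb) {
  data.frame(Name  = rep(rownames(tb), times = ncol(tb)),
             key   = rep(colnames(tb), each = nrow(tb)),
             value = unlist(tb, use.names = FALSE),
             stringsAsFactors = FALSE)
}
long <- unique(do.call(rbind, lapply(tables, to_long)))

# Wide format: rows are Name, columns are key; uncovered cells stay NA.
# mean() only matters where tables disagree on an overlapping cell.
wide <- tapply(long$value, list(long$Name, long$key), mean)
wide
```

The result is a matrix with row and column names rather than a tibble with a `Name` column.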
