How to rbind() / dplyr::bind_rows() / data.table::rbindlist() data frames which contain data frame columns? Answers

【Question title】: How to rbind() / dplyr::bind_rows() / data.table::rbindlist() data frames which contain data frame columns?
【Posted】: 2020-01-02 23:28:14
【Question description】:

base R, dplyr and data.table cannot row-bind data frames that contain data frame columns:

x <- data.frame(a=1)
x$b <- data.frame(z=2)
y <- data.frame(a=3)
y$b <- data.frame(z=4)

# base and dplyr fail
rbind(x, y)
#> Warning: non-unique value when setting 'row.names': '1'
#> Error in `.rowNamesDF<-`(x, value = value): duplicate 'row.names' are not allowed
dplyr::bind_rows(x,y)
#> Error: Argument 2 can't be a list containing data frames

# data.table gives a result that doesn't make much sense to me
str(data.table::rbindlist(list(x,y)))
#> Warning in setDT(ans): Some columns are a multi-column type (such as a matrix
#> column): [2]. setDT will retain these columns as-is but subsequent operations
#> like grouping and joining may fail. Please consider as.data.table() instead
#> which will create a new column for each embedded column.
#> Classes 'data.table' and 'data.frame':   2 obs. of  2 variables:
#>  $ a: num  1 3
#>  $ b:'data.frame':   1 obs. of  2 variables:
#>   ..$ : num 2
#>   ..$ : num 4
#>  - attr(*, ".internal.selfref")=<externalptr>

Created on 2020-01-03 by the reprex package (v0.3.0)
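For readers less familiar with data frame columns, here is a minimal base R sketch of how such a column can be detected and accessed. It rebuilds x from the reprex above; the variable name is_df_col is introduced here purely for illustration.

```r
# Rebuild x from the question so the snippet is self-contained
x <- data.frame(a = 1)
x$b <- data.frame(z = 2)

# A data frame column is a single column whose value is itself a data frame
is_df_col <- vapply(x, is.data.frame, logical(1))
is_df_col
#>     a     b
#> FALSE  TRUE

# The nested column keeps its own names and is reached with a second `$`
names(x$b)
#> [1] "z"
x$b$z
#> [1] 2
```

So x has two columns, a and b, even though printing it shows the headers a and z.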

My expected output would row-bind the data frame columns as well, so we would end up with something like res:

res <- data.frame(a= c(1,3))
res$b <- data.frame(z = c(2,4))
res
#>   a z
#> 1 1 2
#> 2 3 4
str(res)
#> 'data.frame':    2 obs. of  2 variables:
#>  $ a: num  1 3
#>  $ b:'data.frame':   2 obs. of  1 variable:
#>   ..$ z: num  2 4

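How can I solve this?

【Question comments】:

Correct me if I'm wrong, but do you want your expected result to be: res$b <- data.frame(z = c(2,4))?
@Ali I hope my edit clarifies what I want.

Tags: r dplyr data.table

【Solution 1】:

We can bind the data frame columns separately from the regular columns. Here are 3 similar solutions, each wrapping one of the 3 functions mentioned in the question:

Base R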

rbind_fixed <- function(...){
  dfs <- list(...)
  # get all names of data.frame columns
  # vapply() always returns a logical vector, even for zero-column data frames
  get_df_col_ind <- function(df) vapply(df, is.data.frame, logical(1))
  df_col_names_list <- lapply(dfs, function(df) names(df[get_df_col_ind(df)]))
  df_col_names <- unique(do.call(c,df_col_names_list))
  # fail if these are not consistently data frames in all arguments
  for(df_col_name in df_col_names) {
    for(df in dfs){
      if(!is.null(df[[df_col_name]]) && !is.data.frame(df[[df_col_name]]))
        stop(df_col_name, " is not consistently a data frame column")
    }
  }
  # bind data frames, except for data frame columns
  dfs_regular <- lapply(dfs, function(df) df[setdiff(names(df), df_col_names)])
  res <- do.call(rbind, dfs_regular)
  # bind data frame columns separately and add them to the result
  for(df_col_name in df_col_names) {
    subdfs <- lapply(dfs, function(df) {
      if(df_col_name %in% names(df)) df[[df_col_name]] else
        data.frame(row.names = seq.int(nrow(df)))
    })
    # recursive to be robust in case of deep nested data frames 
    res[[df_col_name]] <- do.call(rbind_fixed, subdfs)
  }
  res
}
rbind_fixed(x, y)
#>   a z
#> 1 1 2
#> 2 3 4

dplyr

bind_rows_fixed <- function(...){
  # use list2() so we can use `!!!`, as we lose the "autosplice" feature of bind_rows
  dfs <- rlang::list2(...)
  # get all names of data.frame columns
  # vapply() always returns a logical vector, even for zero-column data frames
  get_df_col_ind <- function(df) vapply(df, is.data.frame, logical(1))
  df_col_names_list <- lapply(dfs, function(df) names(df[get_df_col_ind(df)]))
  df_col_names <- unique(do.call(c,df_col_names_list))
  # fail if these are not consistently data frames in all arguments
  for(df_col_name in df_col_names) {
    for(df in dfs){
      if(!is.null(df[[df_col_name]]) && !is.data.frame(df[[df_col_name]]))
        stop(df_col_name, " is not consistently a data frame column")
    }
  }
  # bind data frames, except for data frame columns
  dfs_regular <- lapply(dfs, function(df) df[setdiff(names(df), df_col_names)])
  res <- dplyr::bind_rows(dfs_regular)
  # bind data frame columns separately and add them to the result
  for(df_col_name in df_col_names) {
    subdfs <- lapply(dfs, function(df) {
      if(df_col_name %in% names(df)) df[[df_col_name]] else
        tibble::tibble(.rows = nrow(df))
    })

    # recursive to be robust in case of deep nested data frames 
    res[[df_col_name]] <- bind_rows_fixed(!!!subdfs)
  }
  res
}
bind_rows_fixed(x,y)
#>   a z
#> 1 1 2
#> 2 3 4

data.table

rbindlist_fixed <- function(l){
  dfs <- l
  # get all names of data.frame columns
  # vapply() always returns a logical vector, even for zero-column data frames
  get_df_col_ind <- function(df) vapply(df, is.data.frame, logical(1))
  df_col_names_list <- lapply(dfs, function(df) names(df[get_df_col_ind(df)]))
  df_col_names <- unique(do.call(c,df_col_names_list))
  # fail if these are not consistently data frames in all arguments
  for(df_col_name in df_col_names) {
    for(df in dfs){
      if(!is.null(df[[df_col_name]]) && !is.data.frame(df[[df_col_name]]))
        stop(df_col_name, " is not consistently a data frame column")
    }
  }
  # bind data frames, except for data frame columns
  dfs_regular <- lapply(dfs, function(df) df[setdiff(names(df), df_col_names)])
  res <- data.table::rbindlist(dfs_regular)
  # bind data frame columns separately and add them to the result
  for(df_col_name in df_col_names) {
    subdfs <- lapply(dfs, function(df) {
      if(df_col_name %in% names(df)) df[[df_col_name]] else
        data.frame(row.names = seq.int(nrow(df)))
    })
    # recursive to be robust in case of deep nested data frames 
    res[[df_col_name]] <- rbindlist_fixed(subdfs)
  }
  res
}
dt <- rbindlist_fixed(list(x,y))
dt
#>    a              b
#> 1: 1 <multi-column>
#> 2: 3 <multi-column>
str(dt)
#> Classes 'data.table' and 'data.frame':   2 obs. of  2 variables:
#>  $ a: num  1 3
#>  $ b:Classes 'data.table' and 'data.frame':  2 obs. of  1 variable:
#>   ..$ z: num  2 4
#>   ..- attr(*, ".internal.selfref")=<externalptr> 
#>  - attr(*, ".internal.selfref")=<externalptr>
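Stripped of validation and recursion, the idea shared by the three wrappers above can be sketched for the two inputs from the question as follows. This is only an illustration in base R, not a replacement for the functions above: the names is_df_col, regular and nested are made up here, and the sketch assumes a single, non-nested data frame column called b that is present in both inputs.

```r
# Inputs from the question
x <- data.frame(a = 1)
x$b <- data.frame(z = 2)
y <- data.frame(a = 3)
y$b <- data.frame(z = 4)

# 1. Find the data frame columns (assumed to be the same in x and y)
is_df_col <- vapply(x, is.data.frame, logical(1))

# 2. Bind the regular columns on their own
regular <- rbind(x[!is_df_col], y[!is_df_col])

# 3. Bind the nested data frames on their own
nested <- rbind(x$b, y$b)

# 4. Put the bound nested data frame back as a column
res <- regular
res$b <- nested

res$b$z
#> [1] 2 4
```

Binding the two parts separately sidesteps the duplicate row name error, because rbind() never sees a data frame column.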

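【Comments】:

I have multiple time series in long format, each covering 30 years. In total there are 965 time series (30 years each), i.e. 28950 observations or rows (965 * 30). I would like to write a loop that goes through each time series and splits it (30 years of data) into a training set (21 years) and a test set (the remaining 9 years), and I want this loop to run the same way across all 965 time series. How could I write a sensible for loop or a vectorized function for this? Thanks @Moo
Hi @Stackuser, this looks like it deserves to be posted as its own question!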

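【Solution 2】:

The problem seems to be that the bind functions have trouble with the row names of the data frame b inside x/y. In base R we can avoid this by renaming the rows (see below).

Important: dplyr can now handle this example. A workaround is no longer needed.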

# Setup
x <- data.frame(a=1)
x$b <- data.frame(z=2)
y <- data.frame(a=3)
y$b <- data.frame(z=4)

rbind(x, y) # still does not work
#> Warning: non-unique value when setting 'row.names': '1'
#> Error in `.rowNamesDF<-`(x, value = value): duplicate 'row.names' are not allowed
require(dplyr)
dplyr::bind_rows(x,y) # works!!!
#>   a z
#> 1 1 2
#> 2 3 4


# Avoid conflicting row names
row.names(x)   <- seq(nrow(y)+1, nrow(y)+nrow(x))
row.names(x$b) <- seq(nrow(y)+1, nrow(y)+nrow(x))

rbind(x, y) #works now, too
#>   a z
#> 2 1 2
#> 1 3 4

Created on 2020-06-27 by the reprex package (v0.3.0)
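The answer does not say which dplyr release made this work. The sketch below assumes it is dplyr 1.0.0, where bind_rows() was reimplemented on top of vctrs; treat that threshold as an assumption and adjust it if needed. The helper name bind_df_cols is hypothetical, and the fallback branch requires the vctrs package.

```r
x <- data.frame(a = 1)
x$b <- data.frame(z = 2)
y <- data.frame(a = 3)
y$b <- data.frame(z = 4)

# Hypothetical helper: use bind_rows() when the installed dplyr is recent
# enough, otherwise fall back to vctrs::vec_rbind()
bind_df_cols <- function(...) {
  # Assumption: data frame columns are supported from dplyr 1.0.0 onwards
  if (utils::packageVersion("dplyr") >= "1.0.0") {
    dplyr::bind_rows(...)
  } else {
    vctrs::vec_rbind(...)
  }
}

res <- bind_df_cols(x, y)

# The nested column should survive as a data frame column
is.data.frame(res$b)
res$b$z
```

Checking is.data.frame(res$b) is a quick way to confirm that the nested structure was kept rather than flattened.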

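【Comments】:

【Solution 3】:

Adding a new answer for clarity: we can expect bind_rows() to support data frame columns in the future, but in the meantime we can use vctrs::vec_rbind(), as suggested by Romain François in https://github.com/tidyverse/dplyr/issues/4226.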

x <- data.frame(a=1)
x$b <- data.frame(z=2)
y <- data.frame(a=3)
y$b <- data.frame(z=4)

res <- vctrs::vec_rbind(x,y)

res
#>   a z
#> 1 1 2
#> 2 3 4

str(res)
#> 'data.frame':    2 obs. of  2 variables:
#>  $ a: num  1 3
#>  $ b:'data.frame':   2 obs. of  1 variable:
#>   ..$ z: num  2 4

Created on 2020-01-06 by the reprex package (v0.3.0)
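When the data frames already sit in a list, they can be spliced into vec_rbind() with `!!!`, since its dots are dynamic. A minimal sketch, assuming the vctrs package is installed; the third data frame w and the list dfs are made up for illustration.

```r
x <- data.frame(a = 1)
x$b <- data.frame(z = 2)
y <- data.frame(a = 3)
y$b <- data.frame(z = 4)
# Extra input, not part of the original question
w <- data.frame(a = 5)
w$b <- data.frame(z = 6)

dfs <- list(x, y, w)

# `!!!` splices the list elements in as separate arguments
res <- vctrs::vec_rbind(!!!dfs)

res$a
#> [1] 1 3 5
res$b$z
#> [1] 2 4 6
```

This avoids do.call() and keeps b as a data frame column in the result.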

【Comments】: