【Question title】: How to rbind() / dplyr::bind_rows() / data.table::rbindlist() data frames which contain data frame columns?
【Posted】: 2020-01-02 23:28:14
【Question description】:

Base R, dplyr and data.table all fail to row-bind data frames that contain data frame columns:

x <- data.frame(a=1)
x$b <- data.frame(z=2)
y <- data.frame(a=3)
y$b <- data.frame(z=4)

# base and dplyr fail
rbind(x, y)
#> Warning: non-unique value when setting 'row.names': '1'
#> Error in `.rowNamesDF<-`(x, value = value): duplicate 'row.names' are not allowed
dplyr::bind_rows(x,y)
#> Error: Argument 2 can't be a list containing data frames

# data.table gives a result that doesn't make much sense to me
str(data.table::rbindlist(list(x,y)))
#> Warning in setDT(ans): Some columns are a multi-column type (such as a matrix
#> column): [2]. setDT will retain these columns as-is but subsequent operations
#> like grouping and joining may fail. Please consider as.data.table() instead
#> which will create a new column for each embedded column.
#> Classes 'data.table' and 'data.frame':   2 obs. of  2 variables:
#>  $ a: num  1 3
#>  $ b:'data.frame':   1 obs. of  2 variables:
#>   ..$ : num 2
#>   ..$ : num 4
#>  - attr(*, ".internal.selfref")=<externalptr>

Created on 2020-01-03 by the reprex package (v0.3.0)

My expected output is for the data frame columns to be row-bound as well, so that we end up with something like res:

res <- data.frame(a= c(1,3))
res$b <- data.frame(z = c(2,4))
res
#>   a z
#> 1 1 2
#> 2 3 4
str(res)
#> 'data.frame':    2 obs. of  2 variables:
#>  $ a: num  1 3
#>  $ b:'data.frame':   2 obs. of  1 variable:
#>   ..$ z: num  2 4

How can I work around this?

【Comments】:

  • Correct me if I'm wrong, but do you want your expected result to be: res$b <- data.frame(z = c(2,4))?
  • @Ali I hope my edit clarifies what I want

Tags: r dplyr data.table


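【Solution 1】:

We can bind the data frame columns separately from the regular columns. Here are 3 similar solutions, each wrapping one of the 3 functions mentioned in the question:

Base R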

rbind_fixed <- function(...){
  dfs <- list(...)
  # get all names of data.frame columns
  get_df_col_ind <- function(df) sapply(df, is.data.frame)
  df_col_names_list <- lapply(dfs, function(df) names(df[get_df_col_ind(df)]))
  df_col_names <- unique(do.call(c,df_col_names_list))
  # fail if these are not consistently data frames in all arguments
  for(df_col_name in df_col_names) {
    for(df in dfs){
      if(!is.null(df[[df_col_name]]) && !is.data.frame(df[[df_col_name]]))
        stop(df_col_name, " is not consistently a data frame column")
    }
  }
  # bind data frames, except for data frame columns
  dfs_regular <- lapply(dfs, function(df) df[setdiff(names(df), df_col_names)])
  res <- do.call(rbind, dfs_regular)
  # bind data frame columns separately and add them to the result
  for(df_col_name in df_col_names) {
    subdfs <- lapply(dfs, function(df) {
      if(df_col_name %in% names(df)) df[[df_col_name]] else
        data.frame(row.names = seq.int(nrow(df)))
    })
    # recursive to be robust in case of deep nested data frames 
    res[[df_col_name]] <- do.call(rbind_fixed, subdfs)
  }
  res
}
rbind_fixed(x, y)
#>   a z
#> 1 1 2
#> 2 3 4
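
The idea behind rbind_fixed() can also be followed by hand on the two data frames from the question. The following is only a minimal sketch of that idea, not a replacement for the function above: it assumes both inputs have the same columns and that the data frame columns are nested just one level deep. The names is_df_col, df_cols and regular_cols are made up for this illustration.

```r
# Data from the question
x <- data.frame(a = 1)
x$b <- data.frame(z = 2)
y <- data.frame(a = 3)
y$b <- data.frame(z = 4)

# Step 1: find which columns are themselves data frames
is_df_col    <- vapply(x, is.data.frame, logical(1))
df_cols      <- names(x)[is_df_col]   # "b"
regular_cols <- names(x)[!is_df_col]  # "a"

# Step 2: bind only the regular columns
res <- rbind(x[regular_cols], y[regular_cols])

# Step 3: bind each data frame column on its own and attach it to the result
for (col in df_cols) {
  res[[col]] <- rbind(x[[col]], y[[col]])
}

str(res)
```

rbind_fixed() does the same thing for any number of inputs, and calls itself in step 3 so that deeper nesting is handled too.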

dplyr

bind_rows_fixed <- function(...){
  # use list2() so we can use `!!!`, as we lose the "autosplice" feature of bind_rows
  dfs <- rlang::list2(...)
  # get all names of data.frame columns
  get_df_col_ind <- function(df) sapply(df, is.data.frame)
  df_col_names_list <- lapply(dfs, function(df) names(df[get_df_col_ind(df)]))
  df_col_names <- unique(do.call(c,df_col_names_list))
  # fail if these are not consistently data frames in all arguments
  for(df_col_name in df_col_names) {
    for(df in dfs){
      if(!is.null(df[[df_col_name]]) && !is.data.frame(df[[df_col_name]]))
        stop(df_col_name, " is not consistently a data frame column")
    }
  }
  # bind data frames, except for data frame columns
  dfs_regular <- lapply(dfs, function(df) df[setdiff(names(df), df_col_names)])
  res <- dplyr::bind_rows(dfs_regular)
  # bind data frame columns separately and add them to the result
  for(df_col_name in df_col_names) {
    subdfs <- lapply(dfs, function(df) {
      if(df_col_name %in% names(df)) df[[df_col_name]] else
        tibble::tibble(.rows = nrow(df))
    })

    # recursive to be robust in case of deep nested data frames 
    res[[df_col_name]] <- bind_rows_fixed(!!!subdfs)
  }
  res
}
bind_rows_fixed(x,y)
#>   a z
#> 1 1 2
#> 2 3 4

data.table

rbindlist_fixed <- function(l){
  dfs <- l
  # get all names of data.frame columns
  get_df_col_ind <- function(df) sapply(df, is.data.frame)
  df_col_names_list <- lapply(dfs, function(df) names(df[get_df_col_ind(df)]))
  df_col_names <- unique(do.call(c,df_col_names_list))
  # fail if these are not consistently data frames in all arguments
  for(df_col_name in df_col_names) {
    for(df in dfs){
      if(!is.null(df[[df_col_name]]) && !is.data.frame(df[[df_col_name]]))
        stop(df_col_name, " is not consistently a data frame column")
    }
  }
  # bind data frames, except for data frame columns
  dfs_regular <- lapply(dfs, function(df) df[setdiff(names(df), df_col_names)])
  res <- data.table::rbindlist(dfs_regular)
  # bind data frame columns separately and add them to the result
  for(df_col_name in df_col_names) {
    subdfs <- lapply(dfs, function(df) {
      if(df_col_name %in% names(df)) df[[df_col_name]] else
        data.frame(row.names = seq.int(nrow(df)))
    })
    # recursive to be robust in case of deep nested data frames 
    res[[df_col_name]] <- rbindlist_fixed(subdfs)
  }
  res
}
dt <- rbindlist_fixed(list(x,y))
dt
#>    a              b
#> 1: 1 <multi-column>
#> 2: 3 <multi-column>
str(dt)
#> Classes 'data.table' and 'data.frame':   2 obs. of  2 variables:
#>  $ a: num  1 3
#>  $ b:Classes 'data.table' and 'data.frame':  2 obs. of  1 variable:
#>   ..$ z: num  2 4
#>   ..- attr(*, ".internal.selfref")=<externalptr> 
#>  - attr(*, ".internal.selfref")=<externalptr>

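【Comments】:

  • I have multiple time series; each one contains 30 years of data in long format. In total I have 965 time series (30 years each), i.e. 28,950 observations or rows (965 * 30). I want to write a loop that goes through every time series and splits it (30 years of data) into a training set (21 years) and a test set (the remaining 9 years), and I want that loop to run the same way across all 965 series. How can I write a sensible for loop or vectorised function? Thanks @Moo
  • Hi @Stackuser, this looks like a question worth posting on its own!
【Solution 2】: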

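The problem seems to be that the bind functions have trouble with the row names of the data frame b inside x/y. In base R we can avoid this by renaming the rows (see below).

Important note: dplyr can now handle this example. A workaround is no longer needed.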

# Setup
x <- data.frame(a=1)
x$b <- data.frame(z=2)
y <- data.frame(a=3)
y$b <- data.frame(z=4)

rbind(x, y) # still does not work
#> Warning: non-unique value when setting 'row.names': '1'
#> Error in `.rowNamesDF<-`(x, value = value): duplicate 'row.names' are not allowed
require(dplyr)
dplyr::bind_rows(x,y) # works!!!
#>   a z
#> 1 1 2
#> 2 3 4


# Avoid conflicting row names
row.names(x)   <- seq(nrow(y)+1, nrow(y)+nrow(x))
row.names(x$b) <- seq(nrow(y)+1, nrow(y)+nrow(x))

rbind(x, y) #works now, too
#>   a z
#> 2 1 2
#> 1 3 4

Created on 2020-06-27 by the reprex package (v0.3.0)

【Comments】:
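
The error message in the question is the one base R raises whenever a data frame is given duplicate row names. The sketch below shows only that rule and the offset used in the workaround above. It does not trace what rbind() does internally, and the names df, msg and new_names are made up for this illustration.

```r
# A data frame refuses duplicate row names; capture the error message
df <- data.frame(z = c(2, 4))
msg <- tryCatch(
  suppressWarnings({
    row.names(df) <- c("1", "1")
    "no error"
  }),
  error = function(e) conditionMessage(e)
)
msg

# Offset the row names of x so they start after those of y,
# using the same seq() call as in the workaround above
x <- data.frame(a = 1)
y <- data.frame(a = 3)
new_names <- seq(nrow(y) + 1, nrow(y) + nrow(x))
row.names(x) <- new_names

row.names(x)
row.names(y)
```

After the offset, x and y no longer share a row name, which is what the workaround needs for both the outer data frame and the nested b.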

    【Solution 3】:

    Adding a new answer for clarity: we can expect bind_rows() to support data frame columns in the future, but in the meantime we can use vctrs::vec_rbind(), as suggested by Romain François in https://github.com/tidyverse/dplyr/issues/4226.

    x <- data.frame(a=1)
    x$b <- data.frame(z=2)
    y <- data.frame(a=3)
    y$b <- data.frame(z=4)
    
    res <- vctrs::vec_rbind(x,y)
    
    res
    #>   a z
    #> 1 1 2
    #> 2 3 4
    
    str(res)
    #> 'data.frame':    2 obs. of  2 variables:
    #>  $ a: num  1 3
    #>  $ b:'data.frame':   2 obs. of  1 variable:
    #>   ..$ z: num  2 4
    

    Created on 2020-01-06 by the reprex package (v0.3.0)

    【Comments】:
