【Question title】: Fastest way to select multiple elements from a list to build a dataframe
【Posted】: 2023-04-03 12:59:01
【Question body】:

I have a list containing multiple data.frames. I want to select every nth data.frame from the list and combine them into a single data.frame that can be written to CSV.

Here is an example of the list structure:

one.title <- data.frame(id = '1a', title = 'first title')

one.author <- data.frame(first_name = c('Susan', 'Alice'),
                     last_name  = c('Smith', 'Johnson') )

second.title <- data.frame(id = '2b', title = 'second_title')

second.author <- data.frame(first_name = c('Sarah', 'Mary'),
                        last_name  = c('Davis', 'Proctor') )

one.list <- list()

one.list[[1]]$title <- one.title
one.list[[1]]$author <- one.author
one.list[[2]]$title <- second.title
one.list[[2]]$author <- second.author

This is my current solution for producing a single data frame for the "author" field:

build_author_table <- function(result.l){

  # Return the author data.frame of the i-th record
  list_to_df <- function(i){
    x <- result.l[[i]]$author
    return(x)
  }

  # One author data.frame per record
  authors_df_l <- lapply(1:length(result.l), FUN = list_to_df)

  # Stack all author data.frames row-wise
  authors_df <- do.call("rbind", lapply(authors_df_l, as.data.frame))

  return(authors_df)
}

This produces the output I want:

    first_name last_name
1      Susan     Smith
2      Alice   Johnson
3      Sarah     Davis
4       Mary   Proctor

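But as you might imagine, it becomes very slow when scaled up to thousands of records whose data.frames contain much larger text fields.

Can anyone suggest a faster, more efficient way to build the final data.frame?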

【Comments】:

    Tags: r performance


    【Solution 1】:

    Your construction code does not run as posted, but I built a list that I think is similar to what you are shooting for.

    List of 2
     $ :List of 2
      ..$ title :'data.frame':  1 obs. of  2 variables:
      .. ..$ id   : Factor w/ 1 level "1a": 1
      .. ..$ title: Factor w/ 1 level "first title": 1
      ..$ author:'data.frame':  2 obs. of  2 variables:
      .. ..$ first_name: Factor w/ 2 levels "Alice","Susan": 2 1
      .. ..$ last_name : Factor w/ 2 levels "Johnson","Smith": 2 1
     $ :List of 2
      ..$ title :'data.frame':  1 obs. of  2 variables:
      .. ..$ id   : Factor w/ 1 level "2b": 1
      .. ..$ title: Factor w/ 1 level "second_title": 1
      ..$ author:'data.frame':  2 obs. of  2 variables:
      .. ..$ first_name: Factor w/ 2 levels "Mary","Sarah": 2 1
      .. ..$ last_name : Factor w/ 2 levels "Davis","Proctor": 1 2
    

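    If this is what you had in mind, then the code below works well. You do get warnings because the strings are factors. These can be ignored, or avoided by passing stringsAsFactors = FALSE when building the initial data frames.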

    library(purrr) 
    map_dfr(one.list, "author")
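
The point about factors can be checked in base R. The following is a minimal sketch, assuming a simplified version of the list that holds only the `author` element of each record (the `title` element is left out for brevity); it is not the answerer's code. From R 4.0.0 onward `stringsAsFactors = FALSE` is already the default, so the explicit argument only matters on older versions.

```r
# stringsAsFactors = FALSE keeps the text columns as character vectors
# (this is already the default from R 4.0.0 onward).
one.author <- data.frame(first_name = c("Susan", "Alice"),
                         last_name  = c("Smith", "Johnson"),
                         stringsAsFactors = FALSE)
second.author <- data.frame(first_name = c("Sarah", "Mary"),
                            last_name  = c("Davis", "Proctor"),
                            stringsAsFactors = FALSE)

# Simplified list for illustration: each record holds only its author table.
one.list <- list(list(author = one.author),
                 list(author = second.author))

# Pull out each "author" element, then stack the pieces row-wise.
authors_df <- do.call(rbind, lapply(one.list, `[[`, "author"))

# Inspect the column types; both columns should be character.
str(authors_df)
```

With character columns there are no differing factor levels to reconcile when the pieces are combined, which is what the warnings are about.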
    

    【Comments】:

      【Solution 2】:

      Here is a faster solution (see the benchmark below):

      data.table::rbindlist(lapply(one.list, "[[", "author"))
      

      The purrr solution is elegant, but not as fast. Benchmark results:

      library(microbenchmark)
      library(purrr)
      microbenchmark(build_author_table(one.list),
          data.table::rbindlist(lapply(one.list, "[[", "author")),
          map_dfr(one.list, "author"))
      
      Unit: microseconds
                                                          expr     min       lq      mean   median       uq        max neval cld
                                  build_author_table(one.list) 170.693 190.9460  239.2987 206.4505 272.3815    494.477   100   a
       data.table::rbindlist(lapply(one.list, "[[", "author"))  69.562  88.5590  270.4926  99.1750 152.6735  15068.116   100   a
                                   map_dfr(one.list, "author") 214.832 245.2825 2374.5980 281.3210 340.1270 206562.846   100   a
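
The benchmark above runs on a two-element list, so the timings mostly reflect fixed call overhead rather than the scaling the question asks about. The sketch below shows one way to repeat the comparison on a larger list. The helper `make_record`, the list `big.list`, the record count `n`, and all field values are made up for illustration; the `data.table` timing runs only if that package is installed. Treat any timings it prints as specific to your machine.

```r
# Hypothetical helper: one record shaped like the list items in the question.
make_record <- function(i) {
  list(
    title  = data.frame(id = paste0("id", i), title = paste("title", i),
                        stringsAsFactors = FALSE),
    author = data.frame(first_name = c("A", "B"), last_name = c("X", "Y"),
                        stringsAsFactors = FALSE)
  )
}

n <- 2000  # illustrative size; increase it to stress the methods further
big.list <- lapply(seq_len(n), make_record)

# Extract the author tables once so that only the combining step is timed.
authors_l <- lapply(big.list, `[[`, "author")

# Base R: stack all pieces with rbind.
base_time <- system.time(base_df <- do.call(rbind, authors_l))
cat("do.call(rbind):", base_time[["elapsed"]], "seconds\n")

# data.table: only timed when the package is available.
if (requireNamespace("data.table", quietly = TRUE)) {
  dt_time <- system.time(dt_df <- data.table::rbindlist(authors_l))
  cat("rbindlist:     ", dt_time[["elapsed"]], "seconds\n")
}
```

Timing the combine step separately from the extraction step makes it easier to see which part dominates as the list grows.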
      

      【Comments】:

        【Solution 3】:

        Try this:

        
        
        one.title <- data.frame(id = '1a', title = 'first title')
        
        one.author <- data.frame(first_name = c('Susan', 'Alice'),
                                 last_name  = c('Smith', 'Johnson') )
        
        second.title <- data.frame(id = '2b', title = 'second_title')
        
        second.author <- data.frame(first_name = c('Sarah', 'Mary'),
                                    last_name  = c('Davis', 'Proctor') )
        
        one.list <- list(
          list(title = one.title, author =  one.author),
          list(title = second.title, author =  second.author)
        )
        
        
        
        authors_df_l = lapply(one.list, function(item) item$author)
        
        do.call("rbind",authors_df_l)
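
The question's end goal is a CSV file. The sketch below covers that last step in base R, assuming the combined result has been stored in a variable named `authors_df` (the code above prints the result without assigning it). The output path is a temporary file chosen for illustration only.

```r
# Stand-in for the combined result, using the values from the question.
authors_df <- data.frame(first_name = c("Susan", "Alice", "Sarah", "Mary"),
                         last_name  = c("Smith", "Johnson", "Davis", "Proctor"),
                         stringsAsFactors = FALSE)

# Hypothetical output location; a temporary file keeps the example side-effect free.
out_file <- tempfile(fileext = ".csv")

# row.names = FALSE leaves the automatic row names out of the file.
write.csv(authors_df, out_file, row.names = FALSE)

# Read the file back to confirm the round trip.
check_df <- read.csv(out_file, stringsAsFactors = FALSE)
```

For very large tables, `data.table::fwrite` is a commonly used faster alternative to `write.csv`, if the `data.table` package is already in use.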
        

        【Comments】:
