【Question title】: Converting a list of data frames: not a simple rbind, second row to new columns
【Posted】: 2018-06-12 01:07:44
【Question】:

I have a list:

employeesList = list(data.frame(first = "Al", second = "Jones"),
                     data.frame(first = c("Al", "Barb"), second = c("Jones", "Smith")),
                     data.frame(first = c("Al", "Barb", "Carol"), second = c("Jones", "Smith", "Adams")),
                     data.frame(first = "Al", second = "Jones"))

I am looking to produce this:

employeesDF = data.frame(first = c("Al", "Al", "Al", "Al"), second = c("Jones", "Jones", "Jones", "Jones"),
                        first2 = c(NA, "Barb", "Barb", NA), second2 = c(NA, "Smith", "Smith", NA),
                        first3 = c(NA, NA, "Carol", NA), second3 = c(NA, NA, "Adams", NA))

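I would like each data frame to become one row of the resulting data frame. Note that the first data frame, once converted, will have two columns, the second will have four, the third will have six, the fourth will have two, and so on. I realize that the missing values must be filled with NA. I have done some research, and if the goal were just an rbind, the problem would already be solved; I do not see a solution to my actual problem. See "Convert a list of data frames into one data frame".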

After reading "Combine two data frames by rows (rbind) when they have different sets of columns", I started with

res1 = cbind(t(employeesList[[1]][1]), t(employeesList[[1]][2]))
res2 = cbind(t(employeesList[[2]][1]), t(employeesList[[2]][2]))
res3 = cbind(t(employeesList[[3]][1]), t(employeesList[[3]][2]))
res4 = cbind(t(employeesList[[4]][1]), t(employeesList[[4]][2]))

and then

library(dplyr)
bind_rows(list(res1, res2, res3, res4))

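But I may have a large number of data frames: res1, ..., resn. The number of columns is not specified in advance, but will probably be fewer than 10. My procedure does not name the columns, which I believe bind_rows requires.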

【Question comments】:

    Tags: r dplyr bind rows


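    【Solution 1】:

    We can use lapply to convert each data frame in the list into a single-row data frame, and then use bind_rows to bind those single-row data frames together.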

    library(dplyr)
    bind_rows(lapply(employeesList, function(x) rbind.data.frame(c(t(x)))))
    
    #   X.Al. X.Jones. X.Barb. X.Smith. X.Carol. X.Adams.
    #1    Al    Jones    <NA>     <NA>     <NA>     <NA>
    #2    Al    Jones    Barb    Smith     <NA>     <NA>
    #3    Al    Jones    Barb    Smith    Carol    Adams
    #4    Al    Jones    <NA>     <NA>     <NA>     <NA>
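
The key step in the one-liner above is c(t(x)). As a minimal sketch in base R (the two-row data frame x below is a hypothetical example, not part of the original answer): t() turns the data frame into a character matrix with one column per original row, and c() reads that matrix column by column, which yields the values row by row.

```r
# Hypothetical two-row example, for illustration only
x <- data.frame(first  = c("Al", "Barb"),
                second = c("Jones", "Smith"),
                stringsAsFactors = FALSE)

# t() coerces the data frame to a matrix and transposes it:
# rows are "first"/"second", and there is one column per person
m <- t(x)
dim(m)

# c() drops the matrix dimensions and reads in column-major order,
# so each person's (first, second) pair stays together
flat <- c(m)
flat
# [1] "Al"    "Jones" "Barb"  "Smith"
```

Wrapping this vector in rbind.data.frame() then gives a one-row data frame, so data frames with more rows simply become rows with more columns.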
    

    We can rename the columns afterwards as we like using setNames.
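
The renaming step could look roughly like the sketch below. The data frame res is a hypothetical stand-in for the bound result (its V1 to V6 names are placeholders), and the target names follow the first, second, first2, second2, ... pattern requested in the question.

```r
# Hypothetical stand-in for the bound result; column names are placeholders
res <- data.frame(
  V1 = c("Al", "Al", "Al", "Al"),
  V2 = c("Jones", "Jones", "Jones", "Jones"),
  V3 = c(NA, "Barb", "Barb", NA),
  V4 = c(NA, "Smith", "Smith", NA),
  V5 = c(NA, NA, "Carol", NA),
  V6 = c(NA, NA, "Adams", NA),
  stringsAsFactors = FALSE
)

# Number of (first, second) pairs held in each row
k <- ncol(res) / 2

# No suffix for the first pair, then "2", "3", ... for the later pairs
suffix <- rep(c("", seq_len(k)[-1]), each = 2)
new_names <- paste0(c("first", "second"), suffix)

res <- setNames(res, new_names)
names(res)
# [1] "first"   "second"  "first2"  "second2" "first3"  "second3"
```

Because the names are generated from the number of columns, the same lines work unchanged if the longest data frame in the list has more than three rows.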

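    【Comments】:

    • I decided to time the three approaches, and this one is clearly the winner. Using this link, I created a list of data frames with one, two, or three rows each, with 10,000 data frames of each kind. stackoverflow.com/questions/17499013/…
      system.time results, in seconds:
      Method 1 (dplyr):      user  44.94, system 0.00, elapsed  45.03
      Method 2 (data.table): user 225.36, system 4.68, elapsed 229.61
      Method 3 (tidyverse):  user 513.72, system 0.23, elapsed 519.52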
    【Solution 2】:

    Perhaps something like this, using data.table:

    library('data.table')
    
    rbindlist( l = lapply( employeesList, function(x) {
                      dcast( data    = melt( setDT( x ), measure.vars = c( 'first', 'second'))[, V1 := seq_along(value), by = variable][],
                             formula = " . ~ variable + V1")[, -1]
                    }), 
               fill = TRUE, 
               use.names = TRUE )
    
    #    first_1 second_1 first_2 second_2 first_3 second_3
    # 1:      Al    Jones      NA       NA      NA       NA
    # 2:      Al    Jones    Barb    Smith      NA       NA
    # 3:      Al    Jones    Barb    Smith   Carol    Adams
    # 4:      Al    Jones      NA       NA      NA       NA
    

    【Comments】:

      【Solution 3】:

      Here is an option using gather/spread:

      library(tidyverse)
      employeesList %>% 
         map_df(~ .x %>% 
                     mutate_all(as.character) %>% # convert columns to character class
                     mutate(n = row_number(), n = replace(n, n==1, "")),
                   .id = 'grp') %>% 
        group_by(grp) %>%
        gather(key, val, first:second) %>% # gather to long format
        arrange(grp, n) %>% 
        unite(keyn, key, n, sep="") %>% # unite columns to create new column
        ungroup %>% 
        mutate(keyn = factor(keyn, levels = unique(keyn))) %>% # for column order
        spread(keyn, val) %>% # spread to wide format
        select(-grp)
      # A tibble: 4 x 6
      #  first second first2 second2 first3 second3
      #  <chr> <chr>  <chr>  <chr>   <chr>  <chr>  
      #1 Al    Jones  NA     NA      NA     NA     
      #2 Al    Jones  Barb   Smith   NA     NA     
      #3 Al    Jones  Barb   Smith   Carol  Adams  
      #4 Al    Jones  NA     NA      NA     NA     
      

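      【Comments】:

        【Solution 4】:

        I made a mistake when submitting the example data: it was not general enough, in two ways. The column names may vary inconsistently, and the data may vary much more than the example indicated. I then asked my question on r-help, where it was answered in several ways. Below are the solutions that others created, together with my timing study.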

        # input data (list of data frames and data frames may have multiple rows)
        employees4List = list(data.frame(first1 = "Al", second1 =
                                           "Jones"),
                              data.frame(first2 = c("Al2", "Barb"),
                                         second2 = c("Jones", "Smith")),
                              data.frame(first3 = c("Al3", "Barbara",
                                                    "Carol"),
                                         second3 = c("Jones", "Smith",
                                                     "Adams")),
                              data.frame(first4 = ("Al"), second4 =
                                           "Jones2"))
        employees4List
        
        # intermediate step (list of data frames with each just one row)
        df1 = data.frame(First1 = "Al", Second1 = "Jones",
                         First2 = NA, Second2 = NA,
                         First3 = NA, Second3 = NA,
                         First4 = NA, Second4 = NA)
        df2 = data.frame(First1 = "Al2", Second1 = "Jones",
                         First2 = "Barb", Second2 = "Smith",
                         First3 = NA, Second3 = NA,
                         First4 = NA, Second4 = NA)
        df3 = data.frame(First1 = "Al3", Second1 = "Jones",
                         First2 = "Barbara", Second2 = "Smith",
                         First3 = "Carol", Second3 = "Adams",
                         First4 = NA, Second4 = NA)
        df4 = data.frame(First1 = "Al", Second1 = "Jones2",
                         First2 = NA, Second2 = NA,
                         First3 = NA, Second3 = NA,
                         First4 = NA, Second4 = NA)
        listFinal = list(df1, df2, df3, df4)
        listFinal
        
        # Expected final step, except that all columns should be character
        # Just one data frame
        dplyr::bind_rows(listFinal)
        sapply(dplyr::bind_rows(listFinal), class)
        
        # Solution 1 using base R by Sarah Goslee
        
        dfbycol <- function(x) {
          x <- lapply(x, function(y)as.vector(t(as.matrix(y))))
          x <- lapply(x, function(y){length(y) <- max(sapply(x, length)); y})
          x <- do.call(rbind, x)
          x <- data.frame(x, stringsAsFactors=FALSE)
          colnames(x) <- paste0(c("first", "last"), rep(seq(1, ncol(x)/2), each=2))
          x
        }
        
        dfbycol(listFinal)
        
        ##########
        # Solution 2 by Jeff Newmiller (Base R)
        
        myrename2 <- function( DF, m ) {
          # if a pair of columns is not present, raise an error
          stopifnot( 2 == length( DF ) )
          n <- nrow( DF )
          # use memory layout of elements of matrix
          # t() automatically converts to matrix (nrow=2)
          # matrix(,nrow=1) re-interprets the column-major output of t()
          # as a single row matrix
          result <- as.data.frame( matrix( t( DF ), nrow = 1 )
                                   , stringsAsFactors = FALSE
          )
          if ( n < m ) {
            result[ , seq( 2 * n + 1, 2 * m ) ] <- NA
          }
          setNames( result
                    , sprintf( "%s%d"
                               , c( "First", "Second" )
                               , rep( seq.int( m ), each = 2 )
                    )
          )
        }
        
        m <- max( unlist( lapply( employees4List, nrow ) ) )
        listFinal2 <- lapply( employees4List, myrename2, m = m )
        listFinal2
        
        result2 <- do.call( rbind, listFinal2 )
        result2
        
        ##########
        # Solution 3 by Jeff Newmiller (uses dplyr)
        myrename3 <- function( DF ) {
          # if a pair of columns is not present, raise an error
          stopifnot( 2 == length( DF ) )
          n <- nrow( DF )
          # use memory layout of elements of matrix
          # t() automatically converts to matrix (nrow=2)
          # matrix(,nrow=1) re-interprets the column-major output of t()
          # as a single row matrix
          setNames( as.data.frame( matrix( t( DF ), nrow = 1 )
                                   , stringsAsFactors = FALSE
          )
          , sprintf( "%s%d"
                     , c( "First", "Second" )
                     , rep( seq.int( n ), each = 2 )
          )
          )
        }
        
        listFinal3 <- lapply( employees4List, myrename3 )
        listFinal3
        result3 <- dplyr::bind_rows( listFinal3 )
        result3
        
        # Solution 4 by Jeff Newmiller (uses dplyr and tidyr)
        
        library(dplyr)
        library(tidyr)
        myrename4 <- function( DF ) {
          # if a pair of columns is not present, raise an error
          stopifnot( 2 == length( DF ) )
          names( DF ) <- c( "a", "b" )
          m <- nrow( DF )
          (  DF
            %>% mutate_all( as.character )
            %>% mutate( rw = LETTERS[ seq.int( n() ) ] )
            %>% gather( col, val, -rw )
            %>% tidyr::unite( "labels", rw, col, sep="" )
            %>% spread( labels, val )
            %>% setNames( sprintf( "%s%d"
                                   , c( "First", "Second" )
                                   , rep( seq.int( m ), each = 2 )
            )
            )
          )
        }
        
        listFinal4 <- lapply( employees4List, myrename4 )
        listFinal4
        result4 <- dplyr::bind_rows(listFinal4)
        result4
        
        #####
        # Timing
        # Create a large dataset
        firsts = c("Al", "Barb", "Carol")
        seconds = c("Washington", "Adams", "Jefferson" )
        numReplications = 10000
        set.seed(2018)
        
        # Create data frames
        sim_list1 = replicate(n = numReplications,
                              expr = {data.frame(first = base::sample(x = firsts, size = 1, replace = TRUE),
                                                 second = base::sample(x = seconds, size = 1, replace = TRUE))},
                              simplify = FALSE)
        
        sim_list2 = replicate(n = numReplications,
                              expr = {data.frame(first = base::sample(x = firsts, size = 2, replace = TRUE),
                                                 second = base::sample(x = seconds, size = 2, replace = TRUE))},
                              simplify = FALSE)
        
        sim_list3 = replicate(n = numReplications,
                              expr = {data.frame(first = base::sample(x = firsts, size = 3, replace = TRUE),
                                                 second = base::sample(x = seconds, size = 3, replace = TRUE))},
                              simplify = FALSE)
        
        # Create list
        employeesList = c(sim_list1, sim_list2, sim_list3)
        
        # Method 1
        
        system.time(res1 <- dfbycol(employeesList))
        # > system.time(dfbycol(employeesList))
        # user  system elapsed 
        # 757.87    0.18  758.62 
        # res1
        rm(res1)
        
        #####
        # Method 2
        
        system.time(m <- max( unlist( lapply( employeesList, nrow ) ) ))
        #    user  system elapsed 
        #    0.22    0.00    0.22
        
        system.time(listFinal2 <- lapply( employeesList, myrename2, m = m ) )
        # user  system elapsed 
        # 16.16    0.01   16.18 
        # listFinal2
        
        system.time(result2 <- do.call( rbind, listFinal2 ) )
        # result2
        # user  system elapsed 
        # 3.96    0.00    3.96
        rm(listFinal2)
        rm(result2)
        
        #####
        # Method 3
        
        system.time(listFinal3 <- lapply( employeesList, myrename3))
        # user  system elapsed 
        # 7.33    0.00    7.33
        # listFinal3
        system.time(result3 <- dplyr::bind_rows( listFinal3 ))
        # user  system elapsed 
        # 0.17    0.00    0.17
        rm(listFinal3)
        rm(result3)
        
        #####
        # Method 4
        
        system.time(listFinal4 <- lapply( employeesList, myrename4) )
        # user  system elapsed 
        # 400.05    0.04  400.24 
        # listFinal4
        system.time(result4 <- dplyr::bind_rows( listFinal4 ) )
        #    user  system elapsed 
        #   0.17    0.00    0.17 
        # result4
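
Method 1 took about 758 seconds elapsed in the timing above. One plausible contributor is that dfbycol evaluates max(sapply(x, length)) inside the second lapply, so the maximum over the whole list is recomputed for every element. The sketch below is a variant that computes it once; dfbycol_hoisted is a hypothetical name and this version is not from the r-help thread, nor has it been timed here. It keeps the "first"/"last" column labels of the original function.

```r
# Variant of dfbycol() that computes the target length a single time.
# dfbycol_hoisted is a hypothetical name, not part of the original solutions.
dfbycol_hoisted <- function(x) {
  # Flatten each data frame row by row into a character vector
  flat <- lapply(x, function(y) as.vector(t(as.matrix(y))))
  # Length of the longest flattened vector, computed once for the whole list
  len <- max(lengths(flat))
  # Extending the length of a vector pads it with NA
  flat <- lapply(flat, function(y) { length(y) <- len; y })
  out <- data.frame(do.call(rbind, flat), stringsAsFactors = FALSE)
  colnames(out) <- paste0(c("first", "last"),
                          rep(seq_len(ncol(out) / 2), each = 2))
  out
}

# Small check on a hypothetical three-element list
small <- list(
  data.frame(first = "Al", second = "Jones", stringsAsFactors = FALSE),
  data.frame(first = c("Al", "Barb"), second = c("Jones", "Smith"),
             stringsAsFactors = FALSE),
  data.frame(first = "Carol", second = "Adams", stringsAsFactors = FALSE)
)
res <- dfbycol_hoisted(small)
dim(res)
# [1] 3 4
```

Running system.time(dfbycol_hoisted(employeesList)) on the simulated list would show whether the recomputed maximum actually accounts for the gap to Methods 2 and 3.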
        

        【Comments】:
