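【Question title】: Expand large data frame in R efficiently
【Posted】: 2016-08-08 22:13:07
【Question】:

I'm looking for a way to expand a large data frame in R into more columns and more rows, based on the values of a given column.

Right now I do this with a for loop, but I'm sure there are fancier/more efficient ways to get the same result...

I think an example will make the problem clearer. Suppose we have a data frame with information about students' grades at three different stages of their lives. The student IDs are s1, s2 and s3; we measured their grades at three different periods, m1, m2 and m3; and for each period there is a column called more.info that holds their grades in all the classes they took, encoded as class#topic#grade.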

library(stringr)
options(stringsAsFactors=FALSE)
example.df = data.frame(measure.id = c("m1", "m2", "m3", "m2", "m2", "m3", "m1", "m1", "m3"),
                        student.id = c("s1", "s1", "s1", "s2", "s3", "s3", "s2", "s3", "s2"),
                        more.info = c("draw#drawing#4.5;music#singing#5.6;dance#ballet#6.7", "bio#biology#5.6;math#algebra#4.5", "calculus#univariate#6.2; physics#quantum#4.5;chemistry#organic#4.5", 
                                      "bio#biology#5.6;math#algebra#4.5", "bio#biology#3.6;math#algebra#3.5", "calculus#univariate#5.2; physics#quantum#5.2;chemistry#organic#4", "draw#drawing#5;music#singing#5.6;dance#ballet#5.7", 
                                      "draw#drawing#2.5;music#singing#3.6;dance#ballet#4", "calculus#univariate#5.2; physics#quantum#6.5;chemistry#organic#5"))
measure.ids = unique(example.df$measure.id)

I then want to build a new data frame that splits the more.info values, giving more rows and more columns, as follows:

new.df=data.frame()
splitit <- function(x){
  strsplit(x, '#')
}
for(i in 1:length(measure.ids)){
  measure.id = measure.ids[i]
  tmp = example.df[example.df$measure.id == measure.id, ]
  more.info = tmp$more.info
  more.info = strsplit(more.info,";")
  student.ids = tmp$student.id
  for(j in 1:length(more.info))
  {
    info = more.info[[j]]
    a = sapply(info, splitit)
    b = sapply(a, "[[", 1)
    d = sapply(a, "[[", 2)
    e = sapply(a, "[[", 3)
    new.df = rbind(new.df, 
                   data.frame(measure.id = rep(measure.id, length(info)),
                              student.id = rep(tmp$student.id[j], length(info)),
                              class = b, 
                              topic = d,
                              grade = e)
                   )
  }
}

What is the most efficient way to do this in R? I'm open to apply functions, map/reduce approaches, mclapply to use more cores, etc...

【Question comments】:

    Tags: r dataframe data.table


    【Solution 1】:

    A solution using base R functions:

    # split column by all available separators 
    a <- strsplit(example.df$more.info, "; |#|;")
    # represent each result as a matrix with 3 columns
    a <- lapply(a, function(v) matrix(v, ncol=3, byrow=TRUE))
    # combine all matrixes in one big matrix
    aa <- do.call(rbind, a)
    # create indices of rows of initial data.frame which corresponds to the created big matrix
    b <- unlist(sapply(seq_along(a), function(i) rep(i, nrow(a[[i]]))))
    # combine initial data.frame and created big matrix
    df <- cbind(example.df[b,], aa)
    # remove unnecessary columns and rename remaining ones
    df <- df[,-3]
    colnames(df)[3:5] <- c("class", "topic", "grade")
    

    To improve speed, you can replace the apply-family functions in my code with mclapply.

    I could not compare speeds because your dataset is very small.
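
One way to get a rough timing despite the small example is to repeat the rows many times and time the base R steps on the enlarged input. The sketch below is only illustrative: `expand_base`, `small.df` and `big.df` are hypothetical names that do not appear in the question, the input is made up in the same class#topic#grade format, and the numbers will vary by machine.

```r
# Hypothetical helper wrapping the base R steps above into one function.
expand_base <- function(df) {
  # split on "; ", "#" or ";" so that each row yields class, topic, grade triples
  parts <- strsplit(df$more.info, "; |#|;")
  mats  <- lapply(parts, function(v) matrix(v, ncol = 3, byrow = TRUE))
  # index of the source row for every row of the stacked matrix
  idx   <- rep(seq_along(mats), vapply(mats, nrow, integer(1)))
  out   <- cbind(df[idx, c("measure.id", "student.id")], do.call(rbind, mats))
  colnames(out)[3:5] <- c("class", "topic", "grade")
  rownames(out) <- NULL
  out
}

# Small made-up input in the same format as example.df.
small.df <- data.frame(
  measure.id = c("m1", "m2"),
  student.id = c("s1", "s1"),
  more.info  = c("draw#drawing#4.5;music#singing#5.6",
                 "bio#biology#5.6; math#algebra#4.5"),
  stringsAsFactors = FALSE
)

# Repeat the rows to get a larger test case (20,000 input rows here).
big.df <- small.df[rep(seq_len(nrow(small.df)), times = 10000), ]

timing <- system.time(big.out <- expand_base(big.df))
print(timing)
```

Timing the question's nested loop on the same `big.df` would give a like-for-like comparison.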

    【Comments】:

      【Solution 2】:

      Here is another approach, using data.table.

      Basically, I put the whole data transformation in a single line.

      # Load R package
      library(data.table)    
      
      # Convert to data.table object
      example.dt <- as.data.table(example.df)
      
      # Transform the data
      final.dt <- example.dt[, data.table(do.call(rbind, unlist(lapply(strsplit(x=more.info, split=";"), strsplit, "#"), recursive=FALSE))), by=c("measure.id", "student.id")]
      
      # Rename variables
      setnames(final.dt, old=c("V1", "V2", "V3"), new=c("class", "topic", "grade"))
      
      
      # > final.dt
      #     measure.id student.id     class      topic grade
      #  1:         m1         s1      draw    drawing   4.5
      #  2:         m1         s1     music    singing   5.6
      #  3:         m1         s1     dance     ballet   6.7
      #  4:         m2         s1       bio    biology   5.6
      #  5:         m2         s1      math    algebra   4.5
      #  6:         m3         s1  calculus univariate   6.2
      #  7:         m3         s1   physics    quantum   4.5
      #  8:         m3         s1 chemistry    organic   4.5
      #  9:         m2         s2       bio    biology   5.6
      # 10:         m2         s2      math    algebra   4.5
      # 11:         m2         s3       bio    biology   3.6
      # 12:         m2         s3      math    algebra   3.5
      # 13:         m3         s3  calculus univariate   5.2
      # 14:         m3         s3   physics    quantum   5.2
      # 15:         m3         s3 chemistry    organic     4
      # 16:         m1         s2      draw    drawing     5
      # 17:         m1         s2     music    singing   5.6
      # 18:         m1         s2     dance     ballet   5.7
      # 19:         m1         s3      draw    drawing   2.5
      # 20:         m1         s3     music    singing   3.6
      # 21:         m1         s3     dance     ballet     4
      # 22:         m3         s2  calculus univariate   5.2
      # 23:         m3         s2   physics    quantum   6.5
      # 24:         m3         s2 chemistry    organic     5
      #     measure.id student.id     class      topic grade
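
As a variation on the same idea, the split can also be written with data.table's `tstrsplit`, which returns one character vector per field. This is a sketch under assumptions rather than the answerer's method: the input `dt` is made up, and trimming whitespace and converting grade to numeric are additions that the answer above does not perform.

```r
library(data.table)

# Made-up input in the same class#topic#grade format as example.df.
dt <- data.table(
  measure.id = c("m1", "m2"),
  student.id = c("s1", "s1"),
  more.info  = c("draw#drawing#4.5;music#singing#5.6",
                 "bio#biology#5.6; math#algebra#4")
)

# Per group: split into entries on ";", trim stray spaces, then split each
# entry on "#" and transpose so that every field becomes its own column.
long.dt <- dt[, tstrsplit(trimws(unlist(strsplit(more.info, ";", fixed = TRUE))),
                          "#", fixed = TRUE),
              by = c("measure.id", "student.id")]

setnames(long.dt, c("V1", "V2", "V3"), c("class", "topic", "grade"))

# The split fields are character; convert grade explicitly.
long.dt[, grade := as.numeric(grade)]
print(long.dt)
```

Because every group returns plain character vectors, the columns have the same type in all groups before they are stacked.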
      

      【Comments】:

        【Solution 3】:

        This answer uses a few approaches that may help speed up your code (e.g. mclapply and the data.table package).

        require("data.table")
        require("parallel")
        require("plyr")
        
        #Note the mclapply function. On Mac or Linux it can run in parallel,
        #but only when mc.cores is set above 1; with mc.cores=1 (as here) it runs serially.
        list.of.dfs <- mclapply(strsplit(example.df$more.info, "; |#|;"),FUN=function(x) as.data.frame(t(x)),mc.cores=1)
        combined.df <- rbind.fill(list.of.dfs)
        
        
        #Use data.table for speed and efficiency.
        #example.df <- data.table(cbind(example.df,combined.df))
        example.df <- data.table(example.df)
        example.df[,paste0(c("class","topic","grade"),
                     c(rep(1,3),rep(2,3),rep(3,3))):=lapply(combined.df,I)]
        
        #delete unnecessary column
        example.df[,more.info:=NULL]
        
        
        #rbindlist final table (efficient way to rbind)
        table1 <- example.df[,list(measure.id,student.id,class=class1,topic=topic1,grade=grade1)]
        table2 <- example.df[,list(measure.id,student.id,class=class2,topic=topic2,grade=grade2)]
        table3 <- example.df[,list(measure.id,student.id,class=class3,topic=topic3,grade=grade3)]
        
        #final results
        final.table <- rbindlist(list(table1,table2,table3))[!is.na(class)]
        final.table
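
The `mc.cores` argument decides whether `mclapply` actually runs in parallel. The sketch below shows one possible way to choose it per platform; `info` and `n.cores` are illustrative names, the input strings are made up, and leaving one core free is just a common habit, not a recommendation from this answer.

```r
library(parallel)

# Made-up input strings in the class#topic#grade format.
info <- c("draw#drawing#4.5;music#singing#5.6",
          "bio#biology#5.6; math#algebra#4.5")

# mclapply relies on forking, which is not available on Windows;
# there it only accepts mc.cores = 1 and behaves like lapply.
cores <- detectCores()
if (is.na(cores)) cores <- 1L
n.cores <- if (.Platform$OS.type == "windows") 1L else max(1L, cores - 1L)

# Split each string and shape the pieces into a 3-column matrix, in parallel.
pieces <- mclapply(strsplit(info, "; |#|;"),
                   function(v) matrix(v, ncol = 3, byrow = TRUE),
                   mc.cores = n.cores)

result <- do.call(rbind, pieces)
colnames(result) <- c("class", "topic", "grade")
print(result)
```

For inputs this small the forking overhead will likely outweigh any gain, so parallelism is only worth testing on large data.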
        

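        【Comments】:

          【Solution 4】:

          Perhaps you could try two functions I wrote, concat.split.DT and cSplit. Both are currently available as GitHub Gists and can be loaded easily with the "devtools" package.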

          library(devtools)
          source_gist(6873058)  # for concat.split.DT
          source_gist(11380733) # for cSplit
          
          concat.split.DT(cSplit(example.df, splitCols="more.info", sep=";", direction="long"), 
                          splitcols="more.info", sep="#")
          #     measure.id student.id more.info_1 more.info_2 more.info_3
          #  1:         m1         s1        draw     drawing         4.5
          #  2:         m1         s1       music     singing         5.6
          #  3:         m1         s1       dance      ballet         6.7
          #  4:         m2         s1         bio     biology         5.6
          #  5:         m2         s1        math     algebra         4.5
          #  6:         m3         s1    calculus  univariate         6.2
          #  7:         m3         s1     physics     quantum         4.5
          #  8:         m3         s1   chemistry     organic         4.5
          #  9:         m2         s2         bio     biology         5.6
          # 10:         m2         s2        math     algebra         4.5
          # 11:         m2         s3         bio     biology         3.6
          # 12:         m2         s3        math     algebra         3.5
          # 13:         m3         s3    calculus  univariate         5.2
          # 14:         m3         s3     physics     quantum         5.2
          # 15:         m3         s3   chemistry     organic         4.0
          # 16:         m1         s2        draw     drawing         5.0
          # 17:         m1         s2       music     singing         5.6
          # 18:         m1         s2       dance      ballet         5.7
          # 19:         m1         s3        draw     drawing         2.5
          # 20:         m1         s3       music     singing         3.6
          # 21:         m1         s3       dance      ballet         4.0
          # 22:         m3         s2    calculus  univariate         5.2
          # 23:         m3         s2     physics     quantum         6.5
          # 24:         m3         s2   chemistry     organic         5.0
          #     measure.id student.id more.info_1 more.info_2 more.info_3
          

          The column names can easily be changed later using setnames.
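
For instance, the renaming could look like the sketch below. The one-row table `split.dt` is a made-up stand-in for the result above, with the same column names as the printed output.

```r
library(data.table)

# Stand-in for the split result, with the column names shown in the output.
split.dt <- data.table(measure.id = "m1", student.id = "s1",
                       more.info_1 = "draw", more.info_2 = "drawing",
                       more.info_3 = 4.5)

# setnames renames by reference, so no copy of the table is made.
setnames(split.dt,
         old = paste0("more.info_", 1:3),
         new = c("class", "topic", "grade"))
print(names(split.dt))
```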

          【Comments】: