【问题标题】:R speed up the for loop using apply() or lapply() or etcR 使用 apply() 或 lapply() 等加速 for 循环
【发布时间】:2016-12-03 14:46:44
【问题描述】:

我编写了一个特殊的“impute”函数,它根据特定的列名将缺失 (NA) 值的列值替换为 mean() 或 mode()。

输入数据帧是 400,000+ 行并且它的速度很慢,我如何使用 lapply() 或 apply() 来加快插补部分。

这里是功能,标记部分我想用 START OPTIMIZE & END OPTIMIZE 优化:

specialImpute <- function(inputDF) 
{

  discoveredDf <- data.frame(STUDYID_SUBJID=character(), stringsAsFactors=FALSE)
  dfList <- list()
  counter = 1; 

  Whilecounter = nrow(inputDF)
  #for testing just do 10 iterations,i = 10;

  while (Whilecounter >0)
  {

    studyid_subjid=inputDF[Whilecounter,"STUDYID_SUBJID"]

    vect = which(discoveredDf$STUDYID_SUBJID == studyid_subjid)
    #was discovered and subset before 
    if (!is.null(vect))
    {
      #not subset before 
      if (length(vect)<1)
      {
      #subset the dataframe base on regex inputDF$STUDYID_SUBJID
    df <- subset(inputDF, regexpr(studyid_subjid, inputDF$STUDYID_SUBJID) > 0)

      #START OPTIMIZE
      for (i in nrow(df))
      {
      #impute , add column mean & add to list

      #apply(df[,c("y1","y2","y3","etc..")],2,function(x){x[is.na(x)] =mean(x, na.rm=TRUE)})

      if (is.na(df[i,"y1"])) {df[i,"y1"] = mean(df[,"y1"], na.rm = TRUE)}
      if (is.na(df[i,"y2"])) {df[i,"y2"] =mean(df[,"y2"], na.rm = TRUE)}
      if (is.na(df[i,"y3"])) {df[i,"y3"] =mean(df[,"y3"], na.rm = TRUE)}
      #impute using mean for CONTINUOUS variables
        if (is.na(df[i,"COVAR_CONTINUOUS_2"])) {df[i,"COVAR_CONTINUOUS_2"] =mean(df[,"COVAR_CONTINUOUS_2"], na.rm = TRUE)}
        if (is.na(df[i,"COVAR_CONTINUOUS_3"])) {df[i,"COVAR_CONTINUOUS_3"] =mean(df[,"COVAR_CONTINUOUS_3"], na.rm = TRUE)}
        if (is.na(df[i,"COVAR_CONTINUOUS_4"])) {df[i,"COVAR_CONTINUOUS_4"] =mean(df[,"COVAR_CONTINUOUS_4"], na.rm = TRUE)}
        if (is.na(df[i,"COVAR_CONTINUOUS_5"])) {df[i,"COVAR_CONTINUOUS_5"] =mean(df[,"COVAR_CONTINUOUS_5"], na.rm = TRUE)}
        if (is.na(df[i,"COVAR_CONTINUOUS_6"])) {df[i,"COVAR_CONTINUOUS_6"] =mean(df[,"COVAR_CONTINUOUS_6"], na.rm = TRUE)}
        if (is.na(df[i,"COVAR_CONTINUOUS_7"])) {df[i,"COVAR_CONTINUOUS_7"] =mean(df[,"COVAR_CONTINUOUS_7"], na.rm = TRUE)}
        if (is.na(df[i,"COVAR_CONTINUOUS_10"])) {df[i,"COVAR_CONTINUOUS_10"] =mean(df[,"COVAR_CONTINUOUS_10"], na.rm = TRUE)}
        if (is.na(df[i,"COVAR_CONTINUOUS_14"])) {df[i,"COVAR_CONTINUOUS_14"] =mean(df[,"COVAR_CONTINUOUS_14"], na.rm = TRUE)}
        if (is.na(df[i,"COVAR_CONTINUOUS_30"])) {df[i,"COVAR_CONTINUOUS_30"] =mean(df[,"COVAR_CONTINUOUS_30"], na.rm = TRUE)}
      #impute using mode ordinal & nominal values
        if (is.na(df[i,"COVAR_ORDINAL_1"]))  {df[i,"COVAR_ORDINAL_1"] =Mode(df[,"COVAR_ORDINAL_1"])}
        if (is.na(df[i,"COVAR_ORDINAL_2"]))  {df[i,"COVAR_ORDINAL_2"] =Mode(df[,"COVAR_ORDINAL_2"])}
        if (is.na(df[i,"COVAR_ORDINAL_3"]))  {df[i,"COVAR_ORDINAL_3"] =Mode(df[,"COVAR_ORDINAL_3"])}
        if (is.na(df[i,"COVAR_ORDINAL_4"]))  {df[i,"COVAR_ORDINAL_4"] =Mode(df[,"COVAR_ORDINAL_4"])}
      #nominal 
        if (is.na(df[i,"COVAR_NOMINAL_1"]))  {df[i,"COVAR_NOMINAL_1"] =Mode(df[,"COVAR_NOMINAL_1"])}
        if (is.na(df[i,"COVAR_NOMINAL_2"]))  {df[i,"COVAR_NOMINAL_2"] =Mode(df[,"COVAR_NOMINAL_2"])}
        if (is.na(df[i,"COVAR_NOMINAL_3"]))  {df[i,"COVAR_NOMINAL_3"] =Mode(df[,"COVAR_NOMINAL_3"])}
        if (is.na(df[i,"COVAR_NOMINAL_4"]))  {df[i,"COVAR_NOMINAL_4"] =Mode(df[,"COVAR_NOMINAL_4"])}
        if (is.na(df[i,"COVAR_NOMINAL_5"]))  {df[i,"COVAR_NOMINAL_5"] =Mode(df[,"COVAR_NOMINAL_5"])}
        if (is.na(df[i,"COVAR_NOMINAL_6"]))  {df[i,"COVAR_NOMINAL_6"] =Mode(df[,"COVAR_NOMINAL_6"])}
        if (is.na(df[i,"COVAR_NOMINAL_7"]))  {df[i,"COVAR_NOMINAL_7"] =Mode(df[,"COVAR_NOMINAL_7"])}
        if (is.na(df[i,"COVAR_NOMINAL_8"]))  {df[i,"COVAR_NOMINAL_8"] =Mode(df[,"COVAR_NOMINAL_8"])}

      }#for
      #END OPTIMIZE

      dfList[[counter]] <- df 
      #add to discoveredDf since already substed
      discoveredDf[nrow(discoveredDf)+1,]<- c(studyid_subjid)
      counter = counter +1;
      #for debugging to check progress
        if (counter %% 100 == 0)
        {
        print(counter)
        }
      }
    }


    Whilecounter  = Whilecounter  -1;
  }#end while
  return (dfList)

}

谢谢

【问题讨论】:

  • if (is.na(df[i, "y1"] ... 移到循环外,作为df[is.na(df$y1), "y1"] = mean(df$y1, na.rm=TRUE)。这将迭代“矢量化”,其中每列只有一个 R 调用,而不是 nrow(df) 调用。对循环中的所有行重复此操作。

标签: r lapply


【解决方案1】:

这是一个使用data.table 的非常简单快速的解决方案。

library(data.table)

# name of columns
cols <- c("a", "c")

# impute date
setDT(dt)[, (cols) := lapply(.SD, function(x) ifelse( is.na(x) & is.numeric(x), mean(x, na.rm = T),
                                               ifelse( is.na(x) & is.character(x), names(which.max(table(x))), x)))  , .SDcols = cols ]

我没有将该解决方案的性能与@Simon Jackson 提供的性能进行比较,但这应该很快。

来自可重现示例的数据

set.seed(25)
dt <- data.table(a=c(1:5,NA,NA,1,1), 
                 b=sample(1:15, 9, replace=TRUE), 
                 c=LETTERS[c(1:6,NA,NA,1)])

【讨论】:

    【解决方案2】:

    只要您在每个上使用矢量化函数,就可以通过多种方式提高性能。目前,您正在遍历每一行,然后分别处理每一列,这确实会减慢您的速度。另一个改进是概括代码,这样您就不必为每个变量输入新行。在我将在下面给出的示例中,这是因为连续变量是数字的,而分类是因子。

    要直接获得答案,您可以将要优化的代码替换为以下内容(尽管固定变量名称),前提是您的数字变量是数字且序数/分类不是(例如,因子):

    impute <- function(x) {
      if (is.numeric(x)) {  # If numeric, impute with mean
        x[is.na(x)] <- mean(x, na.rm = TRUE)
      } else {                # mode otherwise
        x[is.na(x)] <- names(which.max(table(x)))
      }
      x
    }
    
    # Correct cols_to_impute with names of your variables to be imputed
    # e.g., c("COVAR_CONTINUOUS_2", "COVAR_NOMINAL_3", ...)  
    cols_to_impute <- names(df) %in% c("names", "of", "columns")
    library(purrr)
    df[, cols_to_impute] <- dmap(df[, cols_to_impute], impute)
    

    以下是五种方法的详细比较:

    • 您使用for 对行进行迭代的原始方法;然后分别处理每一列。
    • 使用for 循环。
    • 使用lapply()
    • 使用sapply()
    • 使用purrr 包中的dmap()

    新方法都按列在数据帧上进行迭代,并使用称为impute 的向量化函数,该函数用平均值(如果是数字)或众数来估算向量中的缺失值(否则)。否则,它们的差异相对较小(您将看到 sapply() 除外),但检查起来很有趣。

    以下是我们将使用的实用函数:

    # Function to simulate a data frame of numeric and factor variables with
    # missing values and `n` rows
    create_dat <- function(n) {
      set.seed(13)
      data.frame(
        con_1 = sample(c(10:20, NA), n, replace = TRUE),   # continuous w/ missing
        con_2 = sample(c(20:30, NA), n, replace = TRUE),   # continuous w/ missing
        ord_1 = sample(c(letters, NA), n, replace = TRUE), # ordinal w/ missing
        ord_2 = sample(c(letters, NA), n, replace = TRUE)  # ordinal w/ missing
      )
    }
    
    # Function that imputes missing values in a vector with mean (if numeric) or
    # mode (otherwise)
    impute <- function(x) {
      if (is.numeric(x)) {  # If numeric, impute with mean
        x[is.na(x)] <- mean(x, na.rm = TRUE)
      } else {                # mode otherwise
        x[is.na(x)] <- names(which.max(table(x)))
      }
      x
    }
    

    现在,每种方法的包装函数:

    # Original approach
    func0 <- function(d) {
      for (i in 1:nrow(d)) {
        if (is.na(d[i, "con_1"])) d[i,"con_1"] <- mean(d[,"con_1"], na.rm = TRUE)
    
        if (is.na(d[i, "con_2"])) d[i,"con_2"] <- mean(d[,"con_2"], na.rm = TRUE)
    
        if (is.na(d[i,"ord_1"])) d[i,"ord_1"] <- names(which.max(table(d[,"ord_1"])))
    
        if (is.na(d[i,"ord_2"])) d[i,"ord_2"] <- names(which.max(table(d[,"ord_2"])))
      }
      return(d)
    }
    
    # for loop operates directly on d
    func1 <- function(d) {
      for(i in seq_along(d)) {
        d[[i]] <- impute(d[[i]])
      }
      return(d)
    }
    
    # Use lapply()
    func2 <- function(d) {
      lapply(d, function(col) {
        impute(col)
      })
    }
    
    # Use sapply()
    func3 <- function(d) {
      sapply(d, function(col) {
        impute(col)
      })
    }
    
    # Use purrr::dmap()
    func4 <- function(d) {
      purrr::dmap(d, impute)
    }
    

    现在,我们将比较这些方法的性能,n 范围为 10 到 100(非常小):

    library(microbenchmark)
    ns <- seq(10, 100, by = 10)
    times <- sapply(ns, function(n) {
      dat <- create_dat(n)
      op <- microbenchmark(
        ORIGINAL = func0(dat),
        FOR_LOOP = func1(dat),
        LAPPLY   = func2(dat),
        SAPPLY   = func3(dat),
        DMAP     = func4(dat)
      )
      by(op$time, op$expr, function(t) mean(t) / 1000)
    })
    times <- t(times)
    times <- as.data.frame(cbind(times, n = ns))
    
    # Plot the results
    library(tidyr)
    library(ggplot2)
    
    times <- gather(times, -n, key = "fun", value = "time")
    pd <- position_dodge(width = 0.2)
    ggplot(times, aes(x = n, y = time, group = fun, color = fun)) +
      geom_point(position = pd) +
      geom_line(position = pd) +
      theme_bw()
    

    很明显,原始方法比在每列上使用矢量化函数impute 的新方法慢得多。新品之间的区别是什么?让我们增加样本量来检查:

    ns <- seq(5000, 50000, by = 5000)
    times <- sapply(ns, function(n) {
      dat <- create_dat(n)
      op <- microbenchmark(
        FOR_LOOP = func1(dat),
        LAPPLY   = func2(dat),
        SAPPLY   = func3(dat),
        DMAP     = func4(dat)
      )
      by(op$time, op$expr, function(t) mean(t) / 1000)
    })
    times <- t(times)
    times <- as.data.frame(cbind(times, n = ns))
    times <- gather(times, -n, key = "fun", value = "time")
    pd <- position_dodge(width = 0.2)
    ggplot(times, aes(x = n, y = time, group = fun, color = fun)) +
      geom_point(position = pd) +
      geom_line(position = pd) +
      theme_bw()
    

    看起来sapply() 不是很好(正如@Martin 指出的那样)。这是因为sapply() 正在做额外的工作来将我们的数据变成矩阵形状(我们不需要)。如果您在没有sapply() 的情况下自己运行此程序,您会发现其余方法都相当相似。

    因此,主要的性能改进是在每列上使用矢量化函数。我建议一开始就使用dmap,因为我是函数风格和purrr 包的粉丝,但您可以轻松地替换您喜欢的任何方法。

    除此之外,非常感谢@Martin 非常有用的评论让我改进了这个答案!

    【讨论】:

    • 不错的答案。费用在使用sapply();与lapply() 比较(for 循环修改 d,最好用d 作为参数然后system.time(f0(d)) 编写函数,另见微基准包)。 ifelse() 不是用来测试标量条件(两个结果都被评估)所以使用一个普通的旧 if () else。 l/sapply 方法的警告是有意义的——两种方法的结果不相同,使用names(which.max(table(col)))。一般来说,得到正确的答案比快速的答案更重要——identical(f0(d), f1(d))
    • @SimonJackson 你不需要 as.data.frame 来进行 lapply。可以像 'dmap' 一样写:df[, cols_to_impute] &lt;- lapply(df[, cols_to_impute], impute).
    • @GregoryDe​​min 好点,也适用于 sapply() ,不是吗?进行了编辑。
    • 很好的答案。我也喜欢purrr 语法。任何想法为什么它对最小的 n 表现如此之好?
    【解决方案3】:

    如果您要使用看起来像矩阵的东西,那么请使用矩阵而不是数据框,因为像矩阵一样对数据框进行索引是非常昂贵的。您可能希望将数值提取到矩阵中以进行部分计算。这可以显着提高速度。

    【讨论】:

      猜你喜欢
      • 2022-01-15
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 2017-07-12
      相关资源
      最近更新 更多