How to prevent RStudio from crashing? (Answers)

【Question title】: How to prevent RStudio from crashing?
【Posted】: 2021-04-19 08:21:56
【Question description】:

I am currently working on a machine learning project for my exam. My computer has 32 GB of RAM and a 12-core i7. My session info is as follows:

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8      
 [2] LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8      
 [8] LC_NAME=C                 
 [9] LC_ADDRESS=C              
[10] LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8
[12] LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils    
[6] datasets  methods   base     

other attached packages:
 [1] forcats_0.5.0   stringr_1.4.0   dplyr_1.0.2    
 [4] purrr_0.3.4     readr_1.4.0     tidyr_1.1.2    
 [7] tibble_3.0.4    tidyverse_1.3.0 here_1.0.1     
[10] caret_6.0-86    ggplot2_3.3.3   lattice_0.20-41

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5           lubridate_1.7.9.2   
 [3] class_7.3-17         assertthat_0.2.1    
 [5] rprojroot_2.0.2      ipred_0.9-9         
 [7] foreach_1.5.1        R6_2.5.0            
 [9] cellranger_1.1.0     plyr_1.8.6          
[11] backports_1.2.1      reprex_0.3.0        
[13] stats4_4.0.3         httr_1.4.2          
[15] pillar_1.4.7         rlang_0.4.10        
[17] readxl_1.3.1         rstudioapi_0.13     
[19] data.table_1.13.6    rpart_4.1-15        
[21] Matrix_1.3-2         splines_4.0.3       
[23] gower_0.2.2          munsell_0.5.0       
[25] broom_0.7.3          compiler_4.0.3      
[27] modelr_0.1.8         pkgconfig_2.0.3     
[29] nnet_7.3-14          tidyselect_1.1.0    
[31] prodlim_2019.11.13   codetools_0.2-18    
[33] fansi_0.4.1          crayon_1.3.4        
[35] dbplyr_2.0.0         withr_2.3.0         
[37] MASS_7.3-53          recipes_0.1.15      
[39] ModelMetrics_1.2.2.2 grid_4.0.3          
[41] nlme_3.1-151         jsonlite_1.7.2      
[43] gtable_0.3.0         lifecycle_0.2.0     
[45] DBI_1.1.0            magrittr_2.0.1      
[47] pROC_1.16.2          scales_1.1.1        
[49] cli_2.2.0            stringi_1.5.3       
[51] reshape2_1.4.4       fs_1.5.0            
[53] timeDate_3043.102    xml2_1.3.2          
[55] ellipsis_0.3.1       generics_0.1.0      
[57] vctrs_0.3.6          lava_1.6.8.1        
[59] iterators_1.0.13     tools_4.0.3         
[61] glue_1.4.2           hms_0.5.3           
[63] survival_3.2-7       colorspace_2.0-0    
[65] rvest_0.3.6          haven_2.3.1

My data is 50,000 x 30. Initially, I trained the models for both classification and regression problems with the following code:

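# Packages; `Algorithms`, `df` and `myFolds` are defined elsewhere in the project
library(caret)
library(doParallel)

models <- list()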

# Generate cluster
genCluster <- makeCluster(
  spec = detectCores() - 1
)

registerDoParallel(
  cl = genCluster
)

set.seed(1903)
system.time(
  for (i in seq_along(Algorithms)){

    # train models
    suppressWarnings(
      models[[i]] <- train(
        form = Y ~ .,
        data = df,
        method = Algorithms[i],
        trControl = trainControl(
          method = "repeatedcv",
          number = 10,
          repeats = 3,
          index = myFolds,
          verboseIter = FALSE,
          allowParallel = TRUE
        )
      )
    )

  }
)

stopCluster(
  cl = genCluster
)


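Before I run the whole script, I draw a random sample from my data to test whether the script works. In these test runs I usually use 2,000 observations, and this usually works fine.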

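However, whenever I use the entire dataset, I get either an unserialize error or some related "dead" worker error. If that does not happen, my R session crashes instead. NOTE: This also happens when I run the same code on a 64-core, 320 GB RAM instance on my university's supercomputer.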

How I tried to solve the problem

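Instead of using the maximum number of cores, I used a number equal to the number of folds k, so 10. This (sort of) helped with the worker/core-related errors, which in my case seem to be fairly random. However, the R session crashes persist.
I decided to stop using RStudio and run my scripts from the terminal instead. However, every relative path in my scripts is relative to the project root directory, and changing that across 30+ scripts seems disproportionate when RStudio ought to just work. For some odd reason, setwd() in the R terminal does not affect the sub-scripts!
Before executing each heavy script, I tried cleaning the environment and the memory.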

rm(
  list=setdiff(
    ls(), 
    c("importantParameters",
      "train.data",
      "estimateFoo",
      "bestPick")
  )
)


gc(full = TRUE, verbose = FALSE)

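This changed nothing with regard to the crashes or the worker/core-related errors.

My new approach

After giving up on this, I took a new approach and switched to mclapply. It is considerably slower and does not work the way I expected. Note that I set allowParallel = F in this version, because I want mclapply to train all models in the list concurrently. Judging from my system monitor, that is not what happens.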

estimateFoo <- function(algorithms, equation, cores, plot = F, data, trainObject, type = NULL, plot.name = NULL, metric = c("RMSE")){
  
  # Packages
  require(parallel)
  require(caret)
  require(tidyverse)
  
  # This function estimates all algorithms. Must be provided by a vector of characters.
  # FULL TrainObjects from Caret has to be provided.
  # If plot == T it plots in a tryCatch fashion, to avoid Errors.
  # NOTE: Type has to be oneof classification or regression (As the folders are named.)
  
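  # Fit every algorithm in a forked worker. The error handler makes a
  # failed fit return NULL instead of a "try-error" object.
  trainedModels <- suppressWarnings(mclapply(
      X = algorithms,
      FUN = function(x){
        
        tryCatch(
          train(
            form   = equation,
            data   = data,
            method = x,
            trControl = trainObject
          ),
          error = function(e) NULL
        )
        
      },
      mc.cores = cores
    )
  )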
  
  
  
  # Flag failed fits: mclapply returns a "try-error" object when a call
  # errors and NULL when a worker dies. inherits() is used because class()
  # of a caret model has more than one element.
  failed <- vapply(
    trainedModels,
    FUN = function(m) is.null(m) || inherits(m, "try-error"),
    FUN.VALUE = logical(1)
  )

  # Keep and name the successful fits only
  trainedModels <- trainedModels[!failed]
  names(trainedModels) <- algorithms[!failed]

  # If plot is TRUE, plot the resampling results of all models and save them
  if (isTRUE(plot)){

    modelResample <- resamples(trainedModels)

    print(
      dotplot(
        modelResample,
        metric = metric,
        scales = list(x = list(relation = "free"),
                      y = list(cex = 1.2))
      )
    )

    dev.copy(pdf, here("results", "models", paste(type), paste(plot.name)))
    dev.off()

  }

  # Failed fits were already dropped above. Indexing with -which(...) on an
  # empty index would return an empty list.
  return(trainedModels)
}
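
One pitfall in this kind of cleanup is worth spelling out: dropping dead workers with a negative index silently returns an empty list when there are no dead workers. A minimal base R sketch, in which the list `models` and its contents are made-up stand-ins for fitted caret models:

```r
# Toy stand-ins for fitted models; none of them is NULL
models <- list(rf = "fit_rf", glm = "fit_glm")

# which() finds no NULL entries, so `dead` is integer(0)
dead <- which(vapply(models, is.null, logical(1)))

# -integer(0) is still integer(0), so this selects nothing
wrong <- models[-dead]
length(wrong)  # 0

# Filtering with a logical test keeps every non-NULL element
safe <- Filter(Negate(is.null), models)
length(safe)   # 2
```

Filtering on a logical condition avoids the problem, because an all-FALSE "is dead" vector keeps everything instead of dropping everything.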

This new approach works, although it is slower. However, my R session still crashes!
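
One thing that may be worth checking when mclapply does not seem to run all tasks at once is its `mc.preschedule` argument. By default (`TRUE`) the inputs are split up front into at most `mc.cores` batches; with `FALSE` one process is forked per input. The sketch below uses toy tasks instead of caret models (`tasks` and `fit_one` are hypothetical names) and assumes Linux or macOS, since forking is not available on Windows. Whether this explains the behaviour seen in the system monitor is not established here.

```r
library(parallel)

# Hypothetical stand-ins for caret method names
tasks <- c("ok_1", "fail", "ok_2")

fit_one <- function(name) {
  tryCatch({
    if (name == "fail") stop("simulated failure")
    paste("fitted", name)
  }, error = function(e) NULL)  # a failed task yields NULL instead of an error
}

# One forked process per task, at most two running at a time
results <- mclapply(tasks, fit_one, mc.cores = 2, mc.preschedule = FALSE)
names(results) <- tasks

# TRUE marks the tasks that failed
vapply(results, is.null, logical(1))
```

The documentation of the parallel package also discourages using forked workers from GUI front ends such as RStudio, so running such a script with Rscript from a terminal may behave differently.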

What should I do? How do I do machine learning in R properly without losing my mind and wasting four days getting R to run all my code without crashing?

【Question discussion】:

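I realize my post is not as "general" as it should be. However, I am sure I am not the only one with this problem; if we can find a solution, we could turn this post into a "how-to" post, so that besides solving my problem it benefits everyone!
What does htop show prior to the crash? I work only in the terminal, where one crash is much like another... Have you looked at the /var/crash/whoopsie output? Obviously, I think this is a memory problem, and I note your efforts to clear the environment. Perhaps throw in gc()[i],
I would try mlr3. I find it more stable when running in parallel, and it only takes one extra line of code: mlr3book.mlr-org.com/parallelization.html
Ubuntu focal, apt-get, htop interactive process viewer (terminal), which I used to see why I kept crashing 8G of memory while running models, or trying to. gc()[i] or gc()[x] may or may not release unused objects that were spawned and forgotten. Where to stuff it, where to place it in the ) hierarchy, eludes me, especially when it comes to tracking anonymous function indexes... Unfortunately, it is a voodoo approach.
@Serkan mlr3 is released on CRAN. It is split into multiple packages, and the easiest way to install them is to install mlr3verse. The journal article about mlr3 is here; to learn mlr3, it is best to read the book and look at the gallery examples.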

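Tags: r parallel-processing rstudio r-caret

【Solution 1】:

I will answer my own question with the help of the comments I received. If anyone has something to contribute, or finds this post irrelevant, please flag it for deletion.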

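R session crashes are mainly caused by running out of memory. So if you are training models with a grid search, you need a rough estimate of how much RAM the search needs in order to run smoothly. Whether RAM usage can be limited by changing some of the function arguments, such as setting returnData = F, is something I did not test due to time constraints.
Training your models with allowParallel = T makes every worker use roughly the same amount of RAM, so total RAM usage grows approximately linearly with the number of workers, and you quickly run out of RAM when models are trained simultaneously.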
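
The untested `returnData = F` idea above could look like the following control object. This is only a sketch, assuming the caret package is installed; `leanControl` is a hypothetical name, the fold index from the question is left out to keep the example self-contained, and the thread does not establish whether these settings are enough to prevent the crashes.

```r
library(caret)

# Hypothetical memory-lean control object (not tested on the question's data)
leanControl <- trainControl(
  method          = "repeatedcv",
  number          = 10,
  repeats         = 3,
  returnData      = FALSE,    # do not store a copy of the training data in each model
  returnResamp    = "final",  # keep resampling results for the final tuning values only
  savePredictions = FALSE,    # do not keep hold-out predictions
  trim            = TRUE,     # let caret strip parts of the final model where supported
  allowParallel   = FALSE     # train sequentially while memory use is being measured
)
```

Passing `trControl = leanControl` to `train()` mainly shrinks the returned model objects; it does not change how much memory an algorithm needs while fitting.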

So, for now, the solution has to be to get more RAM, reduce the size of the data, or restrict the grid search.

Do not use allowParallel = T without considering how much memory you have. This was new to me. I hope it helps you as much as it helped me.

【Discussion】:
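
The rule of thumb above can be turned into a small calculation. The helper below is a hypothetical sketch in base R: `choose_workers` is not part of any package, and `per_worker_gb` is an assumption that has to be measured, for example by watching htop during a single sequential fit.

```r
# Hypothetical helper: cap the number of workers by a memory budget.
# ram_gb        total RAM of the machine
# per_worker_gb assumed peak RAM of one worker (measure it, do not guess)
# reserve_gb    RAM left for the operating system and the main R session
# max_workers   upper limit, e.g. the number of cores minus one
choose_workers <- function(ram_gb, per_worker_gb, max_workers, reserve_gb = 2) {
  by_memory <- floor((ram_gb - reserve_gb) / per_worker_gb)
  as.integer(max(1, min(max_workers, by_memory)))
}

# The machine from the question: 32 GB RAM, 12 cores, assuming 4 GB per worker
choose_workers(ram_gb = 32, per_worker_gb = 4, max_workers = 11)  # 7
```

The result can be passed as `spec` to `makeCluster()` in place of `detectCores() - 1`.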