Loading and processing 1000s of files in R

【Question title】: Loading and processing 1000s of files in R
【Posted】: 2012-06-21 23:12:34
【Question】:

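I'm somewhat new to R, so please forgive the newbie question...

I'm writing R code that loads 1000s of saved data frames (files) in a script that runs a function on the data in each file and stores the resulting value in a vector. I have to do this over and over with different functions, and at the moment it takes a very long time.

I'm trying to parallelize the process using mclapply from the multicore package, but unfortunately anything from 2 to 8 cores seems to take longer than just running it on one core.

Is this idea fundamentally unsound because of disk I/O limits? Is multicore, or even R, not the right solution? Would opening the files with something like Python and then running the R functions on the contents be a better path than R alone?

Any guidance or thoughts on this would be greatly appreciated.

Code added for clarity: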

    library(multicore)

    project.path = "/pathtodata/"

    #This function takes one row of data_set_list (file name, object name), then loads the file and runs a simple statistic
    running_station_stats <- function(data_set_row)
    {
      varname <- "column_name"
      load(file = paste(project.path, "data/", data_set_row[1], sep = ""))
      tempobj <- as.data.frame(coredata(get(data_set_row[2])))
      mean(tempobj[[varname]],na.rm=TRUE)
    }



    options(cores = 2)

    #This file has a list of R data files data_set_list[1] and the names they were created with data_set_list[2]
    load(file = paste(project.path, "data/data_set_list.RData", sep = ""))

    thelist <- list()

    thelist[[1]] <- data_set_list[1:50,]

    thelist[[2]] <- data_set_list[51:100,]

    thelist[[3]] <- data_set_list[101:150,]

    thelist[[4]] <- data_set_list[151:200,]


    #All three of these are about the same speed to run regardless of the num of cores
    system.time(
    {
      apply(nsrdb_stations[which(nsrdb_stations$org_stations==TRUE),][1:200,],1,running_station_stats)
    })

    system.time(
      lapply(thelist, apply, 1, running_station_stats)
     )

    system.time(
      mclapply(thelist, apply, 1, running_station_stats)
    )

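【Comments】:

Unless you show us what your code is doing, there is no way to know; others can only guess. You can test for yourself whether file access or something else is the limiting factor. Once you reduce the problem to its component parts, you will start to answer the question yourself.
If the data is already in data frames, then marshalling R's native format through another language will probably only slow things down. But as mdsumner says, we can only guess based on what you have shown.
mdsumner - if something in this simplified code example jumps out at you as a show-stopper, please let me know. If I find something that speeds up the processing, I'll post it here.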
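
One way to act on the first comment's suggestion is to time the loading and the computation separately. The following is only a rough Python 3 sketch under stated assumptions: the files are pickled lists of numbers rather than the .RData files from the question, and `time_load_and_compute`, `make_sample_files` and `mean` are hypothetical helper names invented for illustration.

```python
import os
import pickle
import tempfile
import time


def time_load_and_compute(paths, compute):
    """Return (seconds spent loading files, seconds spent in compute)."""
    load_seconds = 0.0
    compute_seconds = 0.0
    for path in paths:
        start = time.perf_counter()
        with open(path, "rb") as handle:
            data = pickle.load(handle)
        load_seconds += time.perf_counter() - start

        start = time.perf_counter()
        compute(data)
        compute_seconds += time.perf_counter() - start
    return load_seconds, compute_seconds


def mean(values):
    """Stand-in for the per-file statistic."""
    return sum(values) / len(values)


def make_sample_files(directory, count=20, size=50000):
    """Write `count` pickled lists of floats (illustrative data only)."""
    paths = []
    for i in range(count):
        path = os.path.join(directory, "station_%03d.pkl" % i)
        with open(path, "wb") as handle:
            pickle.dump([float(j) for j in range(size)], handle)
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        files = make_sample_files(tmp)
        load_s, compute_s = time_load_and_compute(files, mean)
        print("load: %.3fs, compute: %.3fs" % (load_s, compute_s))
```

Note that the load figure includes deserialization as well as disk reads, and freshly written sample files will most likely be served from the operating system's cache, so the numbers only become meaningful when run against the real files. If loading dominates, adding cores is unlikely to help much.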

Tags: python r multicore

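【Solution 1】:

Both Python and R will try to use multiple cores for things like number crunching. That does not help with reading large numbers of files. Multithreading is not the answer either (re: Python's GIL).

Some possible solutions (none of them simple) are:

Use something like Twisted where (some of) the file I/O can be made asynchronous. Hard to program, and not very numpy-friendly.
Use Celery or some other home-grown master/slave solution. Lots of roll-your-own work.
Use IPython (with ipcluster) to spawn multiple processes whose results Python will recombine for you (best solution, IMO).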

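【Discussion】:

Thanks for the input and suggestions. I'll report back if any of them work out.
The GIL is a Python threading issue. The multicore package in R does not suffer from it. However, reading lots of files is limited more by the available disk I/O than by the number of threads. I believe the Python multiprocessing library is not affected by the GIL. This solution is similar to IPython.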

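【Solution 2】:

I would first try good old-fashioned multiprocessing in Python. The options above are also possible. Here is an example of running batch jobs with the multiprocessing module.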


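    import multiprocessing as mp
    import time

    def worker(x):
        # Stand-in for the real per-file work
        time.sleep(0.2)
        print("x= %s, x squared = %s" % (x, x*x))
        return x*x

    def apply_async():
        pool = mp.Pool()  # defaults to one worker process per CPU core
        for i in range(100):
            pool.apply_async(worker, args = (i, ))
        pool.close()  # no more tasks will be submitted
        pool.join()   # wait for all submitted tasks to finish

    if __name__ == '__main__':
        apply_async()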

The output looks like this:


    x= 0, x squared = 0
    x= 1, x squared = 1
    x= 2, x squared = 4
    x= 3, x squared = 9
    x= 4, x squared = 16
    x= 6, x squared = 36
    x= 5, x squared = 25
    x= 7, x squared = 49
    x= 8, x squared = 64
    x= 10, x squared = 100
    x= 11, x squared = 121
    x= 9, x squared = 81
    x= 12, x squared = 144

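As you can see, the numbers are not in order because they are executed asynchronously. Just change the worker() function above to do your processing, and perhaps change the number of concurrent processes with mp.Pool(10) or mp.Pool(15) or some other value. Something like this should be relatively straightforward...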
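
As a rough illustration of what "change the worker() function to do your processing" might look like, here is a Python 3 sketch. It rests on assumptions that are not in the question: the data has been exported to CSV files (Python cannot read the .RData files directly without extra libraries), and `column_mean`, `write_sample_files` and `process_files` are hypothetical names. It also swaps `apply_async` for `Pool.map`, which returns the results in the same order as the inputs, something that matters if the values are to be stored in a vector.

```python
import csv
import multiprocessing as mp
import os
import tempfile

COLUMN = "column_name"  # placeholder, mirrors varname in the question


def column_mean(path):
    """Read one CSV file and return the mean of COLUMN, skipping blanks."""
    total = 0.0
    count = 0
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            value = row[COLUMN]
            if value != "":
                total += float(value)
                count += 1
    return total / count if count else float("nan")


def write_sample_files(directory, n_files=8, n_rows=100):
    """Create small CSV files with a single COLUMN (illustrative data only)."""
    paths = []
    for i in range(n_files):
        path = os.path.join(directory, "station_%d.csv" % i)
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow([COLUMN])
            for j in range(n_rows):
                writer.writerow([i + j])
        paths.append(path)
    return paths


def process_files(paths, processes=4):
    """Run column_mean over paths in a process pool; results keep input order."""
    with mp.Pool(processes) as pool:
        return pool.map(column_mean, paths)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        files = write_sample_files(tmp)
        for path, result in zip(files, process_files(files)):
            print(os.path.basename(path), result)
```

Whether this ends up faster than a single process still depends on how much of the time goes to disk I/O, as discussed under Solution 1.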

【Discussion】: