R Programming Speed [closed]: Answers

【Question Title】: R Programming Speed [closed]
【Posted】: 2015-10-04 23:54:42
【Question】:

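Is there any way to speed up the R code below by using the apply functions? I don't yet fully understand how the apply functions work, but I understand they can help reduce the run time of for loops.

Here is the code; thanks for your input!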

for(i in 1:(365 + leap)){
    ## Filename
    IMS.file = paste('crb_',km,'km_v00_',year,
                     sprintf("%03d", i),'.asc.csv',sep='')

    ## The next step is to check that file exists
    if(file.exists(paste(IMS.folder,'/',IMS.file,sep=''))){

        ## Reads in file
        dat = read.table(paste(IMS.folder,'/',IMS.file,sep=''),
            sep=',')

        ## This removes all the values outside of the basin
        point.in.polygon(dat[,4],dat[,3],basin.coord[,1],basin.coord[,2])
        xy = as.logical(point.in.polygon(dat[,4],dat[,3],basin.coord[,1],basin.coord[,2]))

        ## Keeps points of interest, eliminates first 2 columns and reverses lat & lon
        dat = dat[xy,3:5]

        ## Changes 4 to snow (1) and all others to 0
        dat[,3] <- ifelse(dat[,3] == 4,1,0)

        ## Creates a new file (may need to be updated for new computers)
        newfile = paste('C:/Users/Ben/Documents/Columbia Project/',
                        'ColumbiaRiverBasin_Report/IMS',km,'SubBasins/',subbasin,
                        '/',year,'/',year,sprintf("%3d", i),'.csv',sep='')

        ## Adds the file header to the spreadsheet
        write.table(file.head,file=newfile,quote=FALSE,row.names=FALSE,
            col.names=FALSE)

        ## Appends file with the trimmed data
        write.table(dat,file=newfile,append=TRUE,
            quote=FALSE,row.names=FALSE,col.names=FALSE,sep=',')
    }
    else{
        ## Outputs missing day (1-366 )
        print(paste('Missing Day Number: ', i),quote=FALSE)
    }
}

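【Comments】:

Probably not: (1) the apply functions usually give at most a very slight/subtle speed-up (search for and read Patrick Burns's R Inferno); (2) vectorization is usually what you need, where it is possible; (3) if you have to process the daily files one at a time, you may be out of luck. The one exception is that you could use the parallel versions of apply() from the parallel package...
PS I don't know where your bottleneck is (most likely point.in.polygon or file I/O), but ... is there a reason you uselessly run point.in.polygon(dat[,4],dat[,3],basin.coord[,1],basin.coord[,2]) once before running it a second time and storing the result?
Ben - thanks; I hadn't even noticed that, and removing it did speed things up.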
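
The parallel route mentioned in the first comment could look roughly like the sketch below. It is only an illustration, not the asker's actual pipeline: process_day is a hypothetical stand-in for the per-file work, its data are made up, and the two-worker cluster size is an arbitrary assumption.

```r
library(parallel)  # ships with R

## Hypothetical stand-in for the per-day work (read, filter, write).
## It only counts the snow code (4) in a made-up vector.
process_day <- function(day) {
    codes <- c(1, 4, 2, 4, 3) + 0 * day   # placeholder data, not real IMS values
    sum(codes == 4)                       # number of snow cells for that day
}

days <- 1:10

## Serial version: lapply does the same work as the for loop.
serial <- lapply(days, process_day)

## Parallel version: parLapply uses worker processes and also runs on Windows,
## whereas mclapply relies on forking.
cl <- makeCluster(2)                      # assumed worker count
par_res <- parLapply(cl, days, process_day)
stopCluster(cl)

identical(serial, par_res)
```

Whether this helps depends on where the time goes: if the loop is dominated by disk I/O, extra workers may add little.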

Tags: r

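【Solution 1】:

I have seen significant speed improvements with the data.table package. fread in particular is excellent, and it lets you select specific columns instead of reading all of them. If your files are large, this will speed things up considerably. So your read would become: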

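## Reads in file
## Not this: dat = read.table(paste(IMS.folder,'/',IMS.file,sep=''),
##        sep=',')

## This instead. select takes the column names or numbers you want;
## 3:5 is only an example matching the columns the question keeps,
## and the kept columns are renumbered 1 to 3 afterwards.
library(data.table)
dat <- fread(paste(IMS.folder,'/',IMS.file,sep=''), select = 3:5)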
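
To make the column selection concrete, here is a small self-contained sketch. It assumes the data.table package is installed; the temporary file and its five made-up columns only mimic the layout implied by the question (two unused columns, then latitude, longitude, and a surface code).

```r
library(data.table)

## Write a tiny made-up CSV with five columns and no header row.
tmp <- tempfile(fileext = ".csv")
writeLines(c("0,0,45.1,-120.3,4",
             "0,1,45.2,-120.4,2",
             "1,0,45.3,-120.5,4"), tmp)

## Read only columns 3 to 5. data.table = FALSE returns a plain data.frame,
## so base-style indexing such as dat[, 3] yields a vector.
dat <- fread(tmp, sep = ",", header = FALSE, select = 3:5, data.table = FALSE)

## The selected columns now sit at positions 1 to 3; the names are assumptions.
names(dat) <- c("latitude", "longitude", "code")
dat$snow <- as.integer(dat$code == 4)   # 4 means snow, as in the question
print(dat)
unlink(tmp)
```

Because the kept columns are renumbered, any later code that indexes by position (dat[,3], dat[,4]) would need adjusting, which is one more reason to prefer column names.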

【Comments】:

Very helpful! This is great; I never knew about this function. Thanks!

【Solution 2】:

library(dplyr)
library(sp)  # provides point.in.polygon

read_special = function(sub_df)
  with(sub_df,
       "crb_%skm_v00_%s%03d.asc.csv" %>%
         sprintf(first(km), first(year), first(day)) %>%
         paste(IMS.folder, . , sep='/') %>%
         read.table(sep = ",",
                    col.names = c(
                      "trash1",
                      "trash2",
                      "latitude",
                      "longitude",
                      "some_other_variable") ) )

# note: newer dplyr versions deprecate failwith() in favor of purrr::possibly()
read_special = failwith(data_frame(), read_special)

# start by building a dataframe called catalog with columns year, day, and km 
# for all the files you think you have

catalog %>%
  group_by(year, day, km) %>%
  do(read_special(.)) %>%
  select(-trash1, -trash2) %>%
  # point.in.polygon returns 0 for points outside the polygon;
  # basin.coord is assumed to have columns named longitude and latitude
  filter(point.in.polygon(longitude, latitude,
                          basin.coord$longitude, basin.coord$latitude) > 0) %>%
  write.csv("C:/Users/Ben/Documents/Columbia Project/all_data.csv",
            row.names = FALSE)

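Although I don't know how to do it myself, I agree with the suggestion to use fread (or library(readr)). A few notes: whenever possible, use column names instead of numbers; it makes everything much clearer. failwith is a little dangerous, because it can mask errors other than a nonexistent csv, and that risk is real. I strongly recommend writing all of your data into exactly one csv file.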
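
The two cautions above (do not swallow unrelated errors, and write a single output file) can be sketched in base R as follows. This is an illustration under stated assumptions, not the answerer's code: read_day is a hypothetical helper, the temporary folder and file contents are made up, and the column names are the placeholder names used in this answer plus an assumed "code" column.

```r
## Set up a temporary folder with made-up files for days 1 and 3 (day 2 is missing).
IMS.folder <- tempfile("ims_")
dir.create(IMS.folder)
for (day in c(1, 3)) {
    writeLines(c("0,0,45.1,-120.3,4", "0,1,45.2,-120.4,2"),
               file.path(IMS.folder, sprintf("crb_4km_v00_2015%03d.asc.csv", day)))
}

## Read one day, returning NULL only when the file is absent.
## Any other problem (for example a malformed file) still raises an error.
read_day <- function(day, km = 4, year = 2015) {
    path <- file.path(IMS.folder,
                      sprintf("crb_%skm_v00_%s%03d.asc.csv", km, year, day))
    if (!file.exists(path)) {
        message("Missing Day Number: ", day)
        return(NULL)
    }
    dat <- read.table(path, sep = ",",
                      col.names = c("trash1", "trash2", "latitude",
                                    "longitude", "code"))
    dat$day <- day
    dat[, c("day", "latitude", "longitude", "code")]
}

## rbind ignores the NULL entries, so missing days drop out of the result.
all_data <- do.call(rbind, lapply(1:3, read_day))

## One output file instead of one per day.
out_file <- file.path(IMS.folder, "all_data.csv")
write.csv(all_data, out_file, row.names = FALSE)
```

Checking file.exists explicitly keeps the "missing day" case separate from genuine read failures, which a blanket failwith wrapper would hide.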

【Comments】: