Apply an R script over multiple .txt files in a folder

【Question title】: Apply an R script over multiple .txt files in a folder
【Posted】: 2020-01-17 20:53:23
【Question】:

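I am very new to building functions and loops. I have looked at previous questions similar to mine, but I can't seem to find a solution to my problem. My goal is to extract climate data from a webpage like this: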

https://mesonet.agron.iastate.edu/cgi-bin/request/coop.py?network=NECLIMATE&stations=NE3065&year1=2020&month1=1&day1=1&year2=2020&month2=12&day2=31&vars%5B%5D=gdd_50_86&model=apsim&what=view&delim=comma&gis=no&scenario_year=2019

I will be using the data to calculate growing degree days (GDD) for a crop growth model. I have had success pulling the data using a for loop.

uticaNE <- "https://mesonet.agron.iastate.edu/cgi-bin/request/coop.py?network=NECLIMATE&stations=NE8745&year1=2020&month1=1&day1=1&year2=2020&month2=12&day2=31&vars%5B%5D=gdd_50_86&model=apsim&what=view&delim=comma&gis=no&scenario_year=2019"

friendNE <- "https://mesonet.agron.iastate.edu/cgi-bin/request/coop.py?network=NECLIMATE&stations=NE3065&year1=2020&month1=1&day1=1&year2=2020&month2=12&day2=31&vars%5B%5D=gdd_50_86&model=apsim&what=view&delim=comma&gis=no&scenario_year=2019"

location.urls <- c(uticaNE, friendNE)
location.meso.files <- c("uticaNe.txt", "friendNE.txt")

for(i in seq_along(location.urls)){
  download.file(location.urls[i], location.meso.files[i], method="libcurl")
}

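I will have about 20 locations to pull data for each day. What I want to do is apply a task that calculates Fahrenheit temperatures, GDD, etc. to each file and save the output for each file separately.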

This is the code I currently have.

files <- list.files(pattern="*.txt", full.names=TRUE, recursive=FALSE)

  func <- for (i in 1:length(files)){
  df <- read.table(files[i], skip=10, stringsAsFactors = 
  FALSE)
  colnames(df) <- c("year", "day", "solrad", "maxC", 
  "minC", "precipmm")
  df$year <- as.factor(df$year)
  df$day <- as.factor(df$day)
  df$maxF <- (df$maxC * (9/5) + 32)
  df$minF <- (df$minC * (9/5) + 32)
  df$GDD <- (((df$maxF + df$minF)/2)-50)
  df$GDD[df$GDD <= 0] <- 0
  df$GDD.cumulative <- cumsum(df$GDD)
  df$precipmm.cumulative <- cumsum(df$precipmm)
  return(df)
  write.table(df, path="./output", quote=FALSE, 
  row.names=FALSE, col.names=TRUE)
}

data <- apply(files, func)

Any help would be greatly appreciated.

-ML

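【Comments】:

Sorry, I forgot to add the error. Error in get(as.character(FUN), mode = "function", envir = envir) : object 'func' of mode 'function' was not found
You can't assign a for loop; you need func <- function(<arguments>) {<body>}. I'd suggest having the function take a single file name as its only argument and then calling the function in a loop, rather than putting a loop inside the function. You'll probably want to change the output path too: adding a file extension would be nice, but most importantly make it change so you don't overwrite the same file every time. I'd also recommend lapply rather than apply. apply is for matrices.
Welcome to SO, Mason. Expanding on @Gregor-reinstateMonica's comment, you can also use an anonymous function within lapply(), as shown in my answer below.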
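
The advice in the comments above could be sketched roughly as follows. This is only an illustration under stated assumptions: process_file, its out_dir and skip arguments, the "_out" suffix, and the synthetic input file are all invented for this example and are not part of the original post.

```r
# Hypothetical helper: process ONE file and write ONE distinct output file.
process_file <- function(file, out_dir, skip = 0) {
  df <- read.table(file, skip = skip, stringsAsFactors = FALSE)
  colnames(df) <- c("year", "day", "solrad", "maxC", "minC", "precipmm")
  df$maxF <- df$maxC * (9/5) + 32
  df$minF <- df$minC * (9/5) + 32
  df$GDD <- pmax((df$maxF + df$minF) / 2 - 50, 0)  # negative values set to 0
  df$GDD.cumulative <- cumsum(df$GDD)
  df$precipmm.cumulative <- cumsum(df$precipmm)
  # Derive the output name from the input name so files never overwrite each other
  out_name <- paste0(tools::file_path_sans_ext(basename(file)), "_out.txt")
  write.table(df, file = file.path(out_dir, out_name), quote = FALSE,
              row.names = FALSE, col.names = TRUE)
  df
}

# Synthetic demo input (made-up values) in temporary folders
in_dir  <- file.path(tempdir(), "demo_in")
out_dir <- file.path(tempdir(), "demo_out")
dir.create(in_dir, showWarnings = FALSE)
dir.create(out_dir, showWarnings = FALSE)
writeLines(c("2020 1 9.0 30 15 0", "2020 2 9.5 5 -5 1.5"),
           file.path(in_dir, "siteA.txt"))

# Call the function once per file with lapply(), not apply()
files <- list.files(in_dir, pattern = "\\.txt$", full.names = TRUE)
results <- lapply(files, process_file, out_dir = out_dir)
names(results) <- basename(files)
```

Keeping the per-file work in a function and the iteration in lapply() means the same function can be tested on a single file before it is run over a whole folder.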

Tags: r function for-loop apply

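【Answer 1】:

Here is an approach that uses base R and lapply() with an anonymous function to download the data, read it into a data frame, add the Fahrenheit conversions and cumulative precipitation, and write the result to output files.

First, we create the list of weather stations whose data we will download.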

# list of 10 stations
stationList <- c("NE3065","NE8745","NE0030","NE0050","NE0130",
                 "NE0245","NE0320","NE0355","NE0375","NE0420")

Here we create two URL fragments: one for the URL content before the station identifier, and one for the URL content after it.

urlFragment1 <- "https://mesonet.agron.iastate.edu/cgi-bin/request/coop.py?network=NECLIMATE&stations="
urlFragment2 <- "&year1=2020&month1=1&day1=1&year2=2020&month2=12&day2=31&vars%5B%5D=gdd_50_86&model=apsim&what=view&delim=comma&gis=no&scenario_year=2019"
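
Because paste0() is vectorized, the same two-fragment idea can build every URL in one call. The sketch below uses shortened placeholder fragments (frag1, frag2, and the example.org address are hypothetical and stand in for the real fragments) purely to show the mechanics.

```r
# Placeholder fragments (hypothetical, shortened for illustration)
frag1 <- "https://example.org/request?stations="
frag2 <- "&year1=2020&month1=1"
stations <- c("NE3065", "NE8745")

# paste0() recycles the single-string fragments across the station vector
urls <- paste0(frag1, stations, frag2)
names(urls) <- stations
print(urls)
```

Naming the vector by station ID makes it easy to look up the URL for a given station later, for example urls[["NE8745"]].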

Next, we create the input and output directories: one to store the downloaded climate input files, and another for the output files.

# create input and output file directories if they do not already exist 
if(!dir.exists("./data")) dir.create("./data")
if(!dir.exists("./data/output")) dir.create("./data/output")

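The lapply() call uses paste0() to insert each station name between the URL fragments we created above, which lets us automate the download and the subsequent operations for each input file.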

stationData <- lapply(stationList,function(x){
     theURL <-paste0(urlFragment1,x,urlFragment2)
     download.file(theURL,
                   paste0("./data/",x,".txt"),method="libcurl")
     df <- read.table(paste0("./data/",x,".txt"), skip=11, stringsAsFactors = 
                           FALSE)
     colnames(df) <- c("year", "day", "solrad", "maxC", 
                       "minC", "precipmm")
     df$year <- as.factor(df$year)
     df$day <- as.factor(df$day)
     df$maxF <- (df$maxC * (9/5) + 32)
     df$minF <- (df$minC * (9/5) + 32)
     df$GDD <- (((df$maxF + df$minF)/2)-50)
     df$GDD[df$GDD <= 0] <- 0
     df$GDD.cumulative <- cumsum(df$GDD)
     df$precipmm.cumulative <- cumsum(df$precipmm)
     df$station <- x
     write.table(df,file=paste0("./data/output/",x,".txt"), quote=FALSE, 
                 row.names=FALSE, col.names=TRUE)
     df
})
# add names to the data frames returned by lapply()
names(stationData) <- stationList
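
To see what the temperature and GDD lines inside the anonymous function compute, here is a small worked check on made-up temperatures. The input values are invented for illustration, and pmax() is used here as an equivalent way of setting negative GDD values to zero.

```r
# Made-up daily maximum and minimum temperatures in degrees Celsius
maxC <- c(30, 5, 20)
minC <- c(15, -5, 10)

maxF <- maxC * (9/5) + 32                  # 86, 41, 68
minF <- minC * (9/5) + 32                  # 59, 23, 50
GDD  <- pmax((maxF + minF) / 2 - 50, 0)    # base 50 F, floored at zero
GDD.cumulative <- cumsum(GDD)              # running total over the days

check <- data.frame(maxC, minC, maxF, minF, GDD, GDD.cumulative)
print(check)
```

Note that this simple average, like the code above it, applies only the 50 F base and no upper temperature cap.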

...and the output, a directory containing one file for each station listed in the stationList object.

Finally, here is the data that was written to the ./data/output/NE3065.txt file.

year day solrad maxC minC precipmm maxF minF GDD GDD.cumulative precipmm.cumulative station
2020 1 8.992 2.2 -5 0 35.96 23 0 0 0 NE3065
2020 2 9.604 5.6 -3.9 0 42.08 24.98 0 0 0 NE3065
2020 3 4.933 5.6 -3.9 0 42.08 24.98 0 0 0 NE3065
2020 4 8.699 3.9 -7.2 0 39.02 19.04 0 0 0 NE3065
2020 5 9.859 6.1 -7.8 0 42.98 17.96 0 0 0 NE3065
2020 6 10.137 7.2 -5 0 44.96 23 0 0 0 NE3065
2020 7 8.754 6.1 -4.4 0 42.98 24.08 0 0 0 NE3065
2020 8 10.121 7.8 -5 0 46.04 23 0 0 0 NE3065
2020 9 9.953 7.2 -5 0 44.96 23 0 0 0 NE3065
2020 10 8.905 7.2 -5 0 44.96 23 0 0 0 NE3065
2020 11 0.416 -3.9 -15.6 2.29 24.98 3.92 0 0 2.29 NE3065
2020 12 10.694 -4.4 -16.1 0 24.08 3.02 0 0 2.29 NE3065
2020 13 1.896 -4.4 -11.1 0.51 24.08 12.02 0 0 2.8 NE3065
2020 14 0.851 0 -7.8 0 32 17.96 0 0 2.8 NE3065
2020 15 11.043 -1.1 -8.9 0 30.02 15.98 0 0 2.8 NE3065
2020 16 10.144 -2.8 -17.2 0 26.96 1.04 0 0 2.8 NE3065
2020 17 10.75 -5.6 -17.2 3.05 21.92 1.04 0 0 5.85 NE3065

Note that the input files contain 11 rows of header data, so the skip= argument in read.table() must be set to 11, not the 10 used in the original post.
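
One way to confirm how many header rows to skip is to look at the first few lines of a file before parsing it. The sketch below uses a small synthetic file with three header lines (the file contents are made up and much shorter than the real downloads) to show the idea.

```r
# Synthetic file: 3 header lines followed by 2 data rows (made-up content)
tmp <- tempfile(fileext = ".txt")
writeLines(c("! header 1", "! header 2", "year day", "2020 1", "2020 2"), tmp)

# Inspect the top of the file to count the header lines by eye
head_lines <- readLines(tmp, n = 5)
print(head_lines)

# Skip exactly the header lines counted above
df <- read.table(tmp, skip = 3)
print(df)
```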

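Enhancing the code

The last line of the anonymous function returns the data frame to the parent environment, so lapply() produces a list of 10 data frames stored in the stationData object. Because we assigned the station name to a column in each data frame, we can combine them into a single data frame for subsequent analysis using do.call() with rbind(), as follows.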

combinedData <- do.call(rbind,stationData)

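Because this code was run on January 17, the resulting data frame contains 170 observations: 17 observations for each of the 10 stations whose data we downloaded.

At this point the data can be analyzed by station, for example by finding the mean daily precipitation for the year to date at each station.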
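
As a minimal illustration of the do.call() and rbind() pattern, here is the same call on a toy list. The toyData object and its values are invented stand-ins for stationData, not real station data.

```r
# Toy stand-in for stationData: a named list of small data frames (made-up values)
toyData <- list(
  A = data.frame(day = 1:2, precipmm = c(0, 1.5), station = "A"),
  B = data.frame(day = 1:2, precipmm = c(0.5, 0), station = "B")
)

# Equivalent to rbind(A = toyData$A, B = toyData$B)
combined <- do.call(rbind, toyData)
print(nrow(combined))       # 4 rows: 2 per list element
print(rownames(combined))   # row names are prefixed with the list names
```

The station column is what keeps the rows distinguishable after stacking, which is why it is added inside the anonymous function before the data frame is returned.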

> aggregate(precipmm ~ station,combinedData,mean)
   station   precipmm
1   NE0030 0.01470588
2   NE0050 0.56764706
3   NE0130 0.32882353
4   NE0245 0.25411765
5   NE0320 0.28411765
6   NE0355 1.49411765
7   NE0375 0.55235294
8   NE0420 0.13411765
9   NE3065 0.34411765
10  NE8745 0.47823529
>
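
The formula interface of aggregate() works with any summary function, not only mean. A minimal sketch on a toy data frame (the toy object and its values are invented for illustration):

```r
# Toy data: two stations with two daily precipitation readings each (made up)
toy <- data.frame(
  station  = c("A", "A", "B", "B"),
  precipmm = c(0, 1.5, 0.5, 0.5)
)

# Summarise precipmm within each level of station
meanPrecip  <- aggregate(precipmm ~ station, toy, mean)
totalPrecip <- aggregate(precipmm ~ station, toy, sum)
print(meanPrecip)    # A: 0.75, B: 0.50
print(totalPrecip)   # A: 1.5,  B: 1.0
```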

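【Comments】:

This is perfect! A very elegant way to split the URL and bring in the subsequent station IDs. This solution saved me time researching how to streamline the procedure. Cheers!
@Mason_Lien - Thanks for the feedback, Mason. As you become more familiar with R, you'll find it is very powerful for automating a wide variety of data processing and analysis tasks.
I'm looking forward to expanding my programming skills! This is the first step in a project I'm assisting with, where local weather data will feed a model that predicts crop growth stages, when to harvest fields, and so on.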

【Answer 2】:

Instead of using base R, you can install the tidyverse library (https://www.tidyverse.org/). With it, you can load the link into a data frame as TSV (tab-separated values) using the read_tsv function.

dataframe<-read_tsv(url("http://some.where.net/"))

Then create a loop in R and do the calculations.

something<-c('link1','link2') #vector in R
for(i in something){
 #make sure to indent with one space
}

Finally, save the data frame to a file using:

write_csv(dataframe, file = "c:\\myname\\yourfile.csv")

【Comments】: