Error when downloading multiple PDF files from a list of URLs in R: answers

【Question title】: Error when downloading multiple pdf files from list of urls in R
【Posted】: 2021-02-08 02:06:52
【Question description】:

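I have a list of URLs and am trying to download the PDFs they link to with lapply. Even though a download bar pops up, I get the following message and the files are not downloaded: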

only first element of 'destfile' argument used
trying URL 'https://reliefweb.int/sites/reliefweb.int/files/resources/hno_car_2021_final_fr.pdf'
Content type 'application/pdf' length 22087482 bytes (21.1 MB)
downloaded 21.1 MB

names<- lapply(pdf, basename) # get names
destination<- paste0 ("~/", names)
lapply(pdf,download.file, destfile=destination)

pdf
[[1]]
[1] "https://reliefweb.int/sites/reliefweb.int/files/resources/hno_car_2021_final_fr.pdf"

[[2]]
[1] "https://reliefweb.int/sites/reliefweb.int/files/resources/rnro_centralsahel_oct_2020_fr_web.pdf"

[[3]]
[1] "https://reliefweb.int/sites/reliefweb.int/files/resources/rnro_centralsahel_oct_2020_en_web.pdf"

[[4]]
[1] "https://reliefweb.int/sites/reliefweb.int/files/resources/hno_2020-final.pdf"

[[5]]
[1] "https://reliefweb.int/sites/reliefweb.int/files/resources/hno_light_2020-en_final_0.pdf"

[[6]]
[1] "https://reliefweb.int/sites/reliefweb.int/files/resources/20200701_HNO_CENTROAMERICA%20ADDENDUM%20ING.pdf"

[[7]]
[1] "https://reliefweb.int/sites/reliefweb.int/files/resources/20200706%20ONEPAGER%20HNO%20Centroame%CC%81rica%20ING.pdf"

[[8]]
[1] "https://reliefweb.int/sites/reliefweb.int/files/resources/cmr_hno_2020-revised_print.pdf"

[[9]]
[1] "https://reliefweb.int/sites/reliefweb.int/files/resources/20200616_HNO_CENTROAMERICA%20ADDENDUM.pdf"

【Question discussion】:

Tags: r pdf web-scraping downloadfile

【Solution 1】:

I would handle everything in a single step. See the code example below for the first 2 files.

urls = c("https://reliefweb.int/sites/reliefweb.int/files/resources/hno_car_2021_final_fr.pdf",
         "https://reliefweb.int/sites/reliefweb.int/files/resources/rnro_centralsahel_oct_2020_fr_web.pdf"
)


# build each filename and download the file in the same step
downloaded = lapply(urls, function(url){
  # extract the last part of the url to make the filename
  destination = unlist(strsplit(url, '/'))
  destination = destination[length(destination)]
  # "~/" saves into the home directory
  destination = paste0("~/", destination)
  # download the file; mode="wb" writes it as binary so the PDF is not corrupted on Windows
  download.file(url = url, destfile=destination, mode="wb")
  return(destination) # This is optional, just to see where the files are saved
})

# downloaded

# [[1]]
# [1] "~/hno_car_2021_final_fr.pdf"

# [[2]]
# [1] "~/rnro_centralsahel_oct_2020_fr_web.pdf"
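
For context on the message in the question: `lapply(pdf, download.file, destfile=destination)` hands the whole `destination` vector to every call, so each download is written to the first destination only. The sketch below pairs each URL with its own destination using `Map`. The helper name `download_each` and its `dest_dir` and `downloader` arguments are illustrative assumptions, not part of the original answer; `downloader` exists only so the function can be tried without a network connection.

```r
# Hypothetical helper: download every URL to its own file inside dest_dir.
download_each <- function(urls, dest_dir = path.expand("~"),
                          downloader = download.file) {
  urls <- unlist(urls)                          # accept a list or a character vector
  destinations <- file.path(dest_dir, basename(urls))
  # Map walks urls and destinations in parallel, one pair per call
  Map(function(u, d) downloader(u, destfile = d, mode = "wb"),
      urls, destinations)
  invisible(destinations)                       # paths the files were written to
}
```

Calling `download_each(pdf)` with the list from the question should then save one file per URL, under the assumption that every link is still reachable.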

【Discussion】:

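Thank you! I managed to download the files, but any idea why it doesn't save them in the working directory? Is there anything I can do to have them saved there?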
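Another problem I ran into is that the code stops every time a link fails (in my actual data I have 230 links, some of which may no longer be valid). I removed the broken links by subsetting the initial list, but is there a way to automate that as part of the function?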
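If you want to save the files to the working directory, you need to specify that explicitly: destination = paste0(getwd(), '/', destination). Note that ~ (in paste0("~/", destination)) refers to the home directory, not the working directory.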
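
As a small illustration of the difference described above, the following sketch builds both paths for one of the URLs from the question. The variable names `in_home` and `in_wd` are made up for this example; `file.path` is used in place of `paste0` only because it inserts the separator itself.

```r
url <- "https://reliefweb.int/sites/reliefweb.int/files/resources/hno_car_2021_final_fr.pdf"

# "~" expands to the user's home directory
in_home <- file.path(path.expand("~"), basename(url))

# getwd() returns the current working directory
in_wd <- file.path(getwd(), basename(url))

print(in_home)
print(in_wd)
```

Passing `in_wd` as `destfile` to `download.file` should place the PDF in the working directory.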
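For the files that fail to download, see stackoverflow.com/questions/50624864/…. You may also be interested in the download_retry function from the recount package (rdrr.io/bioc/recount/man/download_retry.html), which lets you retry failed downloads, since a task can fail for various reasons (for example, internet connection problems).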
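
One way to keep the loop going past dead links, sketched here with base R's `tryCatch` only (this is not the `download_retry` function mentioned above, and it does not retry). The helper names `safe_download` and `download_all_safely`, and the `downloader` argument, are assumptions made for this example. Warnings are treated as failures too, which may be stricter than needed.

```r
# Hypothetical helper: attempt one download, return TRUE/FALSE instead of stopping.
safe_download <- function(url, destfile, downloader = download.file) {
  on_failure <- function(cond) {
    message("Failed: ", url, " (", conditionMessage(cond), ")")
    unlink(destfile)            # remove any partially written file
    FALSE
  }
  tryCatch({
    status <- downloader(url, destfile = destfile, mode = "wb")
    isTRUE(status == 0)         # download.file() returns 0 on success
  }, error = on_failure, warning = on_failure)
}

# Hypothetical helper: try every URL and report which ones succeeded.
download_all_safely <- function(urls, dest_dir = getwd(),
                                downloader = download.file) {
  urls <- unlist(urls)
  vapply(urls, function(u) {
    safe_download(u, file.path(dest_dir, basename(u)), downloader)
  }, logical(1))
}
```

With `ok <- download_all_safely(pdf)`, the links that failed would be `names(ok)[!ok]`, so the initial list no longer needs to be subset by hand.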