【Question title】: In download.file ... cannot open URL ... HTTP status was '404 Not Found'
【Posted】: 2018-03-15 00:27:09
【Question】:

Thanks to StackOverflow, I was able to download a series of photos from a public website with the following code.

urls <- c("https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=A12/0090/13", 
"https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=A12/0089/13", 
"https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=A12/0088/13", 
"https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=A12/0087/13", 
"https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=A12/0086/13"
)

for (url in 1:length(urls)) {

  print(url)
  webpage <- html_session(urls[url])
  link.titles <- webpage %>% html_nodes("img")
  img.url <- link.titles %>% html_attr("src")

  for(j in 1:length(img.url)){

    download.file(img.url[j], paste0(url,'.',j,".jpg"), mode = "wb")
  }

}

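However, some of the links do not contain photos, so an HTTP status error is returned and the download process stops.

So I would like to insert an if statement that tells R to ignore/skip the pages that contain no photos or that return a "404 Not Found" error. The problem is that I do not know which function or command would identify a page with no images or a "404 Not Found" error. Any suggestions would be appreciated.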

【Comments】:

Tags: r


【Answer 1】:
library(purrr)
library(rvest)
library(httr)

urls <- c(
  "https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=A12/0090/13", 
  "https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=A12/0089/13", 
  "https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=A12/0088/13", 
  "https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=A12/0087/13", 
  "https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=A12/0086/13"
)

sGET <- safely(GET)                                           # make a "safe" version of httr::GET

map(urls, read_html) %>%                                      # read each page
  map(html_nodes, "img") %>%                                  # extract img tags
  flatten() %>%                                               # convert to a simple list
  map_chr(html_attr, "src") %>%                               # extract the URL
  walk(~{                                                     # for each URL
    res <- sGET(.x)                                           # try to retrieve it
    if (!is.null(res$result)) {                               # if there were no fatal errors
      if (status_code(res$result) == 200) {                   # and, if found
        writeBin(content(res$result, as="raw"), basename(.x)) # save it to disk
      }
    }
  })

This is an alternative, functional, "safe" way to do it.
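To make the "safe" part concrete, here is a minimal offline sketch of what purrr::safely() returns. It wraps log() instead of httr::GET() so that it runs without a network connection; safe_log is an illustrative name, not part of the answer.

```r
library(purrr)

# safely() wraps a function so that errors are captured instead of thrown.
# The wrapped function always returns a list with two elements: result and error.
safe_log <- safely(log)

ok  <- safe_log(1)                # succeeds: result is set, error is NULL
bad <- safe_log("not a number")   # fails: result is NULL, error holds the condition

ok$result                         # value of log(1)
is.null(ok$error)                 # TRUE
is.null(bad$result)               # TRUE
conditionMessage(bad$error)       # the captured error message
```

This is why the answer checks !is.null(res$result) before calling status_code(): a request that failed outright (for example, a connection error) has no response object to inspect.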

【Comments】:

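  • Thanks for sharing! It does not work for a "larger set". However, I will definitely keep it as a reference :)
  • What exactly does not work for the "larger" set? This is a pretty robust solution, so I am curious what the specific error condition is.
  • Yes, of course, it is a very good solution, and thank you for your time and effort. However, it does not work for the list I am using, which contains about 10,000 URLs. I am re-running your script, but it takes a while, so I will post the error once it finishes.
  • With that many images to download, it is better to switch idioms. One that tests for existence rather than using curl::curl_fetch_multi()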
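The last comment mentions curl::curl_fetch_multi() without showing it. For reference, a minimal sketch of how that function is commonly driven for a large batch follows. The helper names save_if_ok and fetch_images are hypothetical, and the sketch assumes that each image URL ends in a usable file name; it is not the commenter's code.

```r
library(curl)

# Write one response to disk only if the server answered 200 OK.
# `res` has the shape curl passes to the `done` callback:
# a list with (among others) url, status_code and content (raw vector).
save_if_ok <- function(res, dest_dir) {
  if (res$status_code != 200) return(FALSE)   # e.g. skip 404 Not Found
  writeBin(res$content, file.path(dest_dir, basename(res$url)))
  TRUE
}

# Hypothetical driver: queue every URL on one pool, then run all transfers.
fetch_images <- function(img_urls, dest_dir = tempdir()) {
  pool  <- new_pool()
  saved <- 0L
  for (u in img_urls) {
    curl_fetch_multi(
      u,
      done = function(res) if (save_if_ok(res, dest_dir)) saved <<- saved + 1L,
      fail = function(msg) message("Request failed: ", msg),  # connection-level errors
      pool = pool
    )
  }
  multi_run(pool = pool)   # blocks until every queued request has finished
  saved                    # number of files written
}
```

Calling fetch_images(img_urls, "photos") would then return the number of images actually saved, with missing images skipped rather than stopping the run.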
【Answer 2】:

Just use the try() function:

urls <- c("https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=A12/0090/13", 
          "https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=A12/0089/13", 
          "https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=A12/0088/13", 
          "https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=A12/0087/13", 
          "https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=A12/0086/13"
)

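library(rvest)

for (url in seq_along(urls)) {

  print(url)
  webpage <- html_session(urls[url])    # renamed session() in rvest >= 1.0
  link.titles <- webpage %>% html_nodes("img")
  img.url <- link.titles %>% html_attr("src")

  for (j in seq_along(img.url)) {       # seq_along() skips pages with no img tags
    # try() catches the download error (e.g. a 404) so the loop keeps going
    try(download.file(img.url[j], paste0(url, '.', j, ".jpg"), mode = "wb"),
        silent = TRUE)
  }

}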
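To see what try() does here without touching the network, the following sketch replaces download.file() with a stand-in that fails for one "URL". fake_download and the file names are made up for illustration.

```r
# Stand-in for download.file(): fails for one "URL", succeeds for the rest.
fake_download <- function(u) {
  if (grepl("missing", u)) stop("cannot open URL '", u, "'")
  0L  # download.file() returns 0 on success
}

img_urls <- c("a.jpg", "missing.jpg", "c.jpg")

# try() turns an error into an object of class "try-error" instead of
# stopping, so every element is attempted.
outcome <- vapply(img_urls, function(u) {
  res <- try(fake_download(u), silent = TRUE)
  !inherits(res, "try-error")
}, logical(1))

outcome   # named logical vector: TRUE where the "download" succeeded
```

The failing element does not interrupt the others, which is the behavior needed when some image links return 404.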

In addition, you can add an "if" condition:

for (url in seq_along(urls)) {

  print(url)
  webpage <- html_session(urls[url])    # renamed session() in rvest >= 1.0
  link.titles <- webpage %>% html_nodes("img")
  img.url <- link.titles %>% html_attr("src")

  for (j in seq_along(img.url)) {       # seq_along() skips pages with no img tags

    try_download <- try(
      download.file(img.url[j], paste0(url, '.', j, ".jpg"), mode = "wb"),
      silent = TRUE)

    # try() returns an object of class "try-error" when the call failed
    if (inherits(try_download, "try-error")) {
      print(paste0("ERROR: ", img.url[j]))
    } else {
      print(paste0("Downloaded: ", img.url[j]))
    }

  }

}
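The question also asks how to recognize a page that contains no images at all. A small offline sketch, using inline HTML in place of the real pages (the example.org URL and the img_srcs helper are made up for illustration):

```r
library(rvest)

# Two tiny in-memory pages standing in for the real ones.
page_with_img <- read_html('<html><body><img src="https://example.org/a.jpg"></body></html>')
page_without  <- read_html("<html><body><p>No photos here</p></body></html>")

# Hypothetical helper: the src attribute of every img tag on a page.
img_srcs <- function(page) html_attr(html_nodes(page, "img"), "src")

with_img <- img_srcs(page_with_img)
no_img   <- img_srcs(page_without)

length(no_img) == 0          # TRUE: this page has nothing to download

# Why the loop index matters for an empty result:
1:length(no_img)             # counts down from 1 to 0, so the loop body runs twice
seq_along(no_img)            # integer(0), so the loop body never runs
```

So a page without photos can be detected with length(img.url) == 0, and looping with seq_along(img.url) skips such a page without any extra if statement.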

【Comments】:
