【Question title】: RCurl getURL with loop - link to a PDF kills looping
【Posted】: 2014-10-17 10:17:54
【Question】:

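I have been puzzling over this for a long time and can't figure out how to fix it. The easiest thing is to provide working dummy code: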

library(RCurl)
library(XML)

#set a bunch of options for curl
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
agent="Firefox/23.0" 
curl = getCurlHandle()
curlSetOpt(
  cookiejar = 'cookies.txt' ,
  useragent = agent,
  followlocation = TRUE ,
  autoreferer = TRUE ,
  httpauth = 1L, # "basic" http authorization version -- this seems to make a difference for India servers
  curl = curl
)


list1 <- c('http://timesofindia.indiatimes.com//articleshow/2933112.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933131.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933209.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933277.cms')

#note list2 has a new link inserted in 2nd position; this is the link that kills the following getURL calls
list2 <- c('http://timesofindia.indiatimes.com//articleshow/2933112.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933019.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933131.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933209.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933277.cms')



# Fetch each URL in list1 with the shared curl handle and report the outcome
for (i in seq_along(list1)) {
  print(list1[i])
  html <- try(
    getURL(
      list1[i],
      maxredirs = 20L,
      followlocation = TRUE,
      curl = curl
    ),
    silent = TRUE
  )
  if (inherits(html, "try-error")) {
    print(paste("error accessing", list1[i]))
    rm(html)
    gc()
    next
  } else {
    print('success')
  }
}


gc()

# Same loop over list2, which contains the PDF link in 2nd position
for (i in seq_along(list2)) {
  print(list2[i])
  html <- try(
    getURL(
      list2[i],
      maxredirs = 20L,
      followlocation = TRUE,
      curl = curl
    ),
    silent = TRUE
  )
  if (inherits(html, "try-error")) {
    print(paste("error accessing", list2[i]))
    rm(html)
    gc()
    next
  } else {
    print('success')
  }
}

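This should run as long as the RCurl and XML packages are installed. The point is that when I insert http://timesofindia.indiatimes.com//articleshow/2933019.cms into the second position of the list, every later request in the loop fails, even though the other links are unchanged. This happens whenever a link contains a PDF (open it to see), consistently, in this case and in others.

Any ideas on how to fix this so that fetching a link that contains a PDF doesn't kill my loop? As you can see, I have tried removing potentially problematic objects, calling gc() all over the place, and so on, but I can't figure out why a PDF kills my loop.

Thanks!

Just to check, here is the output of my two for loops: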

    #[1] "http://timesofindia.indiatimes.com//articleshow/2933112.cms"
    #[1] "success"
    #[1] "http://timesofindia.indiatimes.com//articleshow/2933131.cms"
    #[1] "success"
    #[1] "http://timesofindia.indiatimes.com//articleshow/2933209.cms"
    #[1] "success"
    #[1] "http://timesofindia.indiatimes.com//articleshow/2933277.cms"
    #[1] "success"

    #[1] "http://timesofindia.indiatimes.com//articleshow/2933112.cms"
    #[1] "success"
    #[1] "http://timesofindia.indiatimes.com//articleshow/2933019.cms"
    #[1] "error accessing http://timesofindia.indiatimes.com//articleshow/2933019.cms"
    #[1] "http://timesofindia.indiatimes.com//articleshow/2933131.cms"
    #[1] "error accessing http://timesofindia.indiatimes.com//articleshow/2933131.cms"
    #[1] "http://timesofindia.indiatimes.com//articleshow/2933209.cms"
    #[1] "error accessing http://timesofindia.indiatimes.com//articleshow/2933209.cms"
    #[1] "http://timesofindia.indiatimes.com//articleshow/2933277.cms"
    #[1] "error accessing http://timesofindia.indiatimes.com//articleshow/2933277.cms"

【Comments】:

    Tags: r for-loop next rcurl geturl


    【Answer 1】:

    You may find it easier to use httr. It wraps RCurl and sets the options you need by default. Here is the equivalent code with httr:

    library(httr)
    
    urls <- c(
      'http://timesofindia.indiatimes.com//articleshow/2933112.cms',
      'http://timesofindia.indiatimes.com//articleshow/2933019.cms',
      'http://timesofindia.indiatimes.com//articleshow/2933131.cms',
      'http://timesofindia.indiatimes.com//articleshow/2933209.cms',
      'http://timesofindia.indiatimes.com//articleshow/2933277.cms'
    )
    
    # Issue a GET request for every URL and keep the full response objects
    responses <- lapply(urls, GET)
    # HTTP status of each response
    sapply(responses, http_status)
    
    # Content-Type header of each response, which identifies the document type
    sapply(responses, function(x) headers(x)$`content-type`)
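
Building on the Content-Type lookup, the following is a minimal sketch of how the PDF responses might be filtered out before the page text is extracted. The helper names `is_pdf_type` and `fetch_non_pdf` are hypothetical and not part of httr, and the sketch assumes the server labels PDFs with an `application/pdf` Content-Type, which has not been verified for these URLs.

```r
# Hypothetical helper: TRUE when a Content-Type header value denotes a PDF.
# Missing headers (NULL, empty, NA) are treated as "not a PDF".
is_pdf_type <- function(content_type) {
  if (is.null(content_type) || length(content_type) == 0 || is.na(content_type[1])) {
    return(FALSE)
  }
  # Compare case-insensitively and ignore parameters such as "; charset=..."
  grepl("^application/pdf", tolower(trimws(content_type[1])))
}

# Hypothetical wrapper: request every URL with httr and return the body text
# of the responses that are not PDFs, named by URL.
# Assumes the httr package is installed; nothing is requested until it is called.
fetch_non_pdf <- function(urls) {
  responses <- lapply(urls, httr::GET)
  is_pdf <- vapply(
    responses,
    function(r) is_pdf_type(httr::headers(r)$`content-type`),
    logical(1)
  )
  bodies <- lapply(responses[!is_pdf], httr::content, as = "text")
  names(bodies) <- urls[!is_pdf]
  bodies
}

# Usage (performs network requests, so it is left commented out):
# pages <- fetch_non_pdf(urls)
# names(pages)
```

Keeping the header check in a separate function means the filtering rule can be tried on plain strings without touching the network.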
    

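    【Comments】:

    • Thank you; httr is good to know about. Your answer also showed me that it is possible to determine the type of document behind a URL, and I am now using getURLContent() to skip PDFs.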
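
The comment does not show how getURLContent() was used, so the following is only a sketch of one way it could be done. It relies on getURLContent() returning a raw vector for binary content and a character string for text, as described in the RCurl documentation. The helper names `is_text_result` and `fetch_text_or_skip` are hypothetical, and calling without a shared `curl` handle (so that each request gets a fresh one) is an assumption about what keeps one failed transfer from affecting the next, not something confirmed in the thread.

```r
# Hypothetical helper: TRUE only for text results (character vectors).
# Raw vectors (binary content such as a PDF) and NULL (a failed request)
# both count as "skip".
is_text_result <- function(res) {
  is.character(res)
}

# Hypothetical wrapper around RCurl::getURLContent().
# Returns the page text, or NULL when the request fails or the content is binary.
# Assumes the RCurl package is installed; no shared handle is passed,
# so every call uses its own curl handle.
fetch_text_or_skip <- function(url) {
  res <- tryCatch(
    RCurl::getURLContent(url, followlocation = TRUE, maxredirs = 20L),
    error = function(e) NULL
  )
  if (is_text_result(res)) res else NULL
}

# Usage (performs network requests, so it is left commented out):
# pages <- lapply(list2, fetch_text_or_skip)
# skipped <- list2[vapply(pages, is.null, logical(1))]
```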