【Question title】: RCurl not working in download of URL content
【Posted】: 2014-12-15 21:43:44
【Question description】:

I'm unable to download the page. This is the error I get:

Error in which(value == defs) : 
  argument "code" is missing, with no default

Here is my code:

require(RCurl)
require(XML)

ok <- "http://www.okcupid.com/match?filter1=0,34&filter2=2,22,40&filter3=3,5&filter4=5,3600&filter5=9,486&filter6=1,1&locid=4265540&lquery=San%20Francisco,%20California&timekey=1&matchOrderBy=MATCH&custom_search=0&fromWhoOnline=0&mygender=m&update_prefs=1&sort_type=0&sa=1&using_saved_search=&count=50"

okc <- getURL(ok, encoding="UTF-8") #Download the page
okcHTML <- htmlParse(okc, asText = TRUE, encoding = "utf-8")

【Question comments】:

    Tags: xml r rcurl rvest


    【Answer 1】:

    If you're willing to live on the bleeding edge of the Hadleyverse, rvest handles this nicely:

    library(rvest)
    
    ok_search <- "https://www.okcupid.com/match?filter1=0,34&filter2=2,22,40&filter3=3,5&filter4=5,3600&filter5=9,486&filter6=1,1&locid=4265540&lquery=San%20Francisco,%20California&timekey=1&matchOrderBy=MATCH&custom_search=0&fromWhoOnline=0&mygender=m&update_prefs=1&sort_type=0&sa=1&using_saved_search=&count=50"
    
    pg <- html_session(ok_search)
    pg %>% html_nodes("div.profile_info") %>% html_text()
    
    ##  [1] "  phenombom   32·San Francisco, CA  "        "  sylvea   24·San Francisco, CA  "          
    ##  [3] "  haafu   40·San Francisco, CA  "            "  Rebamania   31·San Francisco, CA  "       
    ##  [5] "  Brilikedacheese   26·San Francisco, CA  "  "  cloudhunteress   23·San Francisco, CA  "  
    ##  [7] "  Lizzieisdizzy   28·San Francisco, CA  "    "  liddybird80   34·San Francisco, CA  "     
    ##  [9] "  wander_found   32·San Francisco, CA  "     "  Crunchyisinabox   31·San Francisco, CA  " 
    ...
    

    I'll look into why using RCurl directly (rvest wraps RCurl) doesn't work.

    UPDATE

    Going one level deeper and using httr (another RCurl abstraction):

    library(httr)
    library(XML)
    
    res <- GET(ok_search)
    ok_html <- content(res, as="parsed")
    xpathSApply(ok_html, "//div[@class='profile_info']", xmlValue)
    

    This returns the same results as above, so it works as well.

    UPDATE / SOLVED

    library(RCurl)
    library(XML)
    
    okc <- getURL(ok,  followlocation=TRUE)
    ok_html <- htmlParse(okc)
    xpathSApply(ok_html , "//div[@class='profile_info']", xmlValue)
    

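    You need to add followlocation=TRUE. The original URL results in a 302 response (the server is sending a redirect), which RCurl does not follow by default, whereas httr and rvest appear to set that option by default.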
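    One way to confirm the redirect before changing any options is to look at the status code and the Location header. The sketch below is only an illustration: redirect_target() and check_redirect() are hypothetical helper names, not part of RCurl, and check_redirect() assumes RCurl is installed and uses basicHeaderGatherer() to collect the response headers. The sample header values are made up for the offline example.

```r
# Hypothetical helper: given a named character vector of response headers
# (including a "status" entry, the shape that RCurl's
# basicHeaderGatherer()$value() returns), report where a redirect points.
redirect_target <- function(headers) {
  status <- as.integer(headers[["status"]])
  is_redirect <- status %in% c(301L, 302L, 303L, 307L, 308L)
  # Header names are case-insensitive, so match "location" loosely.
  idx <- which(tolower(names(headers)) == "location")
  if (is_redirect && length(idx) > 0) headers[[idx[1]]] else NA_character_
}

# Hypothetical wrapper: fetch only what the server sends for `url`,
# WITHOUT following redirects, and return the redirect target (or NA).
# Assumes the RCurl package is installed; it is not called in this block.
check_redirect <- function(url) {
  h <- RCurl::basicHeaderGatherer()
  invisible(RCurl::getURL(url, headerfunction = h$update))
  redirect_target(h$value())
}

# Offline illustration with made-up header values.
sample_headers <- c(status = "302", Location = "https://example.com/new")
redirect_target(sample_headers)
```

    Calling check_redirect(ok) with the URL from the question would be expected to return the https:// address if the server still answers with a redirect; a non-NA result is a hint that followlocation=TRUE is needed.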

    You can use the verbose=TRUE parameter on getURL to see the response as console messages:

    ## * Adding handle: conn: 0x114ade000
    ## * Adding handle: send: 0
    ## * Adding handle: recv: 0
    ## * Curl_addHandleToPipeline: length: 1
    ## * - Conn 12 (0x114ade000) send_pipe: 1, recv_pipe: 0
    ## * About to connect() to www.okcupid.com port 80 (#12)
    ## *   Trying 198.41.209.131...
    ## * Connected to www.okcupid.com (198.41.209.131) port 80 (#12)
    ## > GET /match?filter1=0,34&filter2=2,22,40&filter3=3,5&filter4=5,3600&filter5=9,486&filter6=1,1&locid=4265540&lquery=San%20Francisco,%20California&timekey=1&matchOrderBy=MATCH&custom_search=0&fromWhoOnline=0&mygender=m&update_prefs=1&sort_type=0&sa=1&using_saved_search=&count=50 HTTP/1.1
    ## User-Agent: curl/7.30.0 Rcurl/1.95.4.3
    ## Host: www.okcupid.com
    ## Accept: */*
    ## 
    ## < HTTP/1.1 302
    ## < Date: Mon, 20 Oct 2014 20:07:12 GMT
    ## < Content-Type: application/octet-stream
    ## < Transfer-Encoding: chunked
    ## < Connection: keep-alive
    ## < Set-Cookie: __cfduid=d0d55f2c9c990d97b0d02dba7148881741413835631999; expires=Mon, 23-Dec-2019 23:50:00 GMT; path=/; domain=.okcupid.com; HttpOnly
    ## < X-OKWS-Version: OKWS/3.1.30.2
    ## < Location: https://www.okcupid.com/match?filter1=0,34&filter2=2,22,40&filter3=3,5&filter4=5,3600&filter5=9,486&filter6=1,1&locid=4265540&lquery=San%20Francisco,%20California&timekey=1&matchOrderBy=MATCH&custom_search=0&fromWhoOnline=0&mygender=m&update_prefs=1&sort_type=0&sa=1&using_saved_search=&count=50
    ## < P3P: CP="NOI CURa ADMa DEVa TAIa OUR BUS IND UNI COM NAV INT", policyref="http://www.okcupid.com/w3c/p3p.xml"
    ## < X-XSS-Protection: 1; mode=block
    ## < Set-Cookie: guest=10834912674894888479; Expires=Tue, 20 Oct 2015 20:07:12 GMT; Path=/; Domain=okcupid.com; HttpOnly
    ## * Server cloudflare-nginx is not blacklisted
    ## < Server: cloudflare-nginx
    ## < CF-RAY: 17c7d71bf1880412-EWR
    ## < 
    ## * Connection #12 to host www.okcupid.com left intact
    

    This is very useful when debugging issues like this. You can also use the verbose() parameter with the httr and rvest URL retrieval functions.
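    As a rough sketch of the httr variant, the helper below passes verbose() to GET() and optionally switches redirect following off so the 302 itself becomes visible. trace_get() is a hypothetical name introduced here for illustration, and the block assumes httr is installed; it only defines the function and makes no network request on its own.

```r
# Hypothetical helper (not part of httr): fetch a URL while printing the
# HTTP conversation to the console, optionally without following redirects.
trace_get <- function(url, follow = TRUE) {
  res <- httr::GET(
    url,
    httr::verbose(),  # log request and response headers
    httr::config(followlocation = if (follow) 1L else 0L)
  )
  # Return the final status code and the URL that was actually served.
  list(status = httr::status_code(res), url = res$url)
}
```

    For example, trace_get(ok_search, follow = FALSE) should report the redirect status rather than the final page, provided the server still responds the way it did in the trace above.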

    【Comments】:

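    • I get the following error when I try this code: > okc
    • That's a sign of outdated certificates on your system. You can pass ssl.verifypeer = FALSE to getURL or getURLContent, but that will make @Hadley kill another kitten.
    • The RCurl FAQ is a good one to bookmark for web scraping.
    • I really prefer css selectors: html_nodes("div.profile_info") - less typing!
    • Totally agree, @hadley, but Austin had the XML library in the OP and I was just trying to solve his RCurl issue without adding that to the mix (I only included @ 987654348@ to show that it avoids the second problem :-)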