【Posted】: 2014-07-13 14:16:04
【Question】:
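I am trying to scrape data from a password-protected website in R. From what I have read, the httr and RCurl packages seem to be the best options for scraping with password authentication (I have also looked into the XML package).
The website I am trying to scrape is the following (you need a free account to access the full page): http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2
Here are my two attempts (replace "username" with my username and "password" with my password):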
#This returns "Status: 200" without the data from the page:
library(httr)
GET("http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2", authenticate("username", "password"))
#This returns the non-password protected preview (i.e., not the full page):
library(XML)
library(RCurl)
readHTMLTable(getURL("http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2", userpwd = "username:password"))
I have looked at other related posts (links below), but I can't figure out how to apply their answers to my case.
How to use R to download a zipped file from a SSL page that requires cookies
How to webscrape secured pages in R (https links) (using readHTMLTable from XML package)?
Reading information from a password protected site
R - RCurl scrape data from a password-protected site
http://www.inside-r.org/questions/how-scrape-data-password-protected-https-website-using-r-hold
【Comments】:
Tags: xml r web-scraping rcurl httr