Set cookies with rvest: answers

【Question title】: Set cookies with rvest
【Posted】: 2018-01-22 20:36:30
【Question】:

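I would like to programmatically export the records available on this website. To do this manually, I would navigate to the page, click Export, and choose CSV.

I tried copying the link from the Export button, which works as long as I have the cookie (I believe). Without it, a wget or httr request returns the HTML page instead of the file.

I found some help from an issue on the rvest GitHub repo, but like the person who opened that issue, I could not work out how to use an object to save the cookie and then send it with a request.

Here is where I am so far: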

library(httr)
library(rvest)

apoc <- html_session("https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx")
headers <- headers(apoc)

GET(url = "https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx?exportAll=False&exportFormat=CSV&isExport=True", 
    add_headers(headers)) # how can I take the output from headers in httr and use it as an argument in GET from httr?

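I have checked robots.txt, and this is allowed.

【Comments】:

The questions and answers I have found about saving cookies all use RSelenium, which requires your program to drive a browser. I am interested in learning about other approaches.
I like RSelenium, but I have been reluctant to use it in this case.

Tags: r rvest

【Solution 1】: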

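You can read __VIEWSTATE and __VIEWSTATEGENERATOR from the page returned by a GET of https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx (they are hidden form inputs in the HTML, as the code below shows), then reuse those values in the subsequent POST query and in the GET that downloads the CSV.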

options(stringsAsFactors=FALSE)
library(httr)
library(curl)
library(xml2)

url <- 'https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx'

#get the page and read the hidden ASP.NET state fields from its HTML
req <- GET(url)
req_html <- read_html(rawToChar(req$content))
fields <- c("__VIEWSTATE","__VIEWSTATEGENERATOR")
viewheaders <- lapply(fields, function(x) {
    xml_attr(xml_find_first(req_html, paste0(".//input[@id='",x,"']")), "value")
})
names(viewheaders) <- fields

#post request. you can get the list of form fields using tools like Fiddler
params <- c(viewheaders,
    list(
        "M$ctl19"="M$UpdatePanel|M$C$csfFilter$btnExport",
        "M$C$csfFilter$ddlNameType"="Any",
        "M$C$csfFilter$ddlField"="Elections",
        "M$C$csfFilter$ddlReportYear"="2017",
        "M$C$csfFilter$ddlStatus"="Default",
        "M$C$csfFilter$ddlValue"=-1,
        "M$C$csfFilter$btnExport"="Export"))
resp <- POST(url, body=params, encode="form")
print(resp$status_code)
resptext <- rawToChar(resp$content)
#writeLines(resptext, "apoc.html")

#get response i.e. download csv (httr reuses the cookies it received from this host)
url <- "https://aws.state.ak.us//ApocReports/Registration/CandidateRegistration/CRForms.aspx?exportAll=True&exportFormat=CSV&isExport=True"
req <- GET(url, body=params)
read.csv(text=rawToChar(req$content))

You may need to play with the inputs/code to get exactly the data you want.
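When adjusting the inputs, it can help to list every hidden field the form carries rather than only the two named above. The following is a minimal sketch using xml2 on a made-up HTML snippet; the snippet, its values, and the helper name hidden_fields are assumptions for illustration and are not taken from the real page.

```r
library(xml2)

# Hypothetical stand-in for the HTML returned by GET(url); the real page has more fields.
page <- read_html('
<form>
  <input type="hidden" id="__VIEWSTATE" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" id="__VIEWSTATEGENERATOR" name="__VIEWSTATEGENERATOR" value="DEF456" />
  <input type="text" name="M$C$csfFilter$txtName" value="" />
</form>')

# Collect every hidden <input> as a named list (field name -> field value).
hidden_fields <- function(doc) {
  nodes <- xml_find_all(doc, ".//input[@type='hidden']")
  values <- as.list(xml_attr(nodes, "value"))
  names(values) <- xml_attr(nodes, "name")
  values
}

fields <- hidden_fields(page)
str(fields)
```

The resulting list has the same shape as viewheaders in the answer, so it could be combined with the other form parameters in the same way before the POST.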

Here is another, similar solution that uses RCurl: how-to-login-and-then-download-a-file-from-aspx-web-pages-with-r
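On the original question of passing cookies from one request to another: httr normally keeps cookies between requests to the same host, which is probably why the answer above never sets them by hand. If they do need to be passed explicitly, one possible approach is to convert the table returned by cookies() into a set_cookies() config. This is only a sketch; the helper name cookie_config and the example cookie are hypothetical.

```r
library(httr)

# Build a request config from a cookie table that has `name` and `value`
# columns, which is the shape httr::cookies() returns for a response.
cookie_config <- function(cookie_df) {
  set_cookies(.cookies = setNames(cookie_df$value, cookie_df$name))
}

# Hypothetical cookie, standing in for cookies(first_response).
example <- data.frame(name = "ASP.NET_SessionId", value = "xyz",
                      stringsAsFactors = FALSE)
cfg <- cookie_config(example)

# Intended use (not run here because it needs network access):
# first_response <- GET(url)
# GET(export_url, cookie_config(cookies(first_response)))
```

The config object is passed to GET or POST like any other httr config, for example alongside add_headers().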

【Comments】: