【发布时间】:2014-03-11 04:27:45
【问题描述】:
SourceForge Research Data Archive (SRDA) 是我论文研究的数据源之一。我在调试以下与 SRDA 数据收集相关的问题时遇到了困难。
从 SRDA 收集数据需要身份验证,然后通过 SQL 查询提交 Web 表单。成功处理查询后,系统会生成一个带有查询结果的文本文件。在为 SRDA 数据收集测试我的 R 代码时,我更改了 SQL 请求以确保正在重新生成结果文件。但是,我发现文件内容保持不变(对应于先前的查询)。我认为文件内容缺乏刷新可能是由于身份验证或查询表单提交失败。以下是代码(https://github.com/abnova/diss-floss/blob/master/import/getSourceForgeData.R)的调试输出:
make importSourceForge
Rscript --no-save --no-restore --verbose getSourceForgeData.R
running
'/usr/lib/R/bin/R --slave --no-restore --no-save --no-restore --file=getSourceForgeData.R'
Loading required package: RCurl
Loading required package: methods
Loading required package: bitops
Loading required package: digest
Retrieving SourceForge data...
Checking request "SELECT *
FROM sf1104.users a, sf1104.artifact b
WHERE a.user_id = b.submitted_by AND b.artifact_id = 304727"...
* About to connect() to zerlot.cse.nd.edu port 80 (#0)
* Trying 129.74.152.47... * connected
> POST /mediawiki/index.php?title=Special:Userlogin&action=submitlogin&type=login HTTP/1.1
Host: zerlot.cse.nd.edu
Accept: */*
Content-Length: 37
Content-Type: application/x-www-form-urlencoded
* upload completely sent off: 37out of 37 bytes
< HTTP/1.1 200 OK
< Date: Tue, 11 Mar 2014 03:49:04 GMT
< Server: Apache/2.2.8 (Ubuntu) PHP/5.2.4-2ubuntu5.25 with Suhosin-Patch
< X-Powered-By: PHP/5.2.4-2ubuntu5.25
* Added cookie wiki_db_session="c61...a3c" for domain zerlot.cse.nd.edu, path /, expire 0
< Set-Cookie: wiki_db_session=c61...a3c; path=/
< Content-language: en
< Vary: Accept-Encoding,Cookie
< Expires: Thu, 01 Jan 1970 00:00:00 GMT
< Cache-Control: private, must-revalidate, max-age=0
< Transfer-Encoding: chunked
< Content-Type: text/html; charset=UTF-8
<
* Connection #0 to host zerlot.cse.nd.edu left intact
[1] "Before second postForm()"
* Re-using existing connection! (#0) with host zerlot.cse.nd.edu
* Connected to zerlot.cse.nd.edu (129.74.152.47) port 80 (#0)
> POST /cgi-bin/form.pl HTTP/1.1
Host: zerlot.cse.nd.edu
Accept: */*
Cookie: wiki_db_session=c61...a3c
Content-Length: 129
Content-Type: application/x-www-form-urlencoded
* upload completely sent off: 129out of 129 bytes
< HTTP/1.1 500 Internal Server Error
< Date: Tue, 11 Mar 2014 03:49:04 GMT
< Server: Apache/2.2.8 (Ubuntu) PHP/5.2.4-2ubuntu5.25 with Suhosin-Patch
< Vary: Accept-Encoding
< Connection: close
< Transfer-Encoding: chunked
< Content-Type: text/html
<
* Closing connection #0
Error: Internal Server Error
Execution halted
make: *** [importSourceForge] Error 1
我尝试使用调试输出以及来自 Firefox 嵌入式开发者工具的网络协议分析器来解决这个问题,但到目前为止还没有取得多大成功。非常感谢任何建议和帮助。
更新:
if (!require(RCurl)) install.packages('RCurl')
if (!require(digest)) install.packages('digest')
library(RCurl)
library(digest)
# Users must authenticate to access Query Form
SRDA_HOST_URL <- "http://zerlot.cse.nd.edu"
SRDA_LOGIN_URL <- "/mediawiki/index.php?title=Special:Userlogin"
SRDA_LOGIN_REQ <- "&action=submitlogin&type=login"
# SRDA URL that Query Form sends POST requests to
SRDA_QUERY_URL <- "/cgi-bin/form.pl"
# SRDA URL that Query Form sends POST requests to
SRDA_QRESULT_URL <- "/qresult/blekh/blekh.txt"
# Parameters for result's format
DATA_SEP <- ":" # data separator
ADD_SQL <- "1" # add SQL to file
curl <<- getCurlHandle()
srdaLogin <- function (loginURL, username, password) {
curlSetOpt(curl = curl, cookiejar = 'cookies.txt',
ssl.verifyhost = FALSE, ssl.verifypeer = FALSE,
followlocation = TRUE, verbose = TRUE)
params <- list('wpName1' = username, 'wpPassword1' = password)
if(url.exists(loginURL)) {
reply <- postForm(loginURL, .params = params, curl = curl,
style = "POST")
#if (DEBUG) print(reply)
info <- getCurlInfo(curl)
return (ifelse(info$response.code == 200, TRUE, FALSE))
}
else {
error("Can't access login URL!")
}
}
srdaConvertRequest <- function (request) {
return (list(select = "*",
from = "sf1104.users a, sf1104.artifact b",
where = "b.artifact_id = 304727"))
}
srdaRequestData <- function (requestURL, select, from, where, sep, sql) {
params <- list('uitems' = select,
'utables' = from,
'uwhere' = where,
'useparator' = sep,
'append_query' = sql)
if(url.exists(requestURL)) {
reply <- postForm(requestURL, .params = params, #.opts = opts,
curl = curl, style = "POST")
}
}
srdaGetData <- function(request) {
resultsURL <- paste(SRDA_HOST_URL, SRDA_QRESULT_URL,
collapse="", sep="")
results.query <- readLines(resultsURL, n = 1)
return (ifelse(results.query == request, TRUE, FALSE))
}
getSourceForgeData <- function (request) {
# Construct SRDA login and query URLs
loginURL <- paste(SRDA_HOST_URL, SRDA_LOGIN_URL, SRDA_LOGIN_REQ,
collapse="", sep="")
queryURL <- paste(SRDA_HOST_URL, SRDA_QUERY_URL, collapse="", sep="")
# Log into the system
if (!srdaLogin(loginURL, USER, PASS))
error("Login failed!")
rq <- srdaConvertRequest(request)
srdaRequestData(queryURL,
rq$select, rq$from, rq$where, DATA_SEP, ADD_SQL)
if (!srdaGetData(request))
error("Data collection failed!")
}
message("\nTesting SourceForge data collection...\n")
getSourceForgeData("SELECT *
FROM sf1104.users a, sf1104.artifact b
WHERE a.user_id = b.submitted_by AND b.artifact_id = 304727")
# clean up
close(curl)
更新 2(无功能版本):
if (!require(RCurl)) install.packages('RCurl')
library(RCurl)
# Users must authenticate to access Query Form
SRDA_HOST_URL <- "http://zerlot.cse.nd.edu"
SRDA_LOGIN_URL <- "/mediawiki/index.php?title=Special:Userlogin"
SRDA_LOGIN_REQ <- "&action=submitlogin&type=login"
# SRDA URL that Query Form sends POST requests to
SRDA_QUERY_URL <- "/cgi-bin/form.pl"
# SRDA URL that Query Form sends POST requests to
SRDA_QRESULT_URL <- "/qresult/blekh/blekh.txt"
# Parameters for result's format
DATA_SEP <- ":" # data separator
ADD_SQL <- "1" # add SQL to file
message("\nTesting SourceForge data collection...\n")
curl <- getCurlHandle()
curlSetOpt(curl = curl, cookiejar = 'cookies.txt',
ssl.verifyhost = FALSE, ssl.verifypeer = FALSE,
followlocation = TRUE, verbose = TRUE)
# === Authentication ===
loginParams <- list('wpName1' = USER, 'wpPassword1' = PASS)
loginURL <- paste(SRDA_HOST_URL, SRDA_LOGIN_URL, SRDA_LOGIN_REQ,
collapse="", sep="")
if (url.exists(loginURL)) {
postForm(loginURL, .params = loginParams, curl = curl, style = "POST")
info <- getCurlInfo(curl)
message("\nLogin results - HTTP status code: ", info$response.code, "\n\n")
} else {
error("\nCan't access login URL!\n\n")
}
# === Data collection ===
# Previous query was: "SELECT * FROM sf0305.users WHERE user_id < 100"
query <- list(select = "*",
from = "sf1104.users a, sf1104.artifact b",
where = "b.artifact_id = 304727")
getDataParams <- list('uitems' = query$select,
'utables' = query$from,
'uwhere' = query$where,
'useparator' = DATA_SEP,
'append_query' = ADD_SQL)
queryURL <- paste(SRDA_HOST_URL, SRDA_QUERY_URL, collapse="", sep="")
if(url.exists(queryURL)) {
postForm(queryURL, .params = getDataParams, curl = curl, style = "POST")
resultsURL <- paste(SRDA_HOST_URL, SRDA_QRESULT_URL,
collapse="", sep="")
results.query <- readLines(resultsURL, n = 1)
request <- paste(query$select, query$from, query$where)
if (results.query == request)
message("\nData request is successful, SQL query: ", request, "\n\n")
else
message("\nData request failed, SQL query: ", request, "\n\n")
} else {
error("\nCan't access data query URL!\n\n")
}
close(curl)
更新 3(服务器端调试)
最后,我能够与负责系统的人员取得联系,他帮助我将问题缩小到 cookie 管理 恕我直言。这是错误日志记录,对应运行我的代码:
[2014 年 3 月 21 日星期五 15:33:14] [错误] [客户端 54.204.180.203] [3 月 21 日星期五 15:33:14 2014] form.pl: /tmp/sess_3e55593e436a013597cd320e4c6a2fac: 在 /var/www/cgi-bin/form.pl 第 43 行
以下是产生该错误的服务器端脚本(Perl)的sn-p(脚本中的第1行是bash解释器指令,因此报告了行号43 很可能是第 44 行):
42 if (-e "/tmp/sess_$file") {
43 $session = PHP::Session->new($cgi->cookie("$session_name"));
44 $user_id = $session->get('wsUserID');
45 $user_name = $session->get('wsUserName');
以下是会话信息(1)认证后和(2)提交数据请求后,通过追踪获得手动身份验证和手动数据请求表单提交:
(1)“wiki_dbUserID=449;过期=周日,2014 年 4 月 20 日 21:04:14 GMT; 路径=/wiki_dbUserName=Blekh;过期=星期日,2014 年 4 月 20 日 21:04:14 GMT; 路径=/wiki_dbToken=已删除; expires=2013 年 3 月 21 日星期四 21:04:13 GMT"
(2) wiki_db_session=aaed058f97059174a59effe44b137cbc; _ga=GA1.2.2065853334.1395410153; EDSSID=e24ff5ed891c28c61f2d1f8dec424274; wiki_dbUserName=Blekh; wiki_dbLoggedOut=20140321210314; wiki_dbUserID=449
如果能帮我解决我的代码问题,我们将不胜感激!
【问题讨论】:
-
您需要展示您的 R 代码 - 您如何确保在请求之间保留 cookie? (默认情况下,httr 会这样做,但 RCurl 不会)
-
@hadley:我在 GitHub 上提供了指向我的源代码的链接(就在输出之前)。
-
那是很多代码。我建议将其删除为一个简单的可重现测试用例。
-
@hadley:根据您的建议,我正在用测试用例更新我的问题。它并没有我想要的那么小,但我尽了最大的努力来最小化和简化代码,而不会与我的现实生活场景有太大的偏差。我已经测试过了,结果是一样的。一件有趣的事情是我发现我的 SQL 请求中缺少空格(已修复),这应该触发了适当的 SQL 语法消息。对我来说,这意味着查询甚至没有达到被解析的程度。向您发送 USER 和 PASS,因为我不想危及对系统的访问。
-
@请忽略我关于 SQL 查询语法的注释 - 很好。
标签: r forms authentication web-scraping rcurl