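[Posted]: 2017-05-24 16:08:08
[Question]:
I'm currently building a Google Chrome extension that notifies me whenever new grades for any of my classes are posted to my university gradebook. Right now I'm trying to iteratively crawl and scrape a URL and compare the result with the last iteration (...? Advice on this would be greatly appreciated!). At the moment, when I use the request() function, even asynchronously, the callback receives an undefined response and body, and I get another strange error if I try to console.log all of them.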
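For the "compare with the last iteration" part, one possible approach is to keep a fingerprint of each page from the previous poll and notify only when it changes. The sketch below is a minimal illustration under assumptions, not a tested extension component: `fingerprint`, `detectChange`, and the in-memory `Map` used as storage are hypothetical names, and the hash is a simple non-cryptographic djb2-style hash chosen only to keep the snippet dependency-free.

```javascript
// Compute a simple non-cryptographic fingerprint (djb2-style) of a string.
function fingerprint(text) {
  var hash = 5381;
  for (var i = 0; i < text.length; i++) {
    // hash * 33 + charCode, truncated to an unsigned 32-bit integer
    hash = ((hash * 33) + text.charCodeAt(i)) >>> 0;
  }
  return hash.toString(16);
}

// Compare the page body with the fingerprint stored from the previous poll.
// `store` is any Map-like object (hypothetical storage). Returns true only
// when a page that was seen before now has different content.
function detectChange(store, pageUrl, body) {
  var current = fingerprint(body);
  var previous = store.get(pageUrl);
  store.set(pageUrl, current);
  return previous !== undefined && previous !== current;
}
```

In a crawler like the one in this question, `detectChange(seen, queueItem.url, responseBuffer.toString())` could be called from the "fetchcomplete" handler, with `seen` being a `Map` that would presumably need to be persisted between browser sessions.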
This is the error I get from the request() callback:
bundle.js:24 Uncaught TypeError: Cannot read property 'headers' of undefined
at Request._callback (bundle.js:24)
at self.callback (bundle.js:54273)
at Request.EventEmitter.emit (bundle.js:95413)
at Request.start (bundle.js:54842)
at Request.end (bundle.js:55610)
at end (bundle.js:54652)
at bundle.js:54666
at Item.run (bundle.js:103974)
at drainQueue (bundle.js:103944)
Here is my code (I changed the URL so you can't see my school's login URL):
var Crawler = require("simplecrawler"),
    url = require("url"),
    cheerio = require("cheerio"),
    request = require("request");

var initialURL = "https://www.fakeURL.com/";
var crawler = new Crawler(initialURL);

request("https://www.fakeURL.com/", {
    // The jar option isn't necessary for simplecrawler integration, but it's
    // the easiest way to have request remember the session cookie between this
    // request and the next
    jar: true,
    mode: 'no-cors'
}, function(error, response, body) {
    // Start by saving the cookies. We'll likely be assigned a session cookie
    // straight off the bat, and then the server will remember the fact that
    // this session is logged in as user "iamauser" after we've successfully
    // logged in
    crawler.cookies.addFromHeaders(response.headers["set-cookie"]);

    // We want to get the names and values of all relevant inputs on the page,
    // so that any CSRF tokens or similar things are included in the POST
    // request
    var $ = cheerio.load(body),
        formDefaults = {},
        // You should adapt these selectors so that they target the
        // appropriate form and inputs
        formAction = $("#login").attr("action"),
        loginInputs = $("input");

    // We loop over the input elements and extract their names and values so
    // that we can include them in the login POST request
    loginInputs.each(function(i, input) {
        var inputName = $(input).attr("name"),
            inputValue = $(input).val();

        formDefaults[inputName] = inputValue;
    });

    // Time for the login request!
    request.post(url.resolve(initialURL, formAction), {
        // We can't be sure that all of the input fields have a correct default
        // value. Maybe the user has to tick a checkbox or something similar in
        // order to log in. This is something you have to find this out manually
        // by logging in to the site in your browser and inspecting in the
        // network panel of your favorite dev tools what parameters are included
        // in the request.
        form: Object.assign(formDefaults, {
            username: "secretusername",
            password: "secretpassword"
        }),
        // We want to include the saved cookies from the last request in this
        // one as well
        jar: true
    }, function(error, response, body) {
        // That should do it! We're now ready to start the crawler
        crawler.interval = 10000 //600000 // 10 minutes
        crawler.maxConcurrency = 1; // 1 active check at a time
        crawler.maxDepth = 5;
        crawler.start();
    });
});

crawler.on("fetchcomplete", function(queueItem, responseBuffer, response) {
    console.log("Fetched", queueItem.url, responseBuffer.toString());
});

// crawler.interval = 600000 // 10 minutes
// crawler.maxConcurrency = 1; // 1 active check at a time
// crawler.maxDepth = 5;
//
// crawler.start();
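One thing to note is that I added the "no-cors" mode to the request so that I would stop running into CORS issues whenever I test this, but could that be what is causing this problem?
Thanks!
EDIT: I'm using Browserify so that I can use require() in the browser. I can't post the actual code from bundle.js because it is very long and wouldn't fit here. Just wanted to clarify. Thanks!
EDIT2: This is what I get when I try console.log(error):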
Error: Invalid value for opts.mode
at new module.exports (bundle.js:108605)
at Object.http.request (bundle.js:108428)
at Object.https.request (bundle.js:97056)
at Request.start (bundle.js:54843)
at Request.end (bundle.js:55613)
at end (bundle.js:54655)
at bundle.js:54669
at Item.run (bundle.js:103977)
at drainQueue (bundle.js:103947)
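Both traces are consistent with the request failing before any response arrives: with Node-style callbacks, response and body are typically undefined whenever error is set, so reading response.headers throws. The "Invalid value for opts.mode" message appears to be raised inside the Browserify bundle's HTTP layer rather than by the server, which would suggest that mode: 'no-cors' (a fetch option) is not a value that layer accepts; that reading is an inference from the trace and has not been verified against the bundle. A minimal sketch of an error-first guard follows. `withErrorGuard` and `fakeRequest` are hypothetical names, and `fakeRequest` only stands in for request() so that the snippet runs on its own.

```javascript
// Wrap a Node-style callback so the success handler runs only when
// there is no error and a response object actually exists.
function withErrorGuard(onSuccess, onFailure) {
  return function (error, response, body) {
    if (error || !response) {
      onFailure(error || new Error("No response received"));
      return;
    }
    onSuccess(response, body);
  };
}

// Hypothetical stand-in for request(): it rejects any `mode` option with
// the message seen in the trace (an assumption about the real behavior),
// otherwise it fakes a successful response.
function fakeRequest(url, options, callback) {
  if (options.mode !== undefined) {
    callback(new Error("Invalid value for opts.mode"));
    return;
  }
  callback(null, { statusCode: 200, headers: { "set-cookie": ["sid=1"] } }, "<html></html>");
}

var log = [];
var guarded = withErrorGuard(
  function (response) {
    log.push("cookies: " + response.headers["set-cookie"].length);
  },
  function (error) {
    log.push("failed: " + error.message);
  }
);

// The first call carries the suspicious option, the second one omits it.
fakeRequest("https://www.fakeURL.com/", { jar: true, mode: "no-cors" }, guarded);
fakeRequest("https://www.fakeURL.com/", { jar: true }, guarded);
```

With a guard like this, the first failure surfaces as a readable message instead of a TypeError, which makes it easier to see whether removing the mode option changes the outcome.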
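[Comments]:
- Try to figure out what error contains, and check response.status. There seems to be "some error" in your HTTP request. Without more information, that is all I can say.
- I tried checking the error, but the problem is that it gives me this: Error: Invalid value for opts.mode (full trace in the original post). And I can't check response.status because response is undefined.
- @OmarBaradei So, did this answer end up helping you?
Tags: javascript google-chrome-extension web-scraping cors web-crawler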