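Usually, a website gets its data from backend calls or from a third-party service.
In this case, however, the raw data you are scraping is embedded in a native JavaScript statement. Import the regex module (re) to filter out or extract that statement, then use the json module to parse it and pull out the data you want.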
var tourcode ={
"id": 24588,
"title": "تور مشهد 22 دی 96 (از اصفهان)",
"slug": "تور-مشهد-22-دی-96-از-اصفهان",
....
"packages": {
"bundles": {
{
"308892": {
"id": 308892,
"hotels": [
{
"id": 1298,
"bundle_id": 308892,
"link": "https://lastsecond.ir/hotels/1298-mehr-reza",
"location_id": 410,
"location_name": "مشهد",
"name": "Mehr Reza hotel",
"grade": {
"id": 80,
"name": "هتل آپارتمان",
"icons": [
"fa-building"
],
"count": "0",
"singleIcon": "<i class=\"fa fa-building large-star\"> <label class=\"orange-text\"></label> </i>"
},
"decoratedGrade": "<div class=\"d-inline-block ltr hotelGrade\" data-toggle=\"tooltip\" data-placement=\"left\" title=\"هتل آپارتمان\"><i class=\"fa fa-building orange-text\"></i></div>",
"score": 0,
"imageUrl": "https://lastsecond.ir/site/images/placeholder/hotel.svg",
"reviewsCount": 0,
"decoratedScore": "<div class=\"hotelScore\"><div class=\"score\" style=\"width: 0%\"></div></div>",
"description": "صبحانه",
"service_id": 2,
"service": "bb",
"serviceName": "B.B",
"serviceDesc": "با صبحانه",
"ordering": "1"
}
],
"prices": {
"1": {
"1": "295000"
},
"2": {
"1": "370000"
},
"3": {
"1": "295000"
},
"4": {
"1": "240000"
}
}
}
}
...
}}
}
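The approach described above (find the statement with a regex, then let the json module parse it) can be sketched outside Scrapy as follows. This is only a minimal sketch: `SAMPLE_HTML` and the helper `extract_js_object` are hypothetical names made up for illustration, and it assumes the assigned literal is valid JSON (double-quoted keys, no trailing commas). Instead of capturing the whole literal with the regex, it uses the regex only to find where the literal starts and then calls `json.JSONDecoder.raw_decode`, which parses one JSON value and ignores whatever follows it.

```python
import json
import re

# Hypothetical stand-in for the page source (trimmed from the sample data).
SAMPLE_HTML = """
<script>
var tourcode ={"id": 24588, "packages": {"bundles": {"308892": {"id": 308892}}}};
var other = 1;
</script>
"""


def extract_js_object(source, var_name):
    """Return the literal assigned to `var_name`, parsed as JSON.

    Assumption: the literal is valid JSON, which is not true of every
    JavaScript object literal.
    """
    # Locate the end of "var <name> =", tolerating any spacing around "=".
    match = re.search(r"var\s+" + re.escape(var_name) + r"\s*=\s*", source)
    if match is None:
        raise ValueError("variable %r not found" % var_name)
    # Parse exactly one JSON value starting at that index; trailing text
    # such as ";" and the rest of the script is left untouched.
    obj, _end = json.JSONDecoder().raw_decode(source, match.end())
    return obj


tour = extract_js_object(SAMPLE_HTML, "tourcode")
print(tour["id"])
```

Because the decoder decides where the value ends, this variant does not depend on the statement being terminated by `;` followed by a newline.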
For reference, I found a good post on how to extract data from a native JS statement.
Assuming you are working in scrapy shell:
$ scrapy shell https://lastsecond.ir/tours/24588-%D8%AA%D9%88%D8%B1-%D9%85%D8%B4%D9%87%D8%AF-22-%D8%AF%DB%8C-96-%D8%A7%D8%B2-%D8%A7%D8%B5%D9%81%D9%87%D8%A7%D9%86
[scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
....
>>> import re
>>> import json
>>> # capture the literal assigned to tourcode, allowing any spacing around "="
>>> jsonstr = re.findall(r"var tourcode\s*=\s*(.+?);\n", response.body.decode('utf-8'), re.S)
>>> jsonobj = json.loads(jsonstr[0])
# parse json object here
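
The "parse json object here" step could look roughly like the sketch below. It is an illustration only: the `jsonobj` literal is a trimmed, hand-written stand-in for the parsed data, `iter_bundle_rows` is a hypothetical helper, and it assumes that `packages.bundles` maps a bundle id to an object with `hotels` and `prices`, as the sample data suggests. The meaning of the nested price keys is not stated in the data, so they are kept as-is.

```python
# Trimmed stand-in for the parsed object; only a few fields from the sample are kept.
jsonobj = {
    "id": 24588,
    "packages": {
        "bundles": {
            "308892": {
                "id": 308892,
                "hotels": [
                    {"id": 1298, "name": "Mehr Reza hotel", "serviceName": "B.B"}
                ],
                "prices": {"1": {"1": "295000"}, "2": {"1": "370000"}},
            }
        }
    },
}


def iter_bundle_rows(tour):
    """Yield one flat dict per bundle with its hotel names and prices."""
    bundles = tour.get("packages", {}).get("bundles", {})
    for bundle_id, bundle in bundles.items():
        hotel_names = [hotel.get("name") for hotel in bundle.get("hotels", [])]
        # Flatten the two-level price mapping; prices arrive as strings.
        prices = {
            (outer_key, inner_key): int(value)
            for outer_key, inner in bundle.get("prices", {}).items()
            for inner_key, value in inner.items()
        }
        yield {"bundle_id": bundle_id, "hotels": hotel_names, "prices": prices}


rows = list(iter_bundle_rows(jsonobj))
for row in rows:
    print(row["bundle_id"], row["hotels"], row["prices"])
```

Using `.get()` with defaults keeps the loop from failing on bundles that lack hotels or prices; in a spider, each row could be yielded as an item instead of printed.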