【Posted】: 2015-07-23 06:25:04
【Question】:
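I'm fairly new to Scrapy and have already built a few spiders. I'm trying to scrape the reviews from this page. So far my spider crawls the first page and scrapes its items, but it does not follow the links when it reaches the pagination.
I know this happens because the pagination is an Ajax request, and that it is a POST rather than a GET. I'm new to these, but I have read this. I also read this post here and followed the "mini-tutorial" to get the URL from the response, which seems to be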
http://www.pcguia.pt/category/reviews/sorter=recent&location=&loop=main+loop&action=sort&view=grid&columns=3&paginated=2&currentquery%5Bcategory_name%5D=reviews
But when I try to open it in a browser, it says
"Página nao encontrada" = "Page not found"
Is my thinking right so far? What am I missing?
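As a rough illustration of why that URL returns "Page not found": the string after `reviews/` looks like a form-encoded POST body rather than part of a path, so a browser GET cannot reproduce the request. The sketch below (Python 3, standard library only) encodes the same fields and attaches them as the body of a POST to the `ajax.php` endpoint used in the spider. The field values are copied from the URL above; `build_pagination_request` is a hypothetical helper name, and whether the server still answers this request is an assumption. Nothing is sent over the network here.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint taken from the spider's pagination_url.
AJAX_URL = "http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php"

# Fields copied from the string in the question (note the empty "location").
FORM_FIELDS = {
    "sorter": "recent",
    "location": "",
    "loop": "main loop",
    "action": "sort",
    "view": "grid",
    "columns": "3",
    "paginated": "2",
    "currentquery[category_name]": "reviews",
}


def build_pagination_request(url, fields):
    """Hypothetical helper: build (but do not send) a form-encoded POST."""
    # urlencode produces the application/x-www-form-urlencoded representation.
    body = urlencode(fields)
    request = Request(
        url,
        data=body.encode("ascii"),  # the fields travel in the body, not the URL
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
    return body, request


body, request = build_pagination_request(AJAX_URL, FORM_FIELDS)
print(request.get_method(), request.full_url)
print(body)
```

The printed body matches the part of the URL above that follows `reviews/`, which suggests the "mini-tutorial" output was the POST payload. In Scrapy the equivalent is a `FormRequest` with `formdata`, as the spider already does.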
Edit: my spider:
import scrapy
import json
from scrapy.http import FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from pcguia.items import ReviewItem


class PcguiaSpider(scrapy.Spider):
    name = "pcguia"  # spider name to call in terminal
    allowed_domains = ['pcguia.pt']  # the domain where the spider is allowed to crawl
    start_urls = ['http://www.pcguia.pt/category/reviews/#paginated=1']  # url from which the spider will start crawling
    page_incr = 1
    pagination_url = 'http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php'

    def parse(self, response):
        sel = Selector(response)

        if self.page_incr > 1:
            json_data = json.loads(response.body)
            sel = Selector(text=json_data.get('content', ''))

        hxs = Selector(response)

        item_pub = ReviewItem()
        item_pub['date'] = hxs.xpath('//span[@class="date"]/text()').extract()  # is in the format year-month-dayThours:minutes:seconds-timezone ex: 2015-03-31T09:40:00-0700
        item_pub['title'] = hxs.xpath('//title/text()').extract()

        # pagination code starts here
        # if page has content
        if sel.xpath('//div[@class="panel-wrapper"]'):
            self.page_incr += 1
            formdata = {
                'sorter': 'recent',
                'location': 'main loop',
                'loop': 'main loop',
                'action': 'sort',
                'view': 'grid',
                'columns': '3',
                'paginated': str(self.page_incr),
                'currentquery[category_name]': 'reviews'
            }
            yield FormRequest(url=self.pagination_url, formdata=formdata, callback=self.parse)
        else:
            return

        yield item_pub
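One thing worth checking in the spider above: on the paginated responses `sel` is rebuilt from the JSON `content` field, but the item fields are still read through `hxs`, which wraps the raw JSON body rather than the HTML fragment. The sketch below (Python 3, standard library only, not Scrapy) shows the general idea of decoding the JSON first and then querying the embedded HTML. The `content` key comes from the spider; the sample payload, `DateSpanParser`, and `dates_from_ajax_body` are hypothetical and only illustrate the shape of the approach. Nested `<span>` elements are not handled.

```python
import json
from html.parser import HTMLParser


class DateSpanParser(HTMLParser):
    """Collect the text of every <span class="date"> element."""

    def __init__(self):
        super().__init__()
        self._in_date = False
        self.dates = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "date":
            self._in_date = True
            self.dates.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_date = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current entry.
        if self._in_date:
            self.dates[-1] += data


def dates_from_ajax_body(body):
    """Decode the JSON body, then parse the HTML held under 'content'."""
    payload = json.loads(body)
    html_fragment = payload.get("content", "")
    parser = DateSpanParser()
    parser.feed(html_fragment)
    parser.close()
    return [text.strip() for text in parser.dates]


# Hypothetical payload; the real response structure is an assumption.
sample_body = json.dumps({
    "content": '<div class="panel-wrapper">'
               '<span class="date"> Publicado a 10 Dezembro, 2014 </span>'
               '</div>'
})
print(dates_from_ajax_body(sample_body))
```

In the spider itself, the analogous change would be to run the item XPaths against `sel` rather than `hxs` once the response is JSON, assuming the fragment actually contains the fields being extracted.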
Output:
2015-05-12 14:53:45+0100 [scrapy] INFO: Scrapy 0.24.5 started (bot: pcguia)
2015-05-12 14:53:45+0100 [scrapy] INFO: Optional features available: ssl, http11
2015-05-12 14:53:45+0100 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'pcguia.spiders', 'SPIDER_MODULES': ['pcguia.spiders'], 'BOT_NAME': 'pcguia'}
2015-05-12 14:53:45+0100 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-05-12 14:53:45+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-05-12 14:53:45+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-05-12 14:53:45+0100 [scrapy] INFO: Enabled item pipelines:
2015-05-12 14:53:45+0100 [pcguia] INFO: Spider opened
2015-05-12 14:53:45+0100 [pcguia] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-05-12 14:53:45+0100 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6033
2015-05-12 14:53:45+0100 [scrapy] DEBUG: Web service listening on 127.0.0.1:6090
2015-05-12 14:53:45+0100 [pcguia] DEBUG: Crawled (200) <GET http://www.pcguia.pt/category/reviews/#paginated=1> (referer: None)
2015-05-12 14:53:45+0100 [pcguia] DEBUG: Scraped from <200 http://www.pcguia.pt/category/reviews/>
{'date': '',
'title': [u'Reviews | PCGuia'],
}
2015-05-12 14:53:47+0100 [pcguia] DEBUG: Crawled (200) <POST http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php> (referer: http://www.pcguia.pt/category/reviews/)
2015-05-12 14:53:47+0100 [pcguia] DEBUG: Scraped from <200 http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php>
{'date': ''
'title': ''
}
【Comments】:
-
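Where are you taking the date from? I can't find anything on that page matching the XPath for the date. And by title you mean the review title, right? It looks like you have taken the page title instead. Please post the output you expect to get.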
-
I'm taking the date from here: pcguia.pt/desktops/asus-rog-gr8. The XPath '//span[@class="date"]/text()' points to ' Publicado a 10 Dezembro, 2014 '.
-
I have updated the code; check the answer.
Tags: ajax post pagination xmlhttprequest scrapy