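【Posted】: 2015-02-01 16:08:07
【Question】:
I wrote a web crawler to pull food items from walmart.com. Here is my spider. I can't figure out why it doesn't follow the links in the left-hand menu. It fetches the start page and then finishes.
My goal is for it to follow every link in the flyout menu on the left and then extract each food item from those pages.
I even tried using just allow=() so that it would follow every link on the page, but that still didn't work.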
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor as sle
from walmart_scraper.items import GroceryItem


class WalmartFoodSpider(CrawlSpider):
    name = "walmart_scraper"
    allowed_domains = ["www.walmart.com"]
    start_urls = ["http://www.walmart.com/cp/976759"]

    rules = (
        Rule(
            sle(restrict_xpaths=('//div[@class="lhn-menu-flyout-inner lhn-menu-flyout-2col"]/ul[@class="block-list"]/li/a',)),
            callback='parse',
            follow=True,
        ),
    )

    items_list_xpath = '//div[@class="js-tile tile-grid-unit"]'

    item_fields = {
        'title': './/a[@class="js-product-title"]/h3[@class="tile-heading"]/div',
        'image_url': './/a[@class="js-product-image"]/img[@class="product-image"]/@src',
        'price': './/div[@class="tile-price"]/div[@class="item-price- container"]/span[@class="price price-display"]|//div[@class="tile-price"]/div[@class="item-price- container"]/span[@class="price price-display price-not-available"]',
        'category': '//nav[@id="breadcrumb-container"]/ol[@class="breadcrumb-list"]/li[@class="js-breadcrumb breadcrumb "][2]/a',
        'subcategory': '//nav[@id="breadcrumb-container"]/ol[@class="breadcrumb-list"]/li[@class="js-breadcrumb breadcrumb active"]/a',
        'url': './/a[@class="js-product-image"]/@href',
    }

    def parse(self, response):
        selector = HtmlXPathSelector(response)

        # iterate over deals
        for item in selector.select(self.items_list_xpath):
            loader = XPathItemLoader(GroceryItem(), selector=item)

            # define processors
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()

            # iterate over fields and add xpaths to the loader
            for field, xpath in self.item_fields.iteritems():
                loader.add_xpath(field, xpath)

            yield loader.load_item()
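For readers unfamiliar with the two loader settings in the spider: `MapCompose(unicode.strip)` strips whitespace from every extracted string, and `Join()` then joins the cleaned strings with a single space. The snippet below is only a plain-Python approximation of that behavior (Python 3, no Scrapy needed). The helper names `strip_each` and `join_values` and the sample strings are made up for illustration and are not part of Scrapy.

```python
def strip_each(values):
    """Roughly what MapCompose(str.strip) does: strip every extracted string."""
    return [value.strip() for value in values]


def join_values(values, separator=" "):
    """Roughly what Join() does: merge the values into one string."""
    return separator.join(values)


# Hypothetical raw strings, as an XPath might return them for one field.
raw_title = ["  Whole Milk ", "\n1 gal\n"]

title = join_values(strip_each(raw_title))
print(title)
```

This only covers the common case of a list of strings; the real processors also handle details such as dropping `None` values.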
【Discussion】:
-
This happened to me because I had modified the basic template, which subclasses Spider. Use the crawl template to generate a working CrawlSpider:
scrapy genspider --template crawl spider_name allowed_domain
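A related pitfall that may apply to the spider in the question: Scrapy's documentation warns against using `parse` as a rule callback, because CrawlSpider uses `parse` itself to apply its rules. Overriding it can leave the spider processing only the start page. The toy model below is not Scrapy code; it is a simplified sketch (Python 3) of that mechanism. `ToyCrawlSpider`, `Follow`, `crawl`, `BrokenSpider`, `FixedSpider`, and the `SITE` data are all hypothetical names invented for illustration.

```python
# Fake site: each URL maps to (items on the page, links on the page).
SITE = {
    "/food": ([], ["/food/snacks", "/food/drinks"]),
    "/food/snacks": (["chips"], []),
    "/food/drinks": (["cola"], []),
}


class Follow:
    """Toy stand-in for a request to fetch another page."""

    def __init__(self, url):
        self.url = url


class ToyCrawlSpider:
    """Toy base class: parse() is the entry point that makes links get followed."""

    start_url = "/food"
    callback = "parse_item"  # name of the item-extracting method, as in a Rule

    def parse(self, url):
        # Start page: no item callback, but its links are followed.
        return self._handle(url, None)

    def downloaded(self, url):
        # Followed page: run the rule's callback, then keep following links.
        return self._handle(url, getattr(self, self.callback))

    def _handle(self, url, callback):
        if callback is not None:
            yield from callback(url)
        for link in SITE[url][1]:
            yield Follow(link)


class BrokenSpider(ToyCrawlSpider):
    callback = "parse"

    def parse(self, url):
        # Overrides the entry point, so no Follow objects are ever produced.
        yield from SITE[url][0]


class FixedSpider(ToyCrawlSpider):
    callback = "parse_item"

    def parse_item(self, url):
        # Separate callback: the inherited parse() keeps following links.
        yield from SITE[url][0]


def crawl(spider):
    """Tiny engine: the start page goes to parse(), followed pages to downloaded()."""
    items, seen = [], {spider.start_url}
    pending = list(spider.parse(spider.start_url))
    while pending:
        output = pending.pop(0)
        if isinstance(output, Follow):
            if output.url not in seen:
                seen.add(output.url)
                pending.extend(spider.downloaded(output.url))
        else:
            items.append(output)
    return items


print(crawl(BrokenSpider()))  # only the start page is handled
print(crawl(FixedSpider()))   # category pages are reached as well
```

In a real spider the corresponding change would be to rename the callback method to something like `parse_item` and point the rule at it with `callback='parse_item'`, which, as far as I recall, is also the callback name the crawl template uses.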
Tags: python-2.7 web-scraping scrapy screen-scraping