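【Posted】: 2015-10-27 16:22:12
【Question】:
I am new to Scrapy and am trying to scrape Craigslist by following this example (link http://mherman.org/blog/2012/11/08/recursively-scraping-web-pages-with-scrapy/#.VcFiAjBVhBc).
However, every time I run my code I only get the first record on each page. Sample output from the code below shows this: it contains only the first record from each page.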
link,title
/eby/npo/5155561393.html,Residential Administrator full time
/sfc/npo/5154403251.html,Sr. Director of Family Support Services
/eby/npo/5150280793.html,Veterans Program Internship
/eby/npo/5157174843.html,PROTECT OUR LIFE SAVING MEDICINE! $10-15/H
/eby/npo/5143949422.html,Program Supervisor - Multisystemic Therapy (MST)
/sby/npo/5145782515.html,Housing Specialist -- Santa Clara and Alameda Counties
/nby/npo/5148193893.html,Shipping Assistant for Non Profit
/sby/npo/5142160649.html,Companion for People with Developmental Disabilities
/sfc/npo/5139127862.html,Director of Vocational Services
I run the code with "scrapy crawl craig2 -o items_2.csv -t csv". Thanks in advance for your help.
The code is:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
from scrapy.http import Request

class CraigslistSampleItem(Item):
    title = Field()
    link = Field()

class MySpider(CrawlSpider):
    name = "craig2"
    allowed_domains = ["sfbay.craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/search/"]

    # rules = (Rule (SgmlLinkExtractor(allow=("index\d00\.html", ),restrict_xpaths=('//p[@class="button next"]',))
    # , callback="parse_items", follow= True),
    #)

    def start_requests(self):
        for i in range(9):
            yield Request("http://sfbay.craigslist.org/search/npo?s=" + str(i) + "00", self.parse_items)

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//span[@class="pl"]')
        items = []
        for ii in titles:
            item = CraigslistSampleItem()
            item["title"] = ii.select("a/text()").extract()
            item["link"] = ii.select("a/@href").extract()
            items.append(item)
            return(items)
【Discussion】:
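One likely cause, offered as a guess and not a confirmed diagnosis: if the `return(items)` line in `parse_items` runs inside the `for` loop, the callback exits during the first iteration, which would match the one-record-per-page output shown in the question. The sketch below reproduces that behavior in plain Python without Scrapy. The function names `first_only` and `all_rows` and the sample rows are hypothetical stand-ins for the spider callback and the selected nodes.

```python
def first_only(rows):
    """Mimics a callback whose `return` sits inside the loop body."""
    items = []
    for row in rows:
        items.append({"title": row})
        return items  # leaves the function during the first iteration
    return items      # only reached when `rows` is empty


def all_rows(rows):
    """Yields one item per row, so every row reaches the caller."""
    for row in rows:
        yield {"title": row}


# Hypothetical stand-in for the nodes selected from one results page.
rows = [
    "Residential Administrator full time",
    "Shipping Assistant for Non Profit",
    "Director of Vocational Services",
]

first = first_only(rows)           # a list holding a single item
everything = list(all_rows(rows))  # one item per row
```

In the spider, the equivalent change would be to dedent `return(items)` so that it runs after the loop has finished, or to `yield item` inside the loop; Scrapy callbacks may either return an iterable of items or be written as generators.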
Tags: javascript python html scrapy