【Posted】: 2015-05-11 02:18:13
【Question】:
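I'm having some trouble with Scrapy. I'm following the newcoder tutorial, but I seem to be stuck on the iteration. The tutorial is here: http://newcoder.io/scrape
I'm trying to scrape: http://freefuninaustin.com/
I can easily get all the titles with: 'title': '//h3[@class="content-list-title"]//@title'
However, whenever I run the spider, it grabs every title on the page for each post and inserts them all into my database. I want it to extract just one title per post and insert that into the database.
The spider code itself: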
# Inside the spider class:
deals_list_xpath = '//article'
item_fields = {
    'title': '//h3[@class="content-list-title"]//@title'
}

def parse(self, response):
    """
    Default callback used by Scrapy to process downloaded responses

    Testing contracts:
    @url http://www.freefuninaustin.com/blog/
    @returns items 1
    @scrapes title
    """
    selector = HtmlXPathSelector(response)

    # iterate over deals
    for deal in selector.xpath(self.deals_list_xpath):
        loader = XPathItemLoader(LivingSocialDeal(), selector=deal)

        # define processors
        loader.default_input_processor = MapCompose(unicode.strip)
        loader.default_output_processor = Join()

        # iterate over fields and add xpaths to the loader
        for field, xpath in self.item_fields.iteritems():
            loader.add_xpath(field, xpath)
        yield loader.load_item()
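One thing that may be relevant to the loop above: in XPath, an expression that starts with `//` is evaluated from the document root even when it is applied to a nested node, while `.//` searches only below the current node. The sketch below illustrates that difference with the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy's selectors. The markup, the `titles_from` helper, and the variable names are hypothetical and are not taken from the real page; the `" ".join(...)` call only imitates what `Join()` does to a list of values.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the blog listing markup.
PAGE = """
<html><body>
  <article><h3 class="content-list-title"><a title="Post A">Post A</a></h3></article>
  <article><h3 class="content-list-title"><a title="Post B">Post B</a></h3></article>
</body></html>
"""

root = ET.fromstring(PAGE)
TITLE_PATH = ".//h3[@class='content-list-title']/a"


def titles_from(node):
    """Collect the title attribute of every matching link below `node`."""
    return [a.get("title") for a in node.findall(TITLE_PATH)]


# Searching from the document root on every iteration (what a leading `//`
# does): each article ends up with all titles joined together.
absolute_style = [" ".join(titles_from(root)) for _ in root.iter("article")]

# Searching relative to the current article (what a leading `.//` does):
# each article yields only its own title.
relative_style = [" ".join(titles_from(article)) for article in root.iter("article")]

print(absolute_style)  # ['Post A Post B', 'Post A Post B']
print(relative_style)  # ['Post A', 'Post B']
```

If the same reasoning applies to the spider, the field expression would become `.//h3[@class="content-list-title"]//@title`, so that it is evaluated relative to each `//article` node. This has not been verified against the live page.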
Now the pipeline:
def process_item(self, item, spider):
    """Save deals in the database.

    This method is called for every item pipeline component.
    """
    session = self.Session()
    deal = Deals(**item)

    try:
        session.add(deal)
        session.commit()
    except:
        session.rollback()
        raise
    finally:
        session.close()

    return item
And the output from Scrapy:
loader = XPathItemLoader(LivingSocialDeal(), selector=deal)
2015-05-10 20:56:49-0500 [livingsocial] DEBUG: Scraped from <200 http://freefuninaustin.com/blog/>
{'title': u'Austin Area Splash Pads \u2013 2015 Schedules Reader Recommended: Favorite Parks in Austin and Beyond What\u2019s Up? Weekly (May 11-15, 2015) Weekend Top 10 FREE Events (May 8-10, 2015) Free Deutschen Pfest Parade in Pflugerville 2nd Annual Art in the Park in Round Rock Free Date Nights in Austin (May 7-10, 2015) Mother\u2019s Day Events & Freebies in Austin West Austin Studio Tour 2015 Picks for Families DIY Learning: O. Henry Museum Giveaway: Austin Children\u2019s Services Touch-A-Truck'}
2015-05-10 20:56:49-0500 [livingsocial] DEBUG: Scraped from <200 http://freefuninaustin.com/blog/>
{'title': u'Austin Area Splash Pads \u2013 2015 Schedules Reader Recommended: Favorite Parks in Austin and Beyond What\u2019s Up? Weekly (May 11-15, 2015) Weekend Top 10 FREE Events (May 8-10, 2015) Free Deutschen Pfest Parade in Pflugerville 2nd Annual Art in the Park in Round Rock Free Date Nights in Austin (May 7-10, 2015) Mother\u2019s Day Events & Freebies in Austin West Austin Studio Tour 2015 Picks for Families DIY Learning: O. Henry Museum Giveaway: Austin Children\u2019s Services Touch-A-Truck'}
2015-05-10 20:56:49-0500 [livingsocial] DEBUG: Scraped from <200 http://freefuninaustin.com/blog/>
{'title': u'Austin Area Splash Pads \u2013 2015 Schedules Reader Recommended: Favorite Parks in Austin and Beyond What\u2019s Up? Weekly (May 11-15, 2015) Weekend Top 10 FREE Events (May 8-10, 2015) Free Deutschen Pfest Parade in Pflugerville 2nd Annual Art in the Park in Round Rock Free Date Nights in Austin (May 7-10, 2015) Mother\u2019s Day Events & Freebies in Austin West Austin Studio Tour 2015 Picks for Families DIY Learning: O. Henry Museum Giveaway: Austin Children\u2019s Services Touch-A-Truck'}
2015-05-10 20:56:49-0500 [livingsocial] DEBUG: Scraped from <200 http://freefuninaustin.com/blog/>
It should look like this:
2015-05-10 21:13:55-0500 [livingsocial] DEBUG: Scraped from <200 https://www.livingsocial.com/cities/15-san-francisco>
{'title': u'1 or 3 Private Golf Lessons'}
2015-05-10 21:13:55-0500 [livingsocial] DEBUG: Scraped from <200 https://www.livingsocial.com/cities/15-san-francisco>
{'title': u'Los Angeles Dodgers at Oakland Athletics on August 18'}
2015-05-10 21:13:55-0500 [livingsocial] DEBUG: Scraped from <200 https://www.livingsocial.com/cities/15-san-francisco>
{'title': u'Glycolic or Salicylic Glow Facial Peel'}
2015-05-10 21:13:55-0500 [livingsocial] DEBUG: Scraped from <200 https://www.livingsocial.com/cities/15-san-francisco>
{'title': u'Boston Red Sox at Oakland Athletics on May 11'}
How can I extract the title only once per post?
【Comments】:
Tags: python postgresql python-2.7 scrapy