【Question title】: Scrapy iteration issue in Python
【Posted】: 2015-05-11 02:18:13
【Question description】:

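I'm having some trouble with Scrapy. I'm following the newcoder tutorial, but I seem to be stuck on the iteration step. The tutorial is here: http://newcoder.io/scrape

I'm trying to scrape: http://freefuninaustin.com/

I can easily get all the titles with: 'title': '//h3[@class="content-list-title"]//@title'

However, whenever I run the spider, it grabs all of the titles on the page for every post and writes them to my database. I want it to extract just one title per post and write that to the database.

The code for the spider itself: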

deals_list_xpath = '//article'
item_fields = {
    'title': '//h3[@class="content-list-title"]//@title'
}

def parse(self, response):
    """
    Default callback used by Scrapy to process downloaded responses

    Testing contracts:
    @url http://www.freefuninaustin.com/blog/
    @returns items 1
    @scrapes title 

    """
    selector = HtmlXPathSelector(response)

    # iterate over deals
    for deal in selector.xpath(self.deals_list_xpath):
        loader = XPathItemLoader(LivingSocialDeal(), selector=deal)

        # define processors
        loader.default_input_processor = MapCompose(unicode.strip)
        loader.default_output_processor = Join()

        # iterate over fields and add xpaths to the loader
        for field, xpath in self.item_fields.iteritems():
            loader.add_xpath(field, xpath)
        yield loader.load_item()

And now the pipeline:

def process_item(self, item, spider):
    """Save deals in the database.

    This method is called for every item pipeline component.

    """
    session = self.Session()
    deal = Deals(**item)

    try:
        session.add(deal)
        session.commit()
    except:
        session.rollback()
        raise
    finally:
        session.close()

    return item

And the output from Scrapy:

     loader = XPathItemLoader(LivingSocialDeal(), selector=deal)
2015-05-10 20:56:49-0500 [livingsocial] DEBUG: Scraped from <200 http://freefuninaustin.com/blog/>
    {'title': u'Austin Area Splash Pads \u2013 2015 Schedules Reader Recommended: Favorite Parks in Austin and Beyond What\u2019s Up? Weekly (May 11-15, 2015) Weekend Top 10 FREE Events (May 8-10, 2015) Free Deutschen Pfest Parade in Pflugerville 2nd Annual Art in the Park in Round Rock Free Date Nights in Austin (May 7-10, 2015) Mother\u2019s Day Events & Freebies in Austin West Austin Studio Tour 2015 Picks for Families DIY Learning: O. Henry Museum Giveaway: Austin Children\u2019s Services Touch-A-Truck'}
2015-05-10 20:56:49-0500 [livingsocial] DEBUG: Scraped from <200 http://freefuninaustin.com/blog/>
    {'title': u'Austin Area Splash Pads \u2013 2015 Schedules Reader Recommended: Favorite Parks in Austin and Beyond What\u2019s Up? Weekly (May 11-15, 2015) Weekend Top 10 FREE Events (May 8-10, 2015) Free Deutschen Pfest Parade in Pflugerville 2nd Annual Art in the Park in Round Rock Free Date Nights in Austin (May 7-10, 2015) Mother\u2019s Day Events & Freebies in Austin West Austin Studio Tour 2015 Picks for Families DIY Learning: O. Henry Museum Giveaway: Austin Children\u2019s Services Touch-A-Truck'}
2015-05-10 20:56:49-0500 [livingsocial] DEBUG: Scraped from <200 http://freefuninaustin.com/blog/>
    {'title': u'Austin Area Splash Pads \u2013 2015 Schedules Reader Recommended: Favorite Parks in Austin and Beyond What\u2019s Up? Weekly (May 11-15, 2015) Weekend Top 10 FREE Events (May 8-10, 2015) Free Deutschen Pfest Parade in Pflugerville 2nd Annual Art in the Park in Round Rock Free Date Nights in Austin (May 7-10, 2015) Mother\u2019s Day Events & Freebies in Austin West Austin Studio Tour 2015 Picks for Families DIY Learning: O. Henry Museum Giveaway: Austin Children\u2019s Services Touch-A-Truck'}
2015-05-10 20:56:49-0500 [livingsocial] DEBUG: Scraped from <200 http://freefuninaustin.com/blog/>

It should look like this (output from the tutorial's original LivingSocial spider):

2015-05-10 21:13:55-0500 [livingsocial] DEBUG: Scraped from <200 https://www.livingsocial.com/cities/15-san-francisco>
    {'title': u'1 or 3 Private Golf Lessons'}
2015-05-10 21:13:55-0500 [livingsocial] DEBUG: Scraped from <200 https://www.livingsocial.com/cities/15-san-francisco>
    {'title': u'Los Angeles Dodgers at Oakland Athletics on August 18'}
2015-05-10 21:13:55-0500 [livingsocial] DEBUG: Scraped from <200 https://www.livingsocial.com/cities/15-san-francisco>
    {'title': u'Glycolic or Salicylic Glow Facial Peel'}
2015-05-10 21:13:55-0500 [livingsocial] DEBUG: Scraped from <200 https://www.livingsocial.com/cities/15-san-francisco>
    {'title': u'Boston Red Sox at Oakland Athletics on May 11'}

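How can I extract the title only once per post?

【Comments on the question】:

    Tags: python postgresql python-2.7 scrapy


    【Solution 1】:

    Add a . (dot) at the start of the XPath expression to make it context-specific. An expression that starts with // is evaluated against the whole document, even when it is applied to a nested selector such as a single article, which is why every item ends up with every title: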

    item_fields = {
        'title': './/h3[@class="content-list-title"]//@title'
    }
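
To see the difference outside of Scrapy, here is a minimal sketch. It swaps in the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors, and the markup in PAGE is hypothetical, simplified HTML rather than the real page. ElementTree rejects a leading // on an element, so the document-wide search is simulated by querying from the root on every iteration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the real blog page.
PAGE = """
<html><body>
  <article><h3 class="content-list-title"><a href="#" title="Splash Pads">Splash Pads</a></h3></article>
  <article><h3 class="content-list-title"><a href="#" title="Favorite Parks">Favorite Parks</a></h3></article>
  <article><h3 class="content-list-title"><a href="#" title="Weekly Events">Weekly Events</a></h3></article>
</body></html>
"""

root = ET.fromstring(PAGE)
TITLE_PATH = ".//h3[@class='content-list-title']//a"


def titles(node):
    """Return the title attribute of every matching link below node."""
    return [a.get("title") for a in node.findall(TITLE_PATH)]


# Document-wide search on every iteration: the effect of a leading // in Scrapy.
# The loop variable is ignored, so each article gets all titles joined together.
per_article_wrong = [" ".join(titles(root)) for _ in root.iter("article")]

# Context-relative search: the effect of a leading .// in Scrapy.
# Only links inside the current article are matched.
per_article_right = [" ".join(titles(article)) for article in root.iter("article")]

print(per_article_wrong)
print(per_article_right)
```

The first list repeats the same joined string once per article, which mirrors the output in the question; the second list holds one title per article.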
    

    The page also contains different "types" of article elements. To handle both kinds, rewrite the expression as:

    .//h3[@class="content-list-title" or @class="cp-title-small"]//@title
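
The same idea can be tried outside of Scrapy with the standard library. This is only a sketch under assumptions: the markup in PAGE is hypothetical, and because xml.etree.ElementTree does not support `or` inside a predicate, the helper article_title (a made-up name) runs one relative query per class instead of a single combined expression. In Scrapy itself the single XPath above is enough.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup with the two kinds of article headings.
PAGE = """
<html><body>
  <article><h3 class="content-list-title"><a href="#" title="Splash Pads">Splash Pads</a></h3></article>
  <article><h3 class="cp-title-small"><a href="#" title="Touch-A-Truck">Touch-A-Truck</a></h3></article>
</body></html>
"""

# Heading classes accepted as title containers (taken from the XPath above).
TITLE_CLASSES = ("content-list-title", "cp-title-small")


def article_title(article):
    """Return the first title found under any accepted h3 class, else None."""
    for css_class in TITLE_CLASSES:
        # The leading .// keeps the search inside this one article.
        for link in article.findall(".//h3[@class='%s']//a" % css_class):
            if link.get("title"):
                return link.get("title")
    return None


root = ET.fromstring(PAGE)
titles = [article_title(article) for article in root.iter("article")]
print(titles)
```

Each article contributes exactly one title, whichever of the two heading classes it uses.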
    

    【Comments】:

    • Thanks! I'll give it a try when I get home.