【Question title】: Scraper implemented with Python, using Scrapy and Selenium, launches but shuts down
【Posted】: 2015-07-02 11:29:43
【Question description】:

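I'm having trouble implementing my scraper. I took the initial example code from here (selenium with scrapy for dynamic page, by @alecxe) and extended it to get some results. The scraper does seem to launch (we can watch the simulated clicks on the next button), but it shuts down after a second and neither prints anything nor collects any items.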

Here is the code:

from scrapy.spider import BaseSpider 
from selenium import webdriver

class product_spiderItem(scrapy.Item):
    title = scrapy.Field()
    price=scrapy.Field()
    pass

class ProductSpider(BaseSpider):
    name = "product_spider"
    allowed_domains = ['ebay.com']
    start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)

        while True:
            next = self.driver.find_element_by_xpath('//td[@class="pagn-next"]/a')

            try:
                next.click()

            # get the data and write it to scrapy items
                response = TextResponse(url=response.url, body=self.driver.page_source, encoding='utf-8')
                print response.url
                for prod in response.xpath('//ul[@id="GalleryViewInner"]/li/div/div'):
                    item = product_spiderItem()
                    item['title'] = prod.xpath('.//div[@class="gvtitle"]/h3/a/text()').extract()[0]
                    item['price'] = prid.xpath('.//div[@class="prices"]/span[@class="bold"]/text()').extract()[0]
                    print item['price']
                    yield item

            except:
                break

        self.driver.close()

I use scrapy crawl product_scraper -o products.json to store the results. What am I missing?

【Discussion】:

    Tags: python selenium xpath web-scraping scrapy


    【Solution 1】:

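    While trying to understand what was wrong with your code, I made some edits and came up with the following (tested) code, which should be closer to your goal: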

    import scrapy
    from selenium import webdriver
    
    class product_spiderItem(scrapy.Item):
        title = scrapy.Field()
        price=scrapy.Field()
        pass
    
    class ProductSpider(scrapy.Spider):
        name = "product_spider"
        allowed_domains = ['ebay.com']
        start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']
    
        def __init__(self):
            self.driver = webdriver.Firefox()
    
        def parse(self, response):
            self.driver.get(response.url)
    
            while True:
    
                sel = scrapy.Selector(text=self.driver.page_source)
    
                for prod in sel.xpath('//ul[@id="GalleryViewInner"]/li/div/div'):
                    item = product_spiderItem()
                    item['title'] = prod.xpath('.//div[@class="gvtitle"]/h3/a/text()').extract()
                    item['price'] = prod.xpath('.//div[@class="prices"]//span[@class=" bold"]/text()').extract()
                    yield item
    
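                try:
                    # Look up the "next" link inside the try block, so a missing link
                    # (for example on the last page) ends the loop instead of raising
                    next_link = self.driver.find_element_by_xpath('//td[@class="pagn-next"]/a')
                    next_link.click()
                except Exception:
                    break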
    
        def closed(self, reason):
            self.driver.close()
    

    Please try it and see whether this code works better.
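
One possible reason the version in the question stopped without any output, offered as a reading of the posted snippet rather than a confirmed diagnosis: its bare `except:` wraps the parsing code as well as the click, so any error raised while parsing (the snippet uses `TextResponse` without importing it and refers to `prid` instead of `prod`) is swallowed and the loop simply breaks. The sketch below reproduces that effect in plain Python, without Scrapy or Selenium. `NoNextPage`, `click_next`, and both `scrape_*` functions are hypothetical names invented for this illustration, and the sample data is made up.

```python
class NoNextPage(Exception):
    """Hypothetical stand-in for the error Selenium raises when no next link exists."""


def click_next(pages, index):
    """Pretend to click "next": return the following page index or raise."""
    if index + 1 >= len(pages):
        raise NoNextPage
    return index + 1


def scrape_hiding_errors(pages):
    """Mimics the question: click and parsing share one bare try/except."""
    items, index = [], 0
    while True:
        try:
            index = click_next(pages, index)
            for row in pages[index]:
                # `rw` is a deliberate typo, standing in for `prid` in the question
                items.append({"title": row["title"], "price": rw["price"]})
        except:  # catches the NameError too, so the loop ends without a message
            break
    return items


def scrape_narrow_except(pages):
    """Parsing runs unguarded; only the "click next" step may end the loop."""
    items, index = [], 0
    while True:
        for row in pages[index]:
            items.append({"title": row["title"], "price": row["price"]})
        try:
            index = click_next(pages, index)
        except NoNextPage:
            break
    return items


# Made-up sample data: two result pages with one product each
pages = [[{"title": "A", "price": "1"}], [{"title": "B", "price": "2"}]]
print(scrape_hiding_errors(pages))        # empty list: the typo was silently swallowed
print(len(scrape_narrow_except(pages)))   # both products are collected
```

Keeping the `try` block as small as possible, and catching a specific exception, means a typo in the parsing code shows up as a traceback instead of an empty output file.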

    【Comments】:

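    • Works perfectly! Thank you very much! I guess the main reason is that the page content is now selected with sel = scrapy.Selector(text=self.driver.page_source) instead of TextResponse. But TextResponse seems to work in other code I have seen.
    • You're welcome :-) Note that I removed the [0] after extract() to prevent errors when an element is not found.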
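
The remark about `[0]` can be illustrated without Scrapy. `extract()` returns a plain list of strings, so indexing it with `[0]` raises `IndexError` whenever the XPath matches nothing. A minimal sketch, where `first_or_default` is a hypothetical helper (not part of Scrapy) and the price string is made up:

```python
def first_or_default(values, default=None):
    """Return the first extracted value, or `default` when nothing matched."""
    return values[0] if values else default


matched = ["$12.99"]   # made-up example of what extract() returns on a match
unmatched = []         # extract() returns an empty list when nothing matches

print(first_or_default(matched))
print(first_or_default(unmatched, default=""))

try:
    unmatched[0]
except IndexError as exc:
    print("indexing an empty result fails:", exc)
```

Scrapy selectors also provide `extract_first()`, which returns `None` instead of raising when there is no match, so a helper like this is only needed when working with the raw list.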