【Question title】: Python Scrapy only scraping the same elements over and over again
【Posted】: 2017-08-22 20:23:49
【Question description】:

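I'm trying to learn Scrapy and am practicing on the Yelp website (this link), but when Scrapy runs it scrapes the same phone number and address over and over instead of the different entries. The selector I use matches all the "li" tags of a specific class, one per restaurant on the page. Each li tag contains the information for one restaurant, which I extract with the appropriate selectors, yet Scrapy gives me results that repeat the same 2 or 3 restaurants. For some reason Scrapy keeps using the same parts over and over, when it should move past them once they have been handled in the for loop. Here is the code: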

try:
    import scrapy
    from urlparse import urljoin
except ImportError:
    print "\nERROR IMPORTING THE NESSASARY LIBRARIES\n"

#scrapy.optional_features.remove('boto')

url = raw_input('ENTER THE SITE URL : ')

class YelpSpider(scrapy.Spider):
    name = 'yelp spider'
    start_urls = [url]

    def parse(self, response):
        SET_SELECTOR = '.regular-search-result'

        #Going over each li tags containg each resturant belonging to this class

        for yelp in response.css(SET_SELECTOR):

            #getting a slector to get a link to scrape website info from another page
            selector = '.indexed-biz-name a ::attr(href)'

            #getting the complete url joining the extracted part
            momo = urljoin(response.url, yelp.css(selector).extract_first())

            #All the selectors
            name = '.indexed-biz-name a span ::text'
            services = '.category-str-list a ::text'
            address1 = '.neighborhood-str-list ::text'
            address2 = 'address ::text'
            phone = '.biz-phone ::text'

           # extracting them and adding them in a dict 
            try:
                add1 = response.css(address1).extract_first().replace('\n','').replace('\n','')
                add2 = response.css(address2).extract_first().replace('\n','').replace('\n','')
                ADDRESS = add1 + ' ' + add2

                pookiebanana = {

                    "PHONE": response.css(phone).extract_first().replace('\n','').replace('\t',''),
                    "NAME": response.css(name).extract_first().replace('\n','').replace('\t',''),
                    "SERVICES": response.css(services).extract_first().replace('\n','').replace('\t',''),
                    "ADDRESS": ADDRESS,
                }
            except:
                pass

            #Opening another page passing the old dict
            Post = scrapy.Request(momo, callback=self.parse_yelp, meta={'item': pookiebanana})

            #yielding the dict with the website scraped
            yield Post

        #Clicking the next button and recursively calling the same function with the same link
        NEXT_PAGE_SELECTOR = '.u-decoration-none.next.pagination-links_anchor  ::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )

    def parse_yelp(self, response):
        #Website selector opening a new page from the link we extracted
        WEBSITE_SELECTOR = '.biz-website.js-add-url-tagging a ::text'

        item = response.meta['item']

        #inside the try block extracting the website info and returning the modified dict
        try:
            item['WEBSITE'] = ' '.join(response.css(WEBSITE_SELECTOR).extract_first().split(' '))
        except:
            pass
        return item

I have commented the code extensively to explain what I am doing at each step. What am I doing wrong?

Here is a screenshot of the output CSV, showing the duplicates.

And here is Scrapy's crawl output, where you can see it scraping the same things over and over. What is happening, and what am I doing wrong?

【Comments】:

  • Inside the for yelp loop you use response.css, but you should use yelp.css

Tags: python html css web-scraping scrapy


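【Answer 1】:

I can't test it, but inside the for yelp loop you should be calling yelp.css(), whereas you are calling response.css(). response.css() searches the whole page on every iteration, so extract_first() always returns the first match on the page rather than the match inside the current li.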
