Python Scrapy only scraping the same elements over and over again

【Question title】: Python Scrapy only scraping the same elements over and over again
【Posted】: 2017-08-22 20:23:49
【Question description】:

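I am trying to learn Scrapy and am practicing on the Yelp website (this LINK). When Scrapy runs, it scrapes the same phone number and address over and over again instead of scraping the different listings. My outer selector matches every "li" tag of a specific class, one per restaurant on the page. Each li tag contains one restaurant's information, which I extract with the appropriate selectors, but Scrapy gives me results that repeat the same 2 or 3 restaurants. For some reason Scrapy keeps using the same parts over and over again, when it should move past them once they have been handled in the for loop. Here is the code: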

try:
    import scrapy
    from urlparse import urljoin
except ImportError:
    print "\nERROR IMPORTING THE NESSASARY LIBRARIES\n"

#scrapy.optional_features.remove('boto')

url = raw_input('ENTER THE SITE URL : ')

class YelpSpider(scrapy.Spider):
    name = 'yelp spider'
    start_urls = [url]

    def parse(self, response):
        SET_SELECTOR = '.regular-search-result'

        #Going over each li tags containg each resturant belonging to this class

        for yelp in response.css(SET_SELECTOR):

            #getting a slector to get a link to scrape website info from another page
            selector = '.indexed-biz-name a ::attr(href)'

            #getting the complete url joining the extracted part
            momo = urljoin(response.url, yelp.css(selector).extract_first())

            #All the selectors
            name = '.indexed-biz-name a span ::text'
            services = '.category-str-list a ::text'
            address1 = '.neighborhood-str-list ::text'
            address2 = 'address ::text'
            phone = '.biz-phone ::text'

            # extracting them and adding them in a dict
            try:
                add1 = response.css(address1).extract_first().replace('\n','').replace('\n','')
                add2 = response.css(address2).extract_first().replace('\n','').replace('\n','')
                ADDRESS = add1 + ' ' + add2

                pookiebanana = {

                    "PHONE": response.css(phone).extract_first().replace('\n','').replace('\t',''),
                    "NAME": response.css(name).extract_first().replace('\n','').replace('\t',''),
                    "SERVICES": response.css(services).extract_first().replace('\n','').replace('\t',''),
                    "ADDRESS": ADDRESS,
                }
            except:
                pass

            #Opening another page passing the old dict
            Post = scrapy.Request(momo, callback=self.parse_yelp, meta={'item': pookiebanana})

            #yielding the dict with the website scraped
            yield Post

        #Clicking the next button and recursively calling the same function with the same link
        NEXT_PAGE_SELECTOR = '.u-decoration-none.next.pagination-links_anchor  ::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )

    def parse_yelp(self, response):
        #Website selector opening a new page from the link we extracted
        WEBSITE_SELECTOR = '.biz-website.js-add-url-tagging a ::text'

        item = response.meta['item']

        #inside the try block extracting the website info and returning the modified dict
        try:
            item['WEBSITE'] = ' '.join(response.css(WEBSITE_SELECTOR).extract_first().split(' '))
        except:
            pass
        return item

I have commented the code extensively to explain what I am doing at each step. What am I doing wrong?

Here is a screenshot of the output CSV, showing the duplicates.

Here is Scrapy's crawl output; you can see it scraping the same thing over and over again. What is happening, and what am I doing wrong?

【Question comments】:

Inside the for yelp loop you use response.css, but you should use yelp.css.

Tags: python html css web-scraping scrapy

【Solution 1】:

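I can't test it, but inside the for yelp loop you should use yelp.css(), whereas you are using response.css(). response.css() searches the whole page, so extract_first() returns the first matching element on the page in every iteration; yelp.css() searches only inside the current li element.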
