【Question title】: Scrapy only outputting an open bracket
【Posted】: 2015-06-18 21:30:26
【Question description】:

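I am trying to scrape the titles and URLs of all Khan Academy pages under the math/science/economics sections. Currently, however, the spider outputs only an open bracket; before that, it scraped only the start URL.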

from openbar_index.items import OpenBarIndexItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class OpenBarSpider(CrawlSpider):
    """
    scrapes website URLs from educational websites and commits urls/webpage names/text to a document
    """

    name = 'openbar'
    allowed_domains = 'khanacademy.org'
    start_urls = [
        "https://www.khanacademy.org"
    ]

    rules = [
        Rule(SgmlLinkExtractor(allow=['/math/']), callback='parse_item', follow=True),
        Rule(SgmlLinkExtractor(allow=['/science/']), callback='parse_item', follow=True),
        Rule(SgmlLinkExtractor(allow=['/economics-finance-domain/']), callback='parse_item', follow=True)
    ]

    def parse_item(self, response):
        item = OpenBarIndexItem()
        url = response.url
        item['url'] = url
        item['title'] = response.xpath('/html/head/title/text()').extract()
        yield item

Does anyone know why this is happening, or have any tips on how to fix it?

【Question discussion】:

    Tags: python url web-crawler scrapy scrape


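    【Solution 1】:

    The problem is the value assigned to allowed_domains. According to the documentation, it must be a list, not a string. With a string, Scrapy finds no valid domain in it and filters the candidate requests out as offsite requests.

    So adding square brackets to that line should fix it: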

        allowed_domains = ['khanacademy.org']
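
To illustrate why the string form matches nothing useful, here is a minimal sketch. It assumes the offsite filter builds a host pattern by iterating over allowed_domains; the helper `build_host_regex` is hypothetical and only imitates that idea, it is not Scrapy's actual code. Iterating over a string yields single characters, so the resulting pattern only accepts one-character hosts.

```python
import re


def build_host_regex(allowed_domains):
    """Hypothetical helper: match the allowed domains and their subdomains."""
    # Iterating a str yields single characters;
    # iterating a list yields whole domain names.
    alternatives = "|".join(re.escape(d) for d in allowed_domains)
    return re.compile(r"^(.*\.)?(%s)$" % alternatives)


host = "www.khanacademy.org"

# String form, as in the question: alternatives are 'k', 'h', 'a', ...
string_matches = bool(build_host_regex("khanacademy.org").search(host))

# List form, as in the fix: the single alternative is 'khanacademy.org'
list_matches = bool(build_host_regex(["khanacademy.org"]).search(host))

print(string_matches, list_matches)
```

Under this assumption, the host is rejected with the string form and accepted with the list form, which is consistent with every followed link being dropped as offsite.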
    

    【Discussion】:
