【Question title】: How do I follow links in Scrapy?
【Posted】: 2020-03-17 20:27:26
【Question】:
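import scrapy
from bs4 import BeautifulSoup
# AmazonappItem is defined in the project's items module (import not shown in the original post).


class AmazonspiderSpider(scrapy.Spider):
    name = 'Amazonspider'
    start_urls = ['https://www.amazon.co.uk/s?k=9780297833697']

    def parse(self, response):
        Items = AmazonappItem()

        # Find the first search-result link and build its absolute URL.
        soup = BeautifulSoup(response.text, 'lxml')
        book = soup.find("a", {"class": "a-link-normal a-text-normal"})
        link = book.get('href')
        myurl = "https://www.amazon.co.uk" + link
        Items['bookurl'] = myurl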

I have the new link in myurl, and now I need to follow it. How do I do that?

【Comments on the question】:

Tags: python web-scraping beautifulsoup scrapy


【Solution 1】:

You need to yield a scrapy.Request object:

def parse(self, response):
    Items = AmazonappItem()

    soup = BeautifulSoup(response.text, 'lxml')
    book = soup.find("a", {"class": "a-link-normal a-text-normal"})
    link = book.get('href')
    myurl = "https://www.amazon.co.uk" + link
    Items['bookurl'] = myurl
    yield scrapy.Request(url=myurl, callback=self.your_callback_for_that_url)

【Comments】:

  • What is this "your_callback_for_that_url"? What should I write in its place?
  • You need to write a method (def your_callback_for_that_url(self, response): ...) that handles the response for that URL.
  • I did as you said, but it gives the following error: "yield scrapy.Request(url=myurl, callback=self.visit) AttributeError: 'AmazonspiderSpider' object has no attribute 'visit' "
  • code: "class AmazonspiderSpider(scrapy.Spider): name = 'Amazonspider' start_urls = ['amazon.co.uk/s?k=9780297833697' ] def parse(self, response): Items = AmazonappItem() soup = BeautifulSoup( response.text, 'lxml') book = soup.find("a", {"class": "a-link-normal a-text-normal"}) link = book.get('href') myurl = " amazon.co.uk" + link Items['bookurl'] = myurl yield {'Book': myurl } yield scrapy.Request(url=myurl, callback=self.visit)
  • yield scrapy.Request(url=myurl, callback=self.visit) def visit(self, response): soup = BeautifulSoup(response.text, 'lxml') book = soup.find(" span", {"id": "productTitle"}).get_text() print(book)
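
The pasted code has lost its indentation, so the cause of the AttributeError cannot be confirmed from the thread. One common cause, assumed here, is that visit ends up outside the class body (at module level, or nested inside parse), so self.visit finds nothing on the spider. The sketch below reproduces that lookup failure in plain Python, without Scrapy; BrokenSpider and FixedSpider are hypothetical stand-ins for the spider class.

```python
class BrokenSpider:
    """Stand-in for the spider: parse looks up a callback on self."""

    def parse(self):
        # Attribute lookup happens on the instance and its class only.
        return self.visit


def visit(self, response):
    # Defined at module level, so it is NOT an attribute of BrokenSpider,
    # even though it takes a `self` parameter.
    return "visited"


class FixedSpider:
    """Same idea, but the callback is indented inside the class body."""

    def parse(self):
        return self.visit

    def visit(self, response):
        # Now a method of the class, so self.visit resolves.
        return "visited"


if __name__ == "__main__":
    try:
        BrokenSpider().parse()
    except AttributeError as exc:
        print("broken:", exc)
    callback = FixedSpider().parse()
    print("fixed:", callback(None))
```

If that assumption holds for the real spider, indenting def visit(self, response) to the same level as def parse(self, response) inside the class should make self.visit resolve.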