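【Posted】: 2020-04-27 15:07:23
【Question】:
Background:
I'm planning to buy a car and want to monitor prices.
I'd like to use Scrapy to do this for me, but the website blocks my code from doing so.
MWE/Code: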
#!/usr/bin/python3
# from bs4 import BeautifulSoup
import scrapy  # provides the Spider base class and Request objects
from scrapy.crawler import CrawlerProcess  # lets the spider run as a plain script

urls = ['https://www.carsales.com.au/cars/volkswagen/golf/7-series/wagon-bodystyle/diesel-fueltype/']


class HeadphoneSpider(scrapy.Spider):  # our class inherits from scrapy.Spider
    name = "headphones"

    def start_requests(self):
        # list of URLs to crawl
        urls = ['https://www.carsales.com.au/cars/volkswagen/golf/7-series/wagon-bodystyle/diesel-fueltype/']
        # urls = ['https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=headphones&rh=i%3Aaps%2Ck%3Aheadphones&ajr=2']
        for url in urls:
            # parse() is called with the response once the download finishes
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # collect the src attribute of every <img> tag and write one per line
        img_urls = response.css('img::attr(src)').extract()
        with open('urls.txt', 'w') as f:
            for u in img_urls:
                f.write(u + "\n")


def main():
    # the original called an undefined scraper(); run the spider instead
    process = CrawlerProcess()
    process.crawl(HeadphoneSpider)
    process.start()


if __name__ == "__main__":
    main()
Output:
...some stuff above it
2020-01-10 00:37:59 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.carsales.com.au/cars/volkswagen/golf/7-series/wagon-bodystyle/diesel-fueltype/>: HTTP status code is not handled or not allowed
..some more stuff underneath
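Question:
I just don't know how to get around this "not allowed" response so that I can parse the price, kilometres, and so on. It would make my life a lot easier. How can I get past this block? FWIW, I also tried with BeautifulSoup, without success.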
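The 403 in the log above means the server refused the request. One way to narrow this down before changing the spider is to check whether the status code depends on the request headers. Below is a minimal sketch using only the standard library. The helper name `fetch_status` and the values in `BROWSER_HEADERS` are illustrative assumptions, not something the site is known to require, and a site with stronger bot detection may return 403 regardless of headers.

```python
import urllib.error
import urllib.request

# Illustrative browser-like headers (assumed values, chosen for the example).
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Accept-Language": "en-AU,en;q=0.9",
}


def fetch_status(url, headers=None, timeout=10):
    """Send a GET request to `url` and return the HTTP status code."""
    request = urllib.request.Request(url, headers=headers or {})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx responses; the status code is still available
        return err.code
```

Comparing `fetch_status(url)` with `fetch_status(url, BROWSER_HEADERS)` for the listing URL shows whether headers alone change the outcome. If they do, the same dictionary can be passed to Scrapy through the `headers=` argument of `scrapy.Request`, or the user agent can be set with the `USER_AGENT` setting. Adding `handle_httpstatus_list = [403]` to the spider class makes Scrapy pass the 403 response to `parse` instead of dropping it, which helps when inspecting what the server actually sent back.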
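【Discussion】:
-
Does this answer your question? Scraping in Python - Preventing IP ban
-
@ggorlen Should I be using Scrapy? It seems to be another level beyond BeautifulSoup - help!~?
-
I don't know, but there are plenty of questions and articles online about this problem, so I think you need to do more research before this question can get useful attention. There are all sorts of general techniques (described in the dupe and in other threads) that can help you.
-
@ggorlen I'm trying to use Scrapy, but I get stuck at not allowed... :/
Tags: scrapy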