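Problem
Running your code and checking the value of res, you get a 503 error. This means the service is unavailable (HTTP status 503).
Solution
Following up with this SO post, it seems that adding headers={"User-Agent": "Defined"} to the get request does work.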
res = requests.get(url, headers={"User-Agent": "Defined"})
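will return a 200 (OK) response.
Twist
Amazon actually checks for web scrapers, so even though you get a page back, printing the result (print(soup)) may show you the following: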
<body>
<!--
To discuss automated access to Amazon data please contact api-services-support@amazon.com.
For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_c_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_c_ac for advertising use cases.
-->
...
<h4>Enter the characters you see below</h4>
<p class="a-last">Sorry, we just need to make sure you're not a robot. For best results, please make sure your browser is accepting cookies.</p>
</div>
</div>
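Since a 200 status alone does not prove you received the product page, it can help to check the body for the robot-check text before parsing. The following is only a minimal sketch: the helper name looks_like_robot_check and the marker list are illustrative assumptions, with the marker strings taken from the page shown above, and Amazon may change that wording at any time.

```python
# Marker strings copied from the robot-check page shown above. This list is
# an assumption, not an official contract, and may go stale.
CAPTCHA_MARKERS = (
    "Enter the characters you see below",
    "make sure you're not a robot",
)


def looks_like_robot_check(html: str) -> bool:
    """Return True if the HTML contains any known robot-check marker."""
    return any(marker in html for marker in CAPTCHA_MARKERS)
```

With the request from above, you could call looks_like_robot_check(res.text) and skip parsing when it returns True.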
Counter
But you can use selenium to simulate a human. A minimal working example for me was the following:
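import selenium.webdriver
from selenium.webdriver.common.by import By

url = 'http://www.amazon.com/Acer-SB220Q-Ultra-Thin-Frame-Monitor/dp/B07CVL2D2S/ref=lp_16225007011_1_7'
driver = selenium.webdriver.Firefox()  # opens a visible Firefox window
driver.get(url)
# find_element_by_id was removed in Selenium 4; use find_element(By.ID, ...)
title = driver.find_element(By.ID, 'productTitle').text
print(title)
driver.quit()  # close the browser when done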
which prints
Acer SB220Q bi 21.5 Inches Full HD (1920 x 1080) IPS Ultra-Thin Zero Frame Monitor (HDMI & VGA Port), Black
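A small catch with selenium is that it is much slower than the requests library. A new window displaying the page also pops up, but luckily we can do something about that window by using a headless driver.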
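As a rough sketch of the headless variant, assuming Selenium 4 with Firefox and a matching geckodriver available: the function name fetch_product_title is hypothetical, and the imports sit inside the function only so that the helper can be defined without Selenium being loaded until it is called.

```python
def fetch_product_title(url: str) -> str:
    """Load `url` in headless Firefox and return the text of #productTitle."""
    # Imported lazily so that defining this helper does not require Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # run Firefox without a visible window
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        return driver.find_element(By.ID, "productTitle").text
    finally:
        driver.quit()  # always close the browser, even on errors
```

Calling print(fetch_product_title(url)) with the url from the example above should behave like the visible-window version, just without the window. It will still be slower than requests, since a full browser starts for each call.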