403 Forbidden errors when pulling HTML with Python, but the page can be viewed in a web browser

【Question title】: 403 Forbidden errors when pulling HTML with Python, but can view in web browser
【Posted】: 2021-04-23 18:47:11
【Question description】:

When I use the requests library in Python to pull the HTML of a given URL, e.g. as follows:

import requests
temp = requests.get(URL)
HTML = temp.text

for some URLs the request is forbidden by Nginx, and only the following HTML is returned:

<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx</center>
</body>
</html>

However, if I open these same URLs in any web browser, I can view the pages without any 403 Forbidden error.

This is the case for these example pages:

URL = http://socialmarketingwriting.com/complete-guide-successful-social-media-manager/
URL = https://rjmccollam.com/podcast/3/

Is there any way to avoid the 403 Forbidden error in these cases?

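【Question discussion】:

Some websites require certain headers in order to work properly. I suspect a simple User-Agent header would solve your problem. See the documentation or any other online resource for more information.
Looks like that was the problem.
@ZhouW Did my answer help you?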

Tags: nginx python-requests

【Solution 1】:

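When you use Python requests without specifying a user agent, it sends its default one (a bot-like one, I guess; it has the form python-requests/<version>). Many websites block it. To see which agents a site asks not to crawl, look at the domain's robots.txt file, for example: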

www.google.com/robots.txt
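To make the point above concrete, here is a minimal sketch that prints the default user agent requests would send and checks it against a robots.txt policy using the standard library's `urllib.robotparser`. The robots.txt content and the `example.com` URL below are made up for illustration, not taken from any real site. Note also that robots.txt only states a crawl policy; a 403 itself comes from the server deciding to reject the request.

```python
import requests
from urllib.robotparser import RobotFileParser

# The User-Agent string requests sends when none is specified.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # "python-requests/" followed by the installed version

# Hypothetical robots.txt content (an assumption for this example).
sample_robots = """\
User-agent: python-requests
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(sample_robots.splitlines())  # parse in memory, no network access

browser_ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
page = "https://example.com/page"  # placeholder URL

bot_allowed = parser.can_fetch(default_ua, page)
browser_allowed = parser.can_fetch(browser_ua, page)
print("default agent allowed:", bot_allowed)
print("browser agent allowed:", browser_allowed)
```

For a real site, `parser.set_url(...)` followed by `parser.read()` would load the actual robots.txt instead of the in-memory sample.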

Here is how to set a user agent:

import requests
from bs4 import BeautifulSoup

sear = "python"  # placeholder search term (assumption: `sear` is not defined in the original snippet)
URL = 'https://google.com/search?q=' + sear
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
headers = {"user-agent": USER_AGENT}  # send a browser-like user agent instead of the default
resp = requests.get(URL, headers=headers)
soup = BeautifulSoup(resp.content, "html.parser")  # use this if you want to scrape the site
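If you fetch several pages, the same idea can be sketched with a `requests.Session` that carries the header on every request. The helper names `make_session` and `fetch_html` are made up for this example, and the timeout value is an arbitrary choice. The snippet only prepares a request to show which header would be sent; it does not contact the site.

```python
import requests

USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"


def make_session(user_agent=USER_AGENT):
    """Return a session whose requests all carry the given User-Agent (hypothetical helper)."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch_html(session, url, timeout=10):
    """Fetch a page and return its HTML; raises requests.HTTPError on 4xx/5xx such as 403."""
    resp = session.get(url, timeout=timeout)
    resp.raise_for_status()
    return resp.text


session = make_session()

# Build the request without sending it, to inspect the headers that would go out.
prepared = session.prepare_request(
    requests.Request("GET", "https://rjmccollam.com/podcast/3/")
)
print(prepared.headers["User-Agent"])
```

Calling `fetch_html(session, URL)` would then perform the actual download. Whether that is enough depends on the site: some servers check more than the User-Agent header, so a 403 may still come back.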

【Discussion】: