【Posted】: 2021-05-11 12:06:47
【Question】:
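My web-scraping script has recently started hitting 403 errors. It worked for a while with just the basic code, but then began returning 403. I tried setting a User-Agent header to get around this; that worked only briefly, and now it returns 403 as well.
Does anyone know how to get this script running again?
In case it helps, here is some context: the script's purpose is to find out which artists are on which Tidal playlists. For this question I have only included the snippet that fetches the site, since that is where the error occurs.
Thanks in advance!
The basic code looks like this: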
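import bs4
import requests

baseurl = 'https://tidal.com/browse'
# platformlist is defined elsewhere in the full script (not shown here)
for i in platformlist:
    url = baseurl + str(i[0])
    tidal = requests.get(url)
    # raise_for_status() raises requests.HTTPError on any 4xx/5xx response (including 403),
    # so the branch below is only reached for non-error statuses other than 200
    tidal.raise_for_status()
    if tidal.status_code != 200:
        print("Website Error: ", url)
    else:
        soup = bs4.BeautifulSoup(tidal.text, "lxml")
        text = str(soup)
        text2 = text.lower()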
With user agents:
import random

import bs4
import requests

user_agent_list = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
]
url = 'https://tidal.com/playlist/1b418bb8-90a7-4f87-901d-707993838346'
for i in range(1, 4):
    # Pick a random user agent
    user_agent = random.choice(user_agent_list)
    # Set the headers
    headers = {'User-Agent': user_agent}
    # Make the request
    tidal = requests.get(url, headers=headers)
    print("Request #%d\nUser-Agent Sent:%s\n\nHeaders Received by HTTPBin:" % (i, user_agent))
    print(tidal.status_code)
    print("-------------------")

# tidal = requests.get(webpage)
# Only the response from the last loop iteration is processed below
tidal.raise_for_status()
print(tidal.status_code)
# Make webpage content legible
soup = bs4.BeautifulSoup(tidal.text, "lxml")
print(soup)
# Turn bs4 type content into text
text = str(soup)
text2 = text.lower()
【Discussion】:
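One way to narrow this down might be to record the status code for each attempt instead of raising on the first failure, and to send a few more browser-like headers than just User-Agent. The sketch below is only an illustration under assumptions: `build_headers` and `fetch_report` are hypothetical helper names, the extra header values are generic examples rather than anything tidal.com is known to check, and a 403 can have causes that no header change will fix. The `session` argument is assumed to be any object with a requests-style `get(url, headers=..., timeout=...)` method, such as `requests.Session()`.

```python
import random

# Two of the User-Agent strings from the question, reused for illustration.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
]


def build_headers(user_agent):
    """Return a dict of browser-like headers (illustrative values, an assumption)."""
    return {
        'User-Agent': user_agent,
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
    }


def fetch_report(session, url, user_agents, attempts=3):
    """Make up to `attempts` requests and return (user_agent, status_code) pairs.

    Stops early on the first 200 response. `session` only needs a
    requests-style get(url, headers=..., timeout=...) method.
    """
    results = []
    for _ in range(attempts):
        user_agent = random.choice(user_agents)
        response = session.get(url, headers=build_headers(user_agent), timeout=10)
        results.append((user_agent, response.status_code))
        if response.status_code == 200:
            break
    return results
```

With the real library this could be called as `fetch_report(requests.Session(), url, user_agent_list)`. If every attempt still reports 403 regardless of the headers sent, the block is probably not based on the User-Agent alone.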
Tags: python web-scraping beautifulsoup user-agent