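[Posted]: 2021-12-27 19:22:14
[Question]:
I'm doing web scraping and first need to collect the total number of pages. I tested code I had written for another website, but I'm having trouble getting the next-page link (href).
The code is below: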
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import requests

userName = 'brendanm1975'  # just for testing
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
pages = []
with requests.Session() as session:
    page_number = 1
    url = "https://www.last.fm/user/"+userName+"/library/artists?page="
    while True:
        response = session.get(url, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')
        pages.append(url)
        next_link = soup.find("li", class_="pagination-next")
        if next_link is None:
            break
        url = urljoin(url, next_link["href"])
        page_number += 1
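As you can see, this site's href gives the link only as "?page=2", which doesn't let me reach the page content (https://www.last.fm/user/brendanm1975/library/artists?page=2).
I've checked the variables, and the values are being retrieved: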
print(url) # output: https://www.last.fm/user/brendanm1975/library/artists?page=
next_link.find('a').get('href') # output: '?page=2'
Does anyone know how to solve this?
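For reference, a query-only href such as "?page=2" is a valid relative reference, and `urljoin` should resolve it against the current URL by keeping the path and swapping in the new query string. A minimal check, using only the two values printed above:

```python
from urllib.parse import urljoin

# Base URL and relative href exactly as printed in the question
base = "https://www.last.fm/user/brendanm1975/library/artists?page="
resolved = urljoin(base, "?page=2")

# The path of the base is kept; only the query part is replaced
print(resolved)
```

If this prints the full page-2 URL, the relative href itself is probably not the obstacle, and the way the href is read from the parsed HTML is the more likely place to look.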
[Comments]:
-
Maybe use their API instead?
-
What's wrong with getting the href via next_link.find('a').get('href')?
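The suggestion in the last comment could be sketched roughly as follows. The href appears to live on the `<a>` inside the `<li class="pagination-next">`, so indexing the `<li>` itself with `["href"]` would likely raise a `KeyError`. The helper name `next_page_url` and the sample markup are assumptions for illustration, reconstructed from the values printed in the question, not the site's actual HTML.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup


def next_page_url(html, current_url):
    """Return the absolute URL of the next page, or None if there is none.

    Hypothetical helper: reads the href from the <a> nested in the
    "pagination-next" list item rather than from the <li> itself.
    """
    soup = BeautifulSoup(html, "html.parser")
    next_item = soup.find("li", class_="pagination-next")
    if next_item is None:
        return None
    anchor = next_item.find("a")
    if anchor is None or not anchor.get("href"):
        return None
    # Resolve a relative href such as "?page=2" against the current URL
    return urljoin(current_url, anchor["href"])


# Sample markup assumed from the output shown in the question
sample = '<ul><li class="pagination-next"><a href="?page=2">Next</a></li></ul>'
base = "https://www.last.fm/user/brendanm1975/library/artists?page="
print(next_page_url(sample, base))
```

In the loop from the question, this would stand in for the `url = urljoin(url, next_link["href"])` line, with the loop ending when the helper returns `None`.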
Tags: python web-scraping python-requests