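【Posted】: 2021-03-04 12:11:17
【Question】:
I'm trying to collect the links of all 131 products by following every "next page" link on a web page. The next-page button does contain a link to the next page, but building a complete, working URL from it is proving difficult.
So far I have tried: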
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base = 'https://www.phoenixcontact.com{}'
link = 'https://www.phoenixcontact.com/online/portal/gb?1dmy&urile=wcm%3apath%3a/gben/web/main/products/list_pages/DC_charging_cables_P-10-11-01-01/aa4065f9-ec6c-4765-b2c7-d3b31d247fc6'
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"}

def get_links(link):
    r = requests.get(link,headers=headers)
    soup = BeautifulSoup(r.text,"lxml")
    # Yield the absolute link of every product on the current page.
    for item in soup.select("[class='pxc-sales-data-wrp'][data-product-key] h3 > a[href][onclick]"):
        item_link = base.format(item.get("href"))
        yield item_link

    # Follow the "next page" button, if there is one.
    next_page = soup.select_one("[class='pxc-pager'] a[class='pxc-pager-next']")
    if next_page:
        next_page_link = urljoin(link,next_page.get("href"))
        yield from get_links(next_page_link)

if __name__ == '__main__':
    for elem in get_links(link):
        print(elem)
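The approach above keeps returning the links from the first page over and over, instead of the links from the following pages.
How can I use requests to follow the next-page button and get the links from each subsequent page?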
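One way to narrow this down is to separate the pagination loop from the network and parsing code, and to stop as soon as a page URL repeats. The sketch below is a minimal, stdlib-only illustration, not a fix verified against the site: `crawl_pages`, `fetch`, `parse`, and the `example.com` URLs in `FAKE_SITE` are hypothetical names introduced here.

```python
from urllib.parse import urljoin


def crawl_pages(start_url, fetch, parse):
    """Yield absolute item links page by page, stopping if a page URL repeats.

    fetch(url) returns the page body; parse(body) returns a tuple
    (item_hrefs, next_href_or_None). Both are hypothetical callables
    supplied by the caller, so the loop can be tested without a network.
    """
    seen = set()
    url = start_url
    while url and url not in seen:
        seen.add(url)
        item_hrefs, next_href = parse(fetch(url))
        for href in item_hrefs:
            # Resolve each item href against the page it was found on.
            yield urljoin(url, href)
        # Resolve the next-page href the same way; None ends the loop.
        url = urljoin(url, next_href) if next_href else None


# Hypothetical in-memory "site": page 2 points back to page 1 on purpose,
# to show that the visited-set guard prevents an endless loop.
FAKE_SITE = {
    "https://example.com/list?page=1": (["/p/a", "/p/b"], "?page=2"),
    "https://example.com/list?page=2": (["/p/c"], "?page=1"),
}

if __name__ == "__main__":
    for found in crawl_pages(
        "https://example.com/list?page=1",
        fetch=lambda u: u,               # the "body" is just the URL here
        parse=lambda body: FAKE_SITE[body],
    ):
        print(found)
```

In the real script, `fetch` would wrap `requests.get(...).text` and `parse` would wrap the two `soup.select` calls. If the resolved next-page URLs differ from one another but the server still returns the first page, the URL joining is probably not the problem; one thing to try (an assumption, not confirmed for this site) is sending every request through a single `requests.Session()` so that cookies persist between pages.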
【Discussion】:
Tags: python python-3.x web-scraping python-requests