【Question title】: Can't form a working url using some link connected to next page button from a webpage
【Posted】: 2021-03-04 12:11:17
【Question description】:

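I'm trying to parse the links of 131 products by traversing all the next pages of a webpage. The next page button does contain the link to the next page, but forming a complete, working URL out of it seems really difficult.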

webpage link

This is what I've tried so far:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base = 'https://www.phoenixcontact.com{}'
link = 'https://www.phoenixcontact.com/online/portal/gb?1dmy&urile=wcm%3apath%3a/gben/web/main/products/list_pages/DC_charging_cables_P-10-11-01-01/aa4065f9-ec6c-4765-b2c7-d3b31d247fc6'

headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"}

def get_links(link):
    r = requests.get(link,headers=headers)
    soup = BeautifulSoup(r.text,"lxml")
    for item in soup.select("[class='pxc-sales-data-wrp'][data-product-key] h3 > a[href][onclick]"):
        item_link = base.format(item.get("href"))
        yield item_link

    next_page = soup.select_one("[class='pxc-pager'] a[class='pxc-pager-next']")
    if next_page:
        next_page_link = urljoin(link,next_page.get("href"))
        yield from get_links(next_page_link)

if __name__ == '__main__':
    for elem in get_links(link):
        print(elem)

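The approach above gets me the links from the first page over and over again, rather than the links from the next pages.

How can I get the links from the next pages by following the next page button using requests?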

【Question discussion】:

Tags: python python-3.x web-scraping python-requests

【Solution 1】:

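You need to maintain a session; otherwise you will stay on the first page.

You can get the base URL from the page's <base> tag (it is stored as <base href="..">). Try the code below: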

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"}

link = 'https://www.phoenixcontact.com/online/portal/gb?1dmy&urile=wcm%3apath%3a/gben/web/main/products/list_pages/DC_charging_cables_P-10-11-01-01/aa4065f9-ec6c-4765-b2c7-d3b31d247fc6'

# Reuse one session for every request so that state carries over between pages.
s = requests.Session()
s.headers.update(headers)
while True:
    response = s.get(link)
    soup = BeautifulSoup(response.text, "lxml")
    # The next-page href is relative to the page's <base href>, not to the request URL.
    base_url = soup.select_one("base").get("href")

    next_page_element = soup.select_one(".pxc-pager-next")
    if next_page_element is not None:
        next_page_url = next_page_element.get("href")
        # Build the full next-page URL and request it on the next iteration.
        link = base_url + next_page_url
        print(link)
    else:
        # No next-page button: this was the last page.
        break
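
To illustrate why the <base> tag matters, here is a minimal offline sketch using only the standard library. It compares joining a relative href against the page URL (as in the question's `urljoin(link, ...)` call) with joining it against the <base href>. The page URL, markup, and href are made up for illustration, and `BaseHrefParser` and `resolve` are hypothetical helper names, not part of the answer above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseHrefParser(HTMLParser):
    """Records the href of the first <base> tag seen in the document."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "base" and self.base_href is None:
            self.base_href = dict(attrs).get("href")


def resolve(page_url, html, href):
    """Resolve href against <base href> if present, else against page_url."""
    parser = BaseHrefParser()
    parser.feed(html)
    # The <base href> may itself be relative, so resolve it against the page URL first.
    base = urljoin(page_url, parser.base_href) if parser.base_href else page_url
    return urljoin(base, href)


# Hypothetical page URL, markup, and next-page href (assumptions for this sketch).
page_url = "https://example.com/portal/gb?urile=list"
html = '<html><head><base href="https://example.com/portal/gb/"></head><body></body></html>'
next_href = "!ut/p/page2/"

print(urljoin(page_url, next_href))        # joined against the request URL
print(resolve(page_url, html, next_href))  # joined against the <base href>
```

The two printed URLs differ in their path, which is one plausible reason a link built from the request URL alone does not lead to the next page.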

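【Discussion】:

It seems to work fine for getting the second page link. However, when I use it in a loop as above, I get the second page link multiple times and never get past it. Thanks.
@MITHU I updated my post; it can now get all the pages.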
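
As a side note on the repeated-page symptom mentioned in the comment, here is a small sketch of a pagination loop that stops as soon as it is handed a URL it has already visited. This does not fix whatever makes the site return the same page; it only keeps the loop from spinning forever. `crawl_pages` and `fetch_next` are hypothetical names, and the "site" is simulated with a dictionary so the example runs without network access.

```python
def crawl_pages(start_url, fetch_next, max_pages=100):
    """Follow next-page links from start_url, stopping on a repeat or dead end.

    fetch_next is a hypothetical callable: given a page URL it returns the
    absolute URL of the next page, or None when there is no next-page button.
    """
    seen = set()
    url = start_url
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        yield url
        url = fetch_next(url)


# Simulated site whose last page links back to page 2 (illustration only).
fake_site = {"p1": "p2", "p2": "p3", "p3": "p2"}
print(list(crawl_pages("p1", fake_site.get)))
```

In a real scraper, `fetch_next` would request the page through the shared session and return the resolved next-page URL.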