【Question title】: Unable to parse some numbers located in innermost pages after using a main link recursively
【Posted】: 2021-04-09 02:04:51
【Question】:

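I'm trying to figure out how to parse order numbers from webpages, starting from here. Specifically, if you go through this link, you can see a Read more link associated with each container, leading to an inner page. On that inner page you will again see Read more links associated with another set of containers; these lead to the innermost pages and eventually to this page, where the order numbers are located.

I can use this piece of code to get the links attached to the Read more links: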

import re
import requests
from bs4 import BeautifulSoup

base = 'https://www.rittal.com{}'
url = 'https://www.rittal.com/com-en/products/PG0002SCHRANK1'

def get_links(s,link):
    r = s.get(link)
    soup = BeautifulSoup(r.text,"html.parser")
    for item_link in soup.select("a.custom-link:contains('Read more')"):
        target_link = base.format(item_link.get("href"))
        yield target_link

if __name__ == '__main__':
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
        for item in get_links(s,url):
            print(item)

I can use the following block of code to parse the order numbers:

item = soup.select_one("list-filter")
item_ids = re.findall(r"variantId=(.*?)\&",str(item))
if item_ids:
    for item_id in item_ids:
        print(item_id)
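
To see what the regular expression above picks up without touching the network, here is a small sketch that runs it against a hand-written fragment. The markup and the IDs in `sample_html`, and the helper name `extract_variant_ids`, are invented for illustration; the only assumption taken from the question is that the `list-filter` element contains URLs with a `variantId=` parameter followed by `&`. The real page may look different.

```python
import re

# Hypothetical fragment: invented markup and IDs, not copied from the real page.
sample_html = (
    '<list-filter data-url="/x?variantId=1111111&amp;lang=en">'
    '<a href="/y?variantId=2222222&amp;lang=en">item</a>'
    '</list-filter>'
)

def extract_variant_ids(html):
    """Return every value found between 'variantId=' and the next '&'."""
    # The non-greedy group stops at the first '&' after each 'variantId='.
    return re.findall(r"variantId=(.*?)&", html)

print(extract_variant_ids(sample_html))
```

Note that the pattern only matches when an `&` follows the value, so an ID that happens to be the last parameter in a URL would be missed.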

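What I can't figure out is how to parse the order numbers recursively, starting from this link.

【Comments】:

Do you mean you want to start from this page, get the list of order details for one item, go back, and then repeat for the next item?
Yes, exactly.

Tags: python python-3.x recursion web-scraping beautifulsoup

【Solution 1】:

Here is one way to achieve this: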

import re
import requests
from bs4 import BeautifulSoup

base = 'https://www.rittal.com{}'
url = 'https://www.rittal.com/com-en/products/PG0002SCHRANK1'

def get_links(s,link):
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    if soup.select_one("list-filter"):
        item = soup.select_one("list-filter")
        ids = re.findall(r"variantId=(.*?)\&",str(item))
        yield ids

    elif soup.select_one("a.custom-link:contains('Read more')"):
        for item_link in soup.select("a.custom-link:contains('Read more')"):
            inner_link = base.format(item_link.get("href"))
            yield from get_links(s,inner_link)

    else:
        yield f"unproductive url: {r.url}"

if __name__ == '__main__':
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
        for item in get_links(s,url):
            print(item)
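
The control flow of the generator above can be tried offline with a toy stand-in for the site. This is only a sketch: `FAKE_SITE`, its paths, and its IDs are made up, and `walk` is a hypothetical name. A dict entry plays the role of an innermost page with IDs, a non-empty list plays the role of a page with Read more links, and an empty list is a dead end. Unlike the answer, which yields one list of IDs per page, this version yields the IDs one by one.

```python
# Hypothetical in-memory "site"; none of these paths or IDs are real.
FAKE_SITE = {
    "/root": ["/cat-a", "/cat-b"],
    "/cat-a": ["/prod-1", "/prod-2"],
    "/cat-b": [],                          # dead end: no links, no IDs
    "/prod-1": {"ids": ["1001", "1002"]},  # innermost page
    "/prod-2": {"ids": ["2001"]},          # innermost page
}

def walk(url):
    """Recursively yield IDs from innermost pages reachable from url."""
    page = FAKE_SITE[url]
    if isinstance(page, dict):
        # Innermost page: hand the IDs to the caller.
        yield from page["ids"]
    elif page:
        # Page with links: delegate to the same generator for each child.
        for child in page:
            yield from walk(child)
    else:
        yield f"unproductive url: {url}"

for item in walk("/root"):
    print(item)
```

The `yield from` in the middle branch is what lets values produced several levels down reach the outermost `for` loop.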

【Comments】:

【Solution 2】:

This is a huge scrape, and it seems to go on forever, because more Read more links keep turning up as I traverse the pages.

import re
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

opts = webdriver.ChromeOptions()
opts.add_argument('--headless')
   
base = 'https://www.rittal.com{}'
url = 'https://www.rittal.com/com-en/products/PG0002SCHRANK1'

def get_links(s,link):
    r = s.get(link)
    soup = BeautifulSoup(r.text,"html.parser")
    #This checks if there is any 'Read More' option in the page and if not, it goes to scrape the order details.
    if len(soup.select("a.custom-link:contains('Read more')")) != 0:     
        for item_link in soup.select("a.custom-link:contains('Read more')"):
            target_link = base.format(item_link.get("href"))
            get_links(s,target_link)     #Recursive Call instead of yielding
            # yield target_link
    else:
        get_order_number(link)   

# Gets the table on the page that contains all the order details
def get_order_number(link):
    driver = None
    try:
        # Selenium 3 style API, as in the original answer. Selenium 4 replaced
        # find_element_by_* with find_element(By..., ...) and executable_path with Service.
        driver = webdriver.Chrome(options=opts, executable_path=r'path to\chromedriver.exe')
        driver.get(link)
        driver.implicitly_wait(10)
        e = driver.find_element_by_xpath('/html/body/div[3]/div/div/div/div/div[2]/div[2]/div/div/button[2]')
        driver.execute_script("arguments[0].click();", e)   # dismiss the cookie prompt

        tr_tags = driver.find_elements_by_tag_name('tr')
        print(driver.find_element_by_tag_name('h1').text)     # print the product name before the order details
        for tr in tr_tags:
            print(tr.text)
    except Exception:      # no product details/page, e.g. Klappe HD
        print('No details available')
    finally:
        if driver is not None:
            driver.quit()  # always close the browser, even after an error

if __name__ == '__main__':
    count = 0
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
        get_links(s,url)

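The final product page uses JavaScript to display the order details, and BeautifulSoup alone cannot see content that is rendered dynamically. That is why I used Selenium here to scrape the details; you can use any other method that suits you.

I've made get_links a recursive function, as you can see in the comments. It works fine, but because I'm using Selenium it may take some time to go through all the products! Thanks for the nice question!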
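
A note on why the direct recursive call works here: with the `yield` commented out, get_links is a plain function, so calling it runs its body. If the function still contained a `yield`, the same bare call would only create a generator object and throw it away. The sketch below illustrates the difference on a toy tree; `TREE`, `leaves_broken`, and `leaves_plain` are hypothetical names and have nothing to do with the real site.

```python
# Hypothetical tree of pages: a node with no children is a "leaf" page.
TREE = {"root": ["a", "b"], "a": [], "b": ["c"], "c": []}

def leaves_broken(node):
    """Generator that forgets `yield from`: inner results are lost."""
    children = TREE[node]
    if not children:
        yield node
    for child in children:
        leaves_broken(child)  # creates a generator object and discards it

def leaves_plain(node, found):
    """Plain function: the recursive call runs and appends to `found`."""
    children = TREE[node]
    if not children:
        found.append(node)
    for child in children:
        leaves_plain(child, found)
    return found

print(list(leaves_broken("root")))  # nothing is collected from the children
print(leaves_plain("root", []))     # leaves are collected through side effects
```

So the two answers are consistent: either keep the generator and delegate with `yield from`, or drop `yield` entirely and let the recursive calls do their work directly.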

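PS: I didn't see the complete final output, because it had been running for about 10 minutes or more (it was working really well, and I had to stop it), since there are a lot of products. I have only answered how to make the recursive call, with a brief touch on getting the order number and other details.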
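
Since the run is long, one option is to bound the crawl while testing. The following is a minimal sketch under stated assumptions, not part of the answer above: `LINKS` is an invented link graph (including a link back to the start), and `crawl` and `max_pages` are hypothetical names. It skips URLs that were already visited and stops after a fixed number of pages. Whether the real site actually repeats links is not something this thread establishes.

```python
# Hypothetical link graph; "/b" links back to "/root" to form a cycle.
LINKS = {
    "/root": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/root"],
    "/c": [],
}

def crawl(start, max_pages):
    """Depth-first walk that skips repeated URLs and stops after max_pages."""
    seen = set()
    order = []

    def visit(url):
        # Stop on repeats or once the page budget is used up.
        if url in seen or len(order) >= max_pages:
            return
        seen.add(url)
        order.append(url)
        for child in LINKS.get(url, []):
            visit(child)

    visit(start)
    return order

print(crawl("/root", max_pages=10))
print(crawl("/root", max_pages=2))
```

In the real scraper, the body of `visit` would be the place to fetch the page and launch the order-detail step; a small `max_pages` makes it practical to check the output before committing to a full run.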

【Comments】: