【Question title】: Unable to parse some numbers located in innermost pages after using a main link recursively
【Posted】: 2021-04-09 02:04:51
【Question】:

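I'm trying to figure out how to parse the order numbers from webpages, starting from here. Specifically, if you browse this link, you can see a Read more link associated with each container, and each one leads to an inner page. On that inner page you will again see Read more links associated with another set of containers. These lead to the innermost pages and finally to this page, where the order numbers are located.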

I can use this code to get the links associated with the Read more links:

import re
import requests
from bs4 import BeautifulSoup

base = 'https://www.rittal.com{}'
url = 'https://www.rittal.com/com-en/products/PG0002SCHRANK1'

def get_links(s,link):
    r = s.get(link)
    soup = BeautifulSoup(r.text,"html.parser")
    for item_link in soup.select("a.custom-link:contains('Read more')"):
        target_link = base.format(item_link.get("href"))
        yield target_link

if __name__ == '__main__':
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
        for item in get_links(s,url):
            print(item)

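I can use the following block of code to parse the order numbers from an innermost page once it has been loaded into soup: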

item = soup.select_one("list-filter")
item_ids = re.findall(r"variantId=(.*?)\&",str(item))
if item_ids:
    for item_id in item_ids:
        print(item_id)
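
To make it easier to see what that regular expression captures, here is a minimal offline sketch. The markup in `sample` is made up for illustration (it is not copied from the real site), and `extract_variant_ids` is a hypothetical helper name. The pattern itself is the one used above, minus the backslash before `&`, which is not needed.

```python
import re

# Hypothetical markup: a made-up stand-in for the <list-filter> element.
sample = (
    '<list-filter>'
    '<a href="/detail?variantId=1234500&amp;categoryPath=/PG0001">A</a>'
    '<a href="/detail?variantId=6789000&amp;categoryPath=/PG0001">B</a>'
    '</list-filter>'
)

def extract_variant_ids(html):
    """Return every value found between 'variantId=' and the next '&'."""
    # (.*?) is non-greedy, so each match stops at the first '&' after the ID.
    return re.findall(r"variantId=(.*?)&", html)

print(extract_variant_ids(sample))
```

If a page has no `variantId=` parameter at all, `re.findall` simply returns an empty list, which is why the `if item_ids:` check above is enough to skip such pages.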

What I can't figure out is how to parse the order numbers recursively, starting from this link.

【Comments】:

  • Do you mean you want to start from this page, get the list of order details for one item, go back, and then repeat for the next item?
  • Yes, exactly.

Tags: python python-3.x recursion web-scraping beautifulsoup


【Solution 1】:

Here is one way to achieve this:

import re
import requests
from bs4 import BeautifulSoup

base = 'https://www.rittal.com{}'
url = 'https://www.rittal.com/com-en/products/PG0002SCHRANK1'

def get_links(s,link):
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    if soup.select_one("list-filter"):
        item = soup.select_one("list-filter")
        ids = re.findall(r"variantId=(.*?)\&",str(item))
        yield ids

    elif soup.select_one("a.custom-link:contains('Read more')"):
        for item_link in soup.select("a.custom-link:contains('Read more')"):
            inner_link = base.format(item_link.get("href"))
            yield from get_links(s,inner_link)

    else:
        yield f"unproductive url: {r.url}"

if __name__ == '__main__':
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
        for item in get_links(s,url):
            print(item)
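
The control flow of the recursive generator can be tried out without any network access. The sketch below is only an illustration: `PAGES` is a hypothetical in-memory stand-in for the site, and `walk` mirrors the three branches of `get_links` (leaf page with IDs, page with further links, dead end).

```python
# Hypothetical in-memory "site": a page either lists child pages
# (the "Read more" links) or holds order numbers.
PAGES = {
    "root": {"links": ["cat-a", "cat-b"]},
    "cat-a": {"links": ["prod-1", "prod-2"]},
    "cat-b": {"links": []},                 # dead end: no links, no IDs
    "prod-1": {"ids": ["111", "112"]},
    "prod-2": {"ids": ["221"]},
}

def walk(page):
    """Yield the ID list of every leaf page below `page`, depth first."""
    node = PAGES[page]
    if node.get("ids"):
        yield node["ids"]
    elif node.get("links"):
        for child in node["links"]:
            # `yield from` forwards every value the nested generator produces.
            yield from walk(child)
    else:
        yield f"unproductive url: {page}"

results = list(walk("root"))
print(results)
```

Because `yield from` passes the inner results straight through, the caller sees one flat stream of values no matter how deep the pages are nested.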

【Comments】:

    【Solution 2】:

    This is a huge scrape, and it keeps running for a long time, because more Read more links turn up as I traverse the pages.

    import re
    import requests
    from bs4 import BeautifulSoup
    from selenium import webdriver
    
    opts = webdriver.ChromeOptions()
    opts.add_argument('--headless')
       
    base = 'https://www.rittal.com{}'
    url = 'https://www.rittal.com/com-en/products/PG0002SCHRANK1'
    
    def get_links(s,link):
        r = s.get(link)
        soup = BeautifulSoup(r.text,"html.parser")
        #This checks if there is any 'Read More' option in the page and if not, it goes to scrape the order details.
        if len(soup.select("a.custom-link:contains('Read more')")) != 0:     
            for item_link in soup.select("a.custom-link:contains('Read more')"):
                target_link = base.format(item_link.get("href"))
                get_links(s,target_link)     #Recursive Call instead of yielding
                # yield target_link
        else:
            get_order_number(link)   
    
    # Gets the table on the page that contains all the order details
    def get_order_number(link):
        driver = None
        try:
            # Raw string, so the backslash in the placeholder path is not read as an escape
            driver = webdriver.Chrome(options=opts, executable_path=r'path to\chromedriver.exe')
            driver.get(link)
            driver.implicitly_wait(10)
            e = driver.find_element_by_xpath('/html/body/div[3]/div/div/div/div/div[2]/div[2]/div/div/button[2]')
            driver.execute_script("arguments[0].click();", e)   # dismiss the cookie prompt
    
            tr_tags = driver.find_elements_by_tag_name('tr')
            print(driver.find_element_by_tag_name('h1').text)     # print the product name before the order details
            for tr in tr_tags:
                print(tr.text)
        except Exception:            # when there are no product details/page, e.g. Klappe HD
            print('No details available')
        finally:
            if driver is not None:
                driver.quit()        # always close the browser, even after a failure
    
    if __name__ == '__main__':
        with requests.Session() as s:
            s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
            get_links(s,url)
    

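    The final product pages use JavaScript to display the order details, and BeautifulSoup only sees the HTML returned by the server, not anything rendered afterwards. So I used Selenium here to scrape the details; you can use whatever other method suits you.

    I have made get_links a recursive method, as you can see in the comments. It works fine, but because I used Selenium it may take some time to go through all the products! Thanks for the nice question!

    PS: I did not see the complete final output, because it ran for about 10 minutes or more and I had to stop it (it was working really well), as there are a lot of products. I have only answered how to make the recursive call, and touched briefly on getting the order number and other details.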
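
Since the crawl keeps finding more Read more links, one possible way to keep it bounded is to remember which pages were already visited and to stop below a chosen depth. The following is a sketch under assumptions, not part of the answer's code: `crawl`, `get_children`, `max_depth` and the `GRAPH` data are all hypothetical, and `get_children` stands in for whatever function returns the Read more targets of a page.

```python
def crawl(start, get_children, max_depth=3):
    """Yield leaf pages reachable from `start`, skipping repeats and deep branches."""
    seen = set()

    def visit(page, depth):
        if page in seen or depth > max_depth:
            return                      # already handled, or too deep
        seen.add(page)
        children = get_children(page)
        if not children:
            yield page                  # no further links: treat as a product page
            return
        for child in children:
            yield from visit(child, depth + 1)

    yield from visit(start, 0)

# Hypothetical link graph with a cycle and a page that is linked twice.
GRAPH = {
    "root": ["a", "b"],
    "a": ["leaf-1", "root"],    # links back to root
    "b": ["leaf-1", "leaf-2"],  # leaf-1 is reachable a second time
}

leaves = list(crawl("root", lambda page: GRAPH.get(page, [])))
print(leaves)
```

With a guard like this, each product page is handled at most once, so an expensive step such as opening a browser is not repeated for pages that are linked from several places.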

    【Comments】:
