【Question title】: How to scrape images that do not load completely on page load using Python/Selenium/BeautifulSoup?
【Posted】: 2021-07-30 22:47:30
【Question description】:

I am trying to scrape an e-commerce website. I can scrape all the data successfully except the images: I get the first 3 or 4 image URLs, but the rest come back as placeholders. Here is my code:

import requests
import bs4
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = 'https://pages.daraz.com.bd/'
offers = url + 'wow/gcp/daraz/megascenario/bd/ramadan_eidcampaign_april21/grocery_free_shipping'
driver = webdriver.Chrome(executable_path=r'D:\Py\Hive-Ecommerce\static\chromedriver.exe')
driver.get(offers)
output = []
wait = WebDriverWait(driver, 30)
wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "product2-in-a-row-item")))
html = driver.page_source
soup = bs4.BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid a bs4 warning
driver.close()
for product in soup.find_all("div", {"class": "product2-in-a-row-item"}):
    image = product.find("img", {"class": "rax-image"})
    title = product.find("span", {"class": "product-item-bottom-title"})
    price = product.find_all("div", {"class": "lzd-price"})
    discount = product.find_all("span", {"class": "text"})
    link = product.find("a", {"class": "lzd-item"})
    image = image['src']
    productName = title.text
    price = price[0].text if len(price) else 0
    discount = discount[0].text if len(discount) else 0
    link = link['href']
    print(image)

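Is there any way to scrape all the images correctly?

【Question discussion】:

    Tags: python selenium web-scraping beautifulsoup


    【Solution 1】:

    The data you see is loaded via Ajax from an external URL. You can use the following example to load the image URLs directly.

    Note: the script may need fresh cookie values. If you open Firefox Developer Tools -> Network tab, you will see the request there along with all of its parameters and cookies: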

    import json
    import requests
    
    
    api_url = "https://acs-m.daraz.com.bd//h5/mtop.lazada.kangaroo.core.service.route.drzaldlampservice/1.0/"
    
    cookies = {
        "_m_h5_tk": "82c8b6ce7a958daa1f7ce6279854d666_1620564702086",
        "_m_h5_tk_enc": "1a493abf3c3bd09bc1254fdbd0974ecb",
    }
    
    params = {
        "jsv": "2.5.1",
        "appKey": "24936599",
        "t": "1620555452560",
        "sign": "df68466735d67d4fd3cd15e4d58dba7a",
        "api": "mtop.lazada.kangaroo.core.service.route.drzAldLampService",
        "v": "1.0",
        "type": "originaljson",
        "isSec": "1",
        "AntiCreep": "true",
        "timeout": "20000",
        "dataType": "json",
        "sessionOption": "AutoLoginOnly",
        "x-i18n-language": "en-BD",
        "x-i18n-regionID": "BD",
        "data": '{"pageNo":1,"pageSize":30,"pageId":80065583,"platform":"pc","appId":"1472729","bizId":"1000003","terminalType":0,"language":"en","currency":"pkr","regionId":"BD","cna":"","backupParams":"currency,regionId,terminalType,language,id","_pvuuid":"","curPageUrl":"https://pages.daraz.com.bd/wow/gcp/daraz/megascenario/bd/ramadan_eidcampaign_april21/grocery_free_shipping","isbackup":true}',
    }
    
    data = requests.get(api_url, params=params, cookies=cookies).json()
    
    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))
    
    for d in data["data"]["resultValue"]["1472729"]["data"]:
        print(
            "{:<70} {:<6} {}".format(d["itemTitle"], d["itemPrice"], d["itemImg"])
        )
    

    Prints:

    Himalayan Pink Salt - 1 kg { হিমালয় পিংক সল্ট }                       1,150  https://static-01.daraz.com.bd/p/81a56ce321dcb5aeae420f7974f12a80.jpg
    Oolong Tea - 48G (Organic Chinese Tea)                                 380    https://static-01.daraz.com.bd/p/233780dab25022001707d673697d50f5.jpg
    SOGOL Saffron jafran { জাফরান } - 10g                                  2,900  https://static-01.daraz.com.bd/p/624f8681b9eac9759012eee14803202d.jpg
    Iranian Saffron jafran { জাফরান } - 5g                                 1,500  https://static-01.daraz.com.bd/p/4ec398138ffcc4a69e9abdf80622f3ba.jpg
    Chinese Herbal Drinks for immunity                                     380    https://static-01.daraz.com.bd/p/4740514fc807143bf09405afb51be36c.jpg
    Sona Pata Powder (সোনাপাতা গুড়া) 1000 Gram                             750    https://static-01.daraz.com.bd/p/mdc/df4670876886daba1f7df0b4c388ba20.jpg
    Green Cardamom Pods - 100g (এলাচি ) JAR                                490    https://static-01.daraz.com.bd/p/43bd74cd21d448116c4b9de14ccbf58f.jpg
    Himalayan Pink Salt - 1kg { হিমালয় পিংক সল্ট }                        1,150  https://static-01.daraz.com.bd/p/81a56ce321dcb5aeae420f7974f12a80.jpg
    Organic MCT Oil 500 ml (Brand- Agrilife, Thailand)                     1,800  https://static-01.daraz.com.bd/p/4a4af95f047ff4f3710c47633ce70622.jpg
    তেলাপোকা দমনের কোরিয়ান জ্যাপস জেল নতুন প্যাকেটে                        2,000  https://static-01.daraz.com.bd/p/2fd4dd44a577a4756e11ca0f3a6060b5.jpg
    Herbal health pack (ভেষজ হেলথ প্যাক) 250 Gram                          590    https://static-01.daraz.com.bd/p/1be7fd5272ea05ea3fd7f9ee3db54f25.jpg
    Gastric and ulcer herbal pack (গ্যাস্ট্রিক ও আলসার ভেষজ প্যাক) 250 Gram 590    https://static-01.daraz.com.bd/p/5011ca734d95ac198197122457fc60de.jpg
    Orjun Powder/ অর্জুন পাওডার-1000 Gram                                  890    https://static-01.daraz.com.bd/p/56a1c4a87341702ed8e05bac7d2f778b.jpg
    Organic Coconut Vinegar - Balsamic Style - 250ml                       1,600  https://static-01.daraz.com.bd/p/eecc1ae4a58fd047ad0cf193c8704bf8.jpg
    Dream Candy Lollipop 5pcs                                              400    https://static-01.daraz.com.bd/p/c244366a6bc1412e6ccfe27f0d7aa728.jpg
    Khalisa Flower Honey খলিশা ফুলের মধু 1 Kg                              1,500  https://static-01.daraz.com.bd/p/196ee2abfc00c622485b89e45ed4120f.jpg
    Whole Natural Dark Chia Seeds - 1 kg UK { সিয়া সীড }                   1,200  https://static-01.daraz.com.bd/p/8390d0d916b9be4ecace753fdfa321a0.jpg
    Manual Hand Juice Maker - Green                                        1,790  https://static-01.daraz.com.bd/p/7feb58f07d05af10af44e1ae045dec38.jpg
    Omor Corporation Manual Juice Maker                                    2,050  https://static-01.daraz.com.bd/p/7d4599b4b28e20fc4cc78ac07b9da709.jpg
    Pistachio Raw- 500gm                                                   1,320  https://static-01.daraz.com.bd/p/75a76ab6c89df51a0264e25340b7fed3.png
    Sunflower Seed - 500gm                                                 1,300  https://static-01.daraz.com.bd/p/8d3f73f5cd037791bc38404833472a37.png
    Whole Natural Dark Chia Seeds - 500g UK { সিয়া সীড }                   790    https://static-01.daraz.com.bd/p/8390d0d916b9be4ecace753fdfa321a0.jpg
    Black Seed Powder /কালোজিরা গুঁড়া- 1000 Gram                           690    https://static-01.daraz.com.bd/p/2ee68eef315b6c6e18b2086f7a07926b.jpg
    Rayner's Organic Raw & Unpasteurised Coconut Vinegar - 250ml           1,500  https://static-01.daraz.com.bd/p/da7fd5b1a11109153ddbb72e8d2d7bd8.jpg
    Bel Powder /বেল গুড়া -1000 Gram                                        690    https://static-01.daraz.com.bd/p/5011ca734d95ac198197122457fc60de.jpg
    Orjun Powder/ অর্জুন পাওডার-500 gram                                   490    https://static-01.daraz.com.bd/p/56a1c4a87341702ed8e05bac7d2f778b.jpg
    Himalayan Pink Salt - 500gm { হিমালয় পিংক সল্ট }                      700    https://static-01.daraz.com.bd/p/81a56ce321dcb5aeae420f7974f12a80.jpg
    Starbucks_Blonde Espresso Roast Coffee 200gm                           1,500  https://static-01.daraz.com.bd/p/2e0145dfd770eb9d94a8429f39fcccf8.jpg
    Sylhet Tea - Raw Tea ( রঙ / রং চা)                                     400    https://static-01.daraz.com.bd/p/9665c994f20f12170c60f6d21019d54b.jpg
    Himalayan Pink Salt - 1kg { হিমালয় পিংক সল্ট }                        1,140  https://static-01.daraz.com.bd/p/81a56ce321dcb5aeae420f7974f12a80.jpg
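The loop in the answer indexes straight into `data["data"]["resultValue"]["1472729"]["data"]`, which raises a `KeyError` if the response is missing a level (for example when the cookies have expired). A minimal sketch of a more defensive walk over the same structure follows. The helper name `iter_products` and the sample payload are hypothetical and exist only for illustration; the field names are the ones used in the answer.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_products(payload: Dict[str, Any], app_id: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (title, price, image_url) for each product in the decoded JSON payload.

    Assumes the layout used in the answer:
    payload["data"]["resultValue"][app_id]["data"].
    Missing levels yield nothing; missing fields become empty strings.
    """
    result_value = (payload.get("data") or {}).get("resultValue") or {}
    items = (result_value.get(app_id) or {}).get("data") or []
    for item in items:
        yield (
            item.get("itemTitle", ""),
            item.get("itemPrice", ""),
            item.get("itemImg", ""),
        )


# Hypothetical payload shaped like the real response, for illustration only.
sample = {
    "data": {
        "resultValue": {
            "1472729": {
                "data": [
                    {
                        "itemTitle": "Example item",
                        "itemPrice": "100",
                        "itemImg": "https://example.com/a.jpg",
                    },
                    {"itemTitle": "No image yet", "itemPrice": "50"},
                ]
            }
        }
    }
}

products = list(iter_products(sample, "1472729"))
for title, price, image_url in products:
    print("{:<20} {:<6} {}".format(title, price, image_url))
```

With the real response, this would be called as `iter_products(data, "1472729")` in place of the indexing loop.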
    

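    【Discussion】:

    • Hey, this works perfectly. Just one question, though: how do I figure out which API is responsible for populating the page with products?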
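Besides reading the browser's Network tab by hand, one possible way to narrow down the candidates is to list the XHR/Fetch responses that Chrome recorded while the page loaded. The sketch below only parses log entries. It assumes they have the shape that Chrome's performance log produces (a JSON string under `"message"` wrapping a DevTools `Network.responseReceived` event). The helper name `xhr_urls` and the sample entries are hypothetical.

```python
import json
from typing import Any, Dict, Iterable, List


def xhr_urls(log_entries: Iterable[Dict[str, Any]]) -> List[str]:
    """Return URLs of XHR/Fetch responses found in performance-log entries."""
    urls = []
    for entry in log_entries:
        try:
            message = json.loads(entry["message"])["message"]
        except (KeyError, TypeError, ValueError):
            continue  # entry is not in the expected shape
        if not isinstance(message, dict):
            continue
        if message.get("method") != "Network.responseReceived":
            continue
        params = message.get("params", {})
        if params.get("type") in ("XHR", "Fetch"):
            urls.append(params.get("response", {}).get("url", ""))
    return urls


# Hypothetical entries, shaped like Chrome performance-log records.
fake_log = [
    {"message": json.dumps({"message": {
        "method": "Network.responseReceived",
        "params": {"type": "XHR",
                   "response": {"url": "https://example.com/api/products"}},
    }})},
    {"message": json.dumps({"message": {
        "method": "Network.responseReceived",
        "params": {"type": "Image",
                   "response": {"url": "https://example.com/logo.png"}},
    }})},
    {"message": "not json"},
]

found = xhr_urls(fake_log)
print(found)
```

With Selenium 4 and Chrome, the real entries can typically be obtained by setting the `goog:loggingPrefs` capability to `{"performance": "ALL"}` on the Chrome options and then calling `driver.get_log("performance")`. Whether the product API shows up as `XHR` or `Fetch` on this site is an assumption to verify in the Network tab.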
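    【Solution 2】:

    Here is a solution that uses Selenium. The data is loaded from an external URL and takes some time to load, so the script scrolls down the page and waits before reading the page source.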

    import requests
    import bs4
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    import time
    url = 'https://pages.daraz.com.bd/'
    offers = url + 'wow/gcp/daraz/megascenario/bd/ramadan_eidcampaign_april21/grocery_free_shipping'
    driver = webdriver.Firefox(executable_path=r'*/*/geckodriver')
    driver.get(offers)
    
    wait = WebDriverWait(driver, 30)
    wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "product2-in-a-row-item")))
    scrolls = 7
    while True:
        scrolls -= 1
        print(scrolls)
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(5)
        if scrolls < 0:
            break
    html = driver.page_source
    output = []
    driver.close()
    soup = bs4.BeautifulSoup(html, 'html.parser')
    for product in soup.find_all("div", {"class": "product2-in-a-row-item"}):
        image = product.find("img", {"class": "rax-image"})
        title = product.find("span", {"class": "product-item-bottom-title"})
        price = product.find_all("div", {"class": "lzd-price"})
        discount = product.find_all("span", {"class": "text"})
    
        links = product.select('img.rax-image[src]')[0]['src']
        print(links)
        # Real image URLs start with "https" (the prefix must not contain a leading space)
        if links.startswith("https"):
            print('link : ', links)
    
        image = image['src']
        productName = title.text
        price = price[0].text if len(price) else 0
        discount = discount[0].text if len(discount) else 0
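The loop above jumps straight to the bottom of the page on every pass. If the site loads an image only when it enters the viewport (an assumption, not something confirmed for this page), images in the middle may never be triggered. A minimal sketch of scrolling in smaller steps follows. The function name `scroll_in_steps`, its parameters, and the `FakeDriver` stand-in are hypothetical; a real Selenium driver can be passed instead because only `execute_script` is used.

```python
import time


def scroll_in_steps(driver, step_px=800, pause=1.0, max_steps=50):
    """Scroll down step by step so lazy-loaded images enter the viewport.

    Stops when the bottom of the page is reached or after max_steps scrolls.
    Returns the number of scroll steps performed.
    """
    steps = 0
    while steps < max_steps:
        driver.execute_script("window.scrollBy(0, arguments[0]);", step_px)
        steps += 1
        time.sleep(pause)  # give the images now in view time to load
        at_bottom = driver.execute_script(
            "return window.innerHeight + window.pageYOffset"
            " >= document.body.scrollHeight - 2;"
        )
        if at_bottom:
            break
    return steps


class FakeDriver:
    """Stand-in for a WebDriver, used only to exercise the loop without a browser."""

    def __init__(self, page_height, viewport=600):
        self.page_height = page_height
        self.viewport = viewport
        self.offset = 0

    def execute_script(self, script, *args):
        if script.startswith("window.scrollBy"):
            max_offset = self.page_height - self.viewport
            self.offset = min(self.offset + args[0], max_offset)
            return None
        return self.offset + self.viewport >= self.page_height - 2


steps_taken = scroll_in_steps(FakeDriver(page_height=3000), step_px=800, pause=0)
print(steps_taken)
```

In the script above, `scroll_in_steps(driver)` would replace the `while True` loop, with `driver.page_source` read afterwards as before. The step size and pause are guesses and will likely need tuning.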
    

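    【Discussion】:

    • Hey, this is actually a nice solution that doesn't require many changes to my code, and I needed this scroll-down functionality. But it still prints some placeholder images.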
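One way to keep placeholders out of the results is to filter them when reading each `img` tag. The sketch below assumes a placeholder is anything in `src` that is not an http(s) URL (for example a `data:` URI) and that the real URL may sit in a lazy-loading attribute. Both points are assumptions about this site, and the attribute names in `LAZY_ATTRS` are common conventions rather than something the page is known to use. The helper name `real_image_url` is hypothetical.

```python
from typing import Mapping, Optional

# Attribute names that lazy-loading scripts commonly use; an assumption,
# not confirmed for this particular site.
LAZY_ATTRS = ("data-src", "data-original", "data-lazy-src")


def real_image_url(attrs: Mapping[str, str]) -> Optional[str]:
    """Return the first attribute value that looks like a real image URL, else None."""
    for name in ("src",) + LAZY_ATTRS:
        value = (attrs.get(name) or "").strip()
        if value.startswith("//"):
            value = "https:" + value  # protocol-relative URL
        if value.startswith(("http://", "https://")):
            return value
    return None


# Hypothetical attribute dictionaries, shaped like BeautifulSoup's Tag.attrs.
examples = [
    {"src": "https://static-01.daraz.com.bd/p/81a56ce321dcb5aeae420f7974f12a80.jpg"},
    {"src": "data:image/png;base64,AAAA", "data-src": "//example.com/img.jpg"},
    {"src": "data:image/png;base64,AAAA"},
]

resolved = [real_image_url(a) for a in examples]
print(resolved)
```

In the scraping loop this would be called as `real_image_url(image.attrs)`, skipping products for which it returns `None`. If the site's placeholder is itself an https URL, this check will not catch it and the placeholder URL would need to be excluded explicitly.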