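【Question Title】: Get information for products after clicking load more
【Posted】: 2020-07-02 11:10:35
【Question】:

I wrote the code below to scrape product information from a page that shows a few products and reveals more when you click "Load more". When I run it, I only get the information for the first few products. I think the code is mostly right and I am missing a small mistake somewhere. Any help would be appreciated. Thanks!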

from selenium import webdriver
import time
from bs4 import BeautifulSoup
import requests
import xlsxwriter

driver = webdriver.Chrome(executable_path=r"C:\Users\Home\Desktop\chromedriver.exe")
driver.get("https://justnebulizers.com/collections/nebulizer-accessories")
soup = BeautifulSoup(driver.page_source, 'html.parser')
time.sleep(4)

button= driver.find_element_by_xpath("//a[@class='load-more__btn action_button continue-button']")
button.click() 
time.sleep(1)
soup = BeautifulSoup(driver.page_source, 'html.parser')

def cpap_spider(url):
    source_code= requests.get(url)
    plain_text= source_code.text
    soup= BeautifulSoup(plain_text, 'html.parser')
    for link in soup.findAll("a", {"class":"product-info__caption"}):
            
        href="https://www.justnebulizers.com"+link.get("href")
        #title= link.string
        each_item(href)    
        print(href)
            #print(title)

def each_item(item_url):
    global cols_names, row_i
    source_code= requests.get(item_url)
    plain_text= source_code.text
    soup= BeautifulSoup(plain_text, 'html.parser')
    table=soup.find("table", {"class":"tab_table"})
    if table:
        table_rows = table.find_all('tr')
    else:
        row_i+=1
        return
    for row in table_rows:
      cols = row.find_all('td')
      for ele in range(0,len(cols)):
        temp = cols[ele].text.strip()
        if temp:
          # Here if you want then you can remove unwanted characters like : ? from temp
          # For example "Actual Weight" and ""
          if temp[-1:] == ":":
            temp = temp[:-1]
          # Name of column
          if ele == 0:
            try:
              cols_names_i = cols_names.index(temp)
            except:
              cols_names.append(temp)
              cols_names_i = len(cols_names) -  1
              worksheet.write(0, cols_names_i + 1, temp)
              continue;
          worksheet.write(row_i, cols_names_i + 1, temp)      
    row_i += 1
    
cols_names=[]
cols_names_i = 0
row_i = 1
workbook = xlsxwriter.Workbook('respiratory_care.xlsx')
worksheet = workbook.add_worksheet()
worksheet.write(0, 0, "href")
    
cpap_spider("https://justnebulizers.com/collections/nebulizer-accessories")
#each_item("https://www.1800cpap.com/viva-nasal-cpap-mask-by-3b-medical")       
workbook.close()

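【Comments】:

  • Can you add the error you are getting?
  • I am not getting any error; I am just not getting the output I want, i.e. information for all products, including those that appear after clicking "Load more". I only get the products that are visible on the page right away.
  • BeautifulSoup will not return anything that happens after you interact with the page (for example, clicking a button).
  • Then how should I do it?
  • You have to stick with Selenium.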
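
To make the last two comments concrete: the question's cpap_spider calls requests.get(url), which downloads a fresh copy of the page in which "Load more" was never clicked. The HTML that reflects the click only exists in the browser, as driver.page_source. Below is a minimal sketch of parsing such an HTML string. It uses only the standard library so it runs without a browser; extract_product_links is a hypothetical helper, and the class name is the one used in the question's code.

```python
from html.parser import HTMLParser


class ProductLinkParser(HTMLParser):
    """Collect the href of every <a> tag that carries a given CSS class."""

    def __init__(self, css_class="product-info__caption"):
        super().__init__()
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        href = attr.get("href")
        if self.css_class in classes and href:
            self.links.append(href)


def extract_product_links(html):
    """Return the product hrefs found in an HTML string, in document order."""
    parser = ProductLinkParser()
    parser.feed(html)
    return parser.links
```

With a live session, calling extract_product_links(driver.page_source) after the clicks should see the extra products, whereas parsing requests.get(url).text will not. BeautifulSoup(driver.page_source, 'html.parser') would serve equally well as the parser.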

Tags: python selenium web-scraping beautifulsoup


【Solution 1】:

You have to click the button and then scroll down, repeating until the button can no longer be found. So I used:

while True:
    try:
        driver.find_element_by_xpath("//a[@class='load-more__btn action_button continue-button']").click()
        print('button found')
        time.sleep(2)
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        print('scrolled down')
    except:
        print('button not found')
        break
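
The loop above relies on a bare except and has no upper bound, so any unrelated error ends it silently and a button that never disappears would keep it running forever. A slightly more defensive variant could look like the sketch below. click_until_gone and its parameters are hypothetical names, and the button lookup is passed in as a callable so the loop can be exercised without a browser.

```python
import time


def click_until_gone(find_button, pause=2.0, max_clicks=50, missing=(Exception,)):
    """Click the element returned by find_button() until the lookup fails.

    find_button: zero-argument callable that returns an object with .click()
                 and raises one of `missing` when the button no longer exists.
    pause:       seconds to wait after each click so new products can load.
    max_clicks:  safety limit in case the button never disappears.
    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        try:
            button = find_button()
        except missing:
            break  # button is gone: everything has been loaded
        button.click()
        clicks += 1
        time.sleep(pause)
    return clicks


# Possible use with the driver from this thread (assumes Selenium is installed):
#   from selenium.common.exceptions import NoSuchElementException
#   xpath = "//a[@class='load-more__btn action_button continue-button']"
#   click_until_gone(lambda: driver.find_element_by_xpath(xpath),
#                    missing=(NoSuchElementException,))
```

Catching only NoSuchElementException means other failures, such as a click being intercepted by an overlay, surface as errors instead of being mistaken for "no more products".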

I also fixed a few issues in your code.

This code loads all the products:

from selenium import webdriver
import time
from bs4 import BeautifulSoup
import requests
import xlsxwriter

driver = webdriver.Chrome(executable_path="chromedriver.exe")

def cpap_spider(url):
    driver.get(url)
    while True:
        try:
            driver.find_element_by_xpath("//a[@class='load-more__btn action_button continue-button']").click()
            print('button found')
            time.sleep(2)
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            print('scrolled down')
            
        except:
            print('button not found')
            break
    # Alternative: collect the links as a list
    # elems = driver.find_elements_by_class_name("product-info__caption")
    # links = [elem.get_attribute('href') for elem in elems]
    # print(links)
    for link in driver.find_elements_by_class_name("product-info__caption"):
        # get_attribute("href") already yields an absolute URL in Selenium,
        # so the domain must not be prepended again
        href = link.get_attribute("href")
        # each_item(href)  # uncomment to write this product's spec table to the workbook
        print(href)
        
def each_item(item_url):
    global cols_names, row_i
    source_code= requests.get(item_url)
    plain_text= source_code.text
    soup= BeautifulSoup(plain_text, 'html.parser')
    table=soup.find("table", {"class":"tab_table"})
    if table:
        table_rows = table.find_all('tr')
    else:
        row_i+=1
        return
    for row in table_rows:
      cols = row.find_all('td')
      for ele in range(0,len(cols)):
        temp = cols[ele].text.strip()
        if temp:
          # Here if you want then you can remove unwanted characters like : ? from temp
          # For example "Actual Weight" and ""
          if temp[-1:] == ":":
            temp = temp[:-1]
          # Name of column
          if ele == 0:
            try:
              cols_names_i = cols_names.index(temp)
            except:
              cols_names.append(temp)
              cols_names_i = len(cols_names) -  1
              worksheet.write(0, cols_names_i + 1, temp)
              continue;
          worksheet.write(row_i, cols_names_i + 1, temp)      
    row_i += 1
    
cols_names=[]
cols_names_i = 0
row_i = 1
workbook = xlsxwriter.Workbook('respiratory_care.xlsx')
worksheet = workbook.add_worksheet()
worksheet.write(0, 0, "href")
    
cpap_spider("https://justnebulizers.com/collections/nebulizer-accessories")
#each_item("https://www.1800cpap.com/viva-nasal-cpap-mask-by-3b-medical")       
workbook.close()
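
A note on building the product URLs: an href read from raw HTML (as with BeautifulSoup's link.get("href") in the question) is typically relative, while Selenium's get_attribute("href") normally returns the resolved absolute URL. Concatenating the domain by hand is only right in the first case. A small sketch that handles both, where absolute_url is a hypothetical helper name:

```python
from urllib.parse import urljoin

BASE_URL = "https://www.justnebulizers.com"


def absolute_url(href, base_url=BASE_URL):
    """Resolve href against base_url; absolute hrefs pass through unchanged."""
    return urljoin(base_url, href)
```

Using absolute_url(href) in cpap_spider would avoid requesting a malformed address such as the domain repeated twice.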

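【Comments】:

  • This does not work. It does not write any entries to the Excel file, and the output only shows links for the first set of products.
  • The cpap_spider(url) function works fine. It gets all the products.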
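
On the empty Excel file: in the answer's code the call to each_item appears to be commented out, and nothing ever writes the href column, so only the header cell reaches the workbook. One way to make this easier to debug is to separate collecting the data from writing it. The sketch below is an assumption about how that could look, not the answerer's method; build_table is a hypothetical helper that mirrors each_item's approach of adding a column the first time a spec name is seen.

```python
def build_table(records, first_column="href"):
    """Turn [(href, {spec_name: value}), ...] into (header, rows).

    A new column is appended the first time a spec name appears, and a
    trailing ':' on a name is dropped, as each_item does in this thread.
    """
    header = [first_column]
    column_of = {}  # spec name -> column index
    rows = []
    for href, specs in records:
        row = [href] + [""] * (len(header) - 1)
        for name, value in specs.items():
            name = name.rstrip(":").strip()
            if name not in column_of:
                column_of[name] = len(header)
                header.append(name)
                row.append("")
            row[column_of[name]] = value
        rows.append(row)
    # Pad earlier rows so every row is as wide as the final header
    width = len(header)
    rows = [r + [""] * (width - len(r)) for r in rows]
    return header, rows
```

The result can be inspected with print before anything is written, and then saved with XlsxWriter, for example by calling worksheet.write_row(i, 0, row) for each i, row in enumerate([header] + rows).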