Looping through different pages and scraping data on a JavaScript website: answers

【Question title】: Looping through different pages and scraping data on Javascript website
【Posted】: 2021-08-11 03:18:08
【Question】:

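I think I'm close, but I can't figure out why my code isn't working as expected. I want to scrape the data from the first page, then click the next (arrow) button to move to the next page and do the same, and so on until the next arrow button is greyed out, at which point the driver should quit. Any help would be greatly appreciated. The code is below: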

   
import selenium
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
from bs4 import *
import time
import pandas as pd
import pickle
import html5lib


options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
chrome_driver_path = '/Users/Justin/Desktop/Python/chromedriver'
driver = webdriver.Chrome(executable_path=chrome_driver_path)
url = "https://cryptoli.st/lists/fixed-supply"
driver.get(url)
time.sleep(3)
page = driver.page_source

master_list = []

def get_next_page(url):
    while driver.find_element_by_xpath('/html/body/div/div/div/div[2]/div[1]/div[2]/div[2]/div[2]/ul/li[9]') != True:
        driver.find_elements_by_xpath(
            '/html/body/div/div/div/div[2]/div[1]/div[2]/div[2]/div[2]/ul/li[9]/a').click()
        soup = BeautifulSoup(page, 'html5lib')
        container = soup.find_all('div', attrs={
            'class': 'dataTables_scrollBody'})
        df = pd.read_html(str(container))
        dfs = df[0]
        page_next_data = dfs[['#', 'Name', 'Symbol', 'Max Supply', 'Summary', 'Price', 'Market Cap',
                              '24h Volume', '1h %', '24h %', '7d %', 'Circulation', 'Total Supply', 'Consensus Method']]
        return master_list.append(page_next_data)

    else:
        driver.quit()


def get_data(callback, url):
    global soup, container
    soup = BeautifulSoup(page, 'html5lib')
    container = soup.find_all('div', attrs={
        'class': 'dataTables_scrollBody'})
    df = pd.read_html(str(container))
    dfs = df[0]
    page_one_data = dfs[['#', 'Name', 'Symbol', 'Max Supply', 'Summary', 'Price', 'Market Cap',
                        '24h Volume', '1h %', '24h %', '7d %', 'Circulation', 'Total Supply', 'Consensus Method']]
    return master_list.append(page_one_data)
    return callback(args)


print(get_data(get_next_page, url))

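This is the result it gives me from the first page, but it doesn't continue to the next page, and it doesn't give me any error or anything.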

    #          Name Symbol  ...  Circulation Total Supply               Consensus Method
0   1       Bitcoin    BTC  ...     18713700     18713700                  Proof of Work
1   4  Binance Coin    BNB  ...    153432897    169432897                            NaN
2   5       Cardano    ADA  ...  31948309441  45000000000       Delegated Proof of Stake
3   7           XRP    XRP  ...  46135372183  99990461026  Federated Byzantine Agreement
4  10  Bitcoin Cash    BCH  ...     18742750     18742750                  Proof of Work
5  11      Litecoin    LTC  ...     66752415     66752415                  Proof of Work
6  12     ChainLink   LINK  ...    428009554   1000000000                  Proof of Work

[7 rows x 14 columns]
(pyfinance) Justins-MacBook-Pro:Python-for-Finance-Repo-master Justin$

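【Comments】:

You've given a good overview of what you want this code to do. However, we also need you to detail what it is actually doing, including a trace of the suspect values (using prints) and an isolation of the problem (removing extraneous code). See minimal, reproducible example (MRE).
Thanks for the tip! I edited the post. It returns the first page fine, but I can't get it to execute the callback function or even throw an error that would tell me what went wrong.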

Tags: python selenium loops web-scraping

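【Solution 1】:

It looks like you've made this too complicated. You have two return statements in one function, and you use an args variable that doesn't exist in the function's scope. See the adapted, working code below: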

import selenium
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
from bs4 import *
import time
import pandas as pd
import pickle
import html5lib

options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
chrome_driver_path = '/usr/local/bin/chromedriver'
driver = webdriver.Chrome(executable_path=chrome_driver_path)
url = "https://cryptoli.st/lists/fixed-supply"
driver.get(url)
time.sleep(3)
page = driver.page_source
master_list = []

def get_next_page(url):
    proceed = True

    while driver.find_element_by_xpath('/html/body/div/div/div/div[2]/div[1]/div[2]/div[2]/div[2]/ul/li[9]/a') != True and proceed:        
        try:
            driver.find_elements_by_xpath('/html/body/div/div/div/div[2]/div[1]/div[2]/div[2]/div[2]/ul/li[9]/a')[0].click()
        except Exception:
            proceed = False

        # Re-read the page source so the table reflects the page currently displayed
        soup = BeautifulSoup(driver.page_source, 'html5lib')
        container = soup.find_all('div', attrs={'class': 'dataTables_scrollBody'})
        df = pd.read_html(str(container))
        dfs = df[0]
        page_next_data = dfs[['#', 'Name', 'Symbol', 'Max Supply', 'Summary', 'Price', 'Market Cap', '24h Volume', '1h %', '24h %', '7d %', 'Circulation', 'Total Supply', 'Consensus Method']]
        master_list.append(page_next_data)
    else:
        driver.quit()

def get_data(callback, url):
    global soup, container
    soup = BeautifulSoup(page, 'html5lib')
    container = soup.find_all('div', attrs={'class': 'dataTables_scrollBody'})
    df = pd.read_html(str(container))
    dfs = df[0]
    page_one_data = dfs[['#', 'Name', 'Symbol', 'Max Supply', 'Summary', 'Price', 'Market Cap', '24h Volume', '1h %', '24h %', '7d %', 'Circulation', 'Total Supply', 'Consensus Method']]
    master_list.append(page_one_data)
    
    return callback(url)

get_data(get_next_page, url)
print(master_list)
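
A detail to keep in mind with loops like this: the string returned by `driver.page_source` is a snapshot of the DOM at the moment it is read, so it has to be read again after each click. The sketch below illustrates this with a hypothetical `FakeBrowser` class standing in for the Selenium driver (it is not part of Selenium), so it runs without a browser.

```python
class FakeBrowser:
    """Hypothetical stand-in for a Selenium driver, for illustration only."""

    def __init__(self, pages):
        self._pages = pages
        self._index = 0

    @property
    def page_source(self):
        # Like Selenium, return the HTML of whatever page is showing now.
        return self._pages[self._index]

    def click_next(self):
        # Move to the next page, as clicking the "next" arrow would.
        self._index += 1


browser = FakeBrowser(["<p>page 1</p>", "<p>page 2</p>"])

snapshot = browser.page_source  # read once, before clicking
browser.click_next()

print(snapshot)             # still the first page
print(browser.page_source)  # re-reading gives the second page
```

A variable assigned once before the loop behaves like `snapshot` here: parsing it on every iteration yields the first page's table each time.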

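【Comments】:

Thank you! I'm teaching myself Python, and I didn't know where to start. I don't understand the "and proceed:" part. Is proceed a special variable in Python that, when set to True, tells the program to continue to the next step? I also see that Exception is capitalized and you haven't defined it. Is "except Exception" just a short way of saying "if you hit an exception, stop iterating"? Thanks again! This was very helpful.
proceed is just a variable name, and it can be changed. The try and except are only there to find out when the next arrow button is no longer clickable (greyed out). A more appropriate approach would be to use a webdriver method to test whether the element is clickable, as in this post. Finally, if you quit the driver inside the except block, you will get an error when the while condition is evaluated, because the browser window has already been closed by then.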
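
The control flow described in this comment can be sketched without Selenium: check whether the next button is still usable before clicking, and quit the driver once, after the loop has finished. The names `scrape_all_pages`, `read_page`, `has_next`, and `go_next` are hypothetical; they are plain callables standing in for the real driver calls.

```python
def scrape_all_pages(read_page, has_next, go_next):
    """Collect the data from every page, following the "next" button.

    read_page() returns the data on the current page,
    has_next() reports whether the "next" button is still usable,
    go_next() clicks it. All three are supplied by the caller.
    """
    results = [read_page()]          # first page
    while has_next():                # stop once the button is greyed out
        go_next()
        results.append(read_page())  # re-read after every click
    return results


# Tiny in-memory stand-in for a three-page site.
pages = ["page 1", "page 2", "page 3"]
state = {"index": 0}


def read_page():
    return pages[state["index"]]


def has_next():
    return state["index"] < len(pages) - 1


def go_next():
    state["index"] += 1


all_pages = scrape_all_pages(read_page, has_next, go_next)
print(all_pages)
```

With Selenium, `has_next` could inspect the next button, for example by checking whether its `class` attribute (read with `get_attribute('class')`) contains `disabled`; that class name is an assumption about the site's markup and should be verified in the browser's inspector. Calling `driver.quit()` only after `scrape_all_pages` returns avoids evaluating a loop condition against a closed browser.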

【Solution 2】:

It seems that, even now, you haven't tested any smaller pieces of the code. Look at your logic:

while driver...
    driver....click()
    ...
    return master_list.append(...)

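This exits the function on the first iteration.
append is an in-place operation; it always returns None.

In short, you could skip the loop, the clicking, and the data extraction, and simply replace the whole function with the body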

return None
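
Both points are easy to check in isolation. The following is a minimal sketch with hypothetical function names; plain integers stand in for the scraped tables.

```python
def collect_broken(pages):
    results = []
    for page in pages:
        # `return` leaves the function on the first iteration, and
        # list.append() returns None, so the caller always gets None.
        return results.append(page)


def collect_fixed(pages):
    results = []
    for page in pages:
        results.append(page)  # append in place and keep looping
    return results            # return the list once, after the loop


print(collect_broken([1, 2, 3]))  # None
print(collect_fixed([1, 2, 3]))   # [1, 2, 3]
```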

【Comments】: