【Question title】: Scraping table from Python Beautifulsoup
【Posted】: 2021-01-04 00:32:11
【Question】:

I'm trying to scrape a table from this website: https://stockrow.com/VRTX/financials/income/quarterly

I'm using Python in Google Colab, and I'd like the dates (e.g. 2020-06-30) to be the columns. My code looks like this:

import urllib.request
import bs4 as bs
source = urllib.request.urlopen('https://stockrow.com/VRTX/financials/income/quarterly').read()
soup = bs.BeautifulSoup(source,'lxml')
table = soup.find_all('table')

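However, I can't get the table. I'm fairly new to scraping, so I looked at other Stack Overflow pages but couldn't solve the problem. Could you help me? It would be much appreciated.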

【Comments】:

    Tags: python web-scraping beautifulsoup google-colaboratory


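    【Answer 1】:

    The first problem is that the table is loaded via JavaScript. urllib only fetches the initial HTML and never runs the page's scripts, so the table does not exist yet when BeautifulSoup parses the response. To get around this you need Selenium.

    The second problem is that the HTML contains no table tag at all; the page uses a grid layout instead.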
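
A quick way to check both points is to parse the HTML the server returns and count the matching tags. The sketch below uses a made-up stand-in for such a response: the `raw_html` string is an assumption, not the real stockrow.com markup. It shows an empty root `div` plus a script tag, which is roughly what a JavaScript-rendered page looks like before a browser has run its code.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML a JavaScript-rendered page returns
# before any script has run; this is NOT the real stockrow.com response.
raw_html = """
<html>
  <body>
    <div id="root"></div>
    <script src="/app.js"></script>
  </body>
</html>
"""

soup = BeautifulSoup(raw_html, "html.parser")

# Neither a <table> tag nor the rendered grid exists in the static markup.
tables = soup.find_all("table")
grids = soup.find_all("div", {"class": "mainGrid"})

print(len(tables), len(grids))  # prints "0 0"
```

If the same check on the real response also finds nothing, the content is being built in the browser and a plain HTTP fetch will not see it.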

    Since you are using Google Colab, you need to install the Selenium web driver there (code taken from this answer):

    !pip install selenium
    !apt-get update # to update ubuntu to correctly run apt install
    !apt install chromium-chromedriver
    !cp /usr/lib/chromium-browser/chromedriver /usr/bin
    import sys
    sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
    from selenium import webdriver
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
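    # recent Selenium 4 releases no longer accept a positional driver path or the
    # chrome_options keyword; chromedriver is found on PATH (copied to /usr/bin above)
    wd = webdriver.Chrome(options=chrome_options)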
    

    After that you can load the page and parse it:

    from bs4 import BeautifulSoup
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    
    # load page via selenium
    wd.get("https://stockrow.com/VRTX/financials/income/quarterly")
    
    # wait up to 5 seconds for the element with class mainGrid to appear in the DOM
    grid = WebDriverWait(wd, 5).until(EC.presence_of_element_located((By.CLASS_NAME, 'mainGrid')))
    
    # parse content of the grid
    soup = BeautifulSoup(grid.get_attribute('innerHTML'), 'lxml')
    
    # access grid cells, your logic should be here
    for tag in soup.find_all('div', {'class': 'financials-value'}):
      print(tag)
    

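    【Comments】:

    • Thank you very much, this works well. Do you know how to change the output into a table format?
    • The target site uses a JavaScript framework (probably React) to render the table. I don't think they have an alternative rendering mechanism. You can inspect the table's structure in the browser developer tools and parse it using the class names.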
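
As a rough illustration of that last suggestion, the sketch below turns grid-style `div` markup into a pandas DataFrame with dates as columns. Only the class names `mainGrid` and `financials-value` come from the answer above; `grid-row`, `row-label`, and `col-header` are placeholders that would have to be replaced with whatever the developer tools actually show, and the numbers are sample values.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical grid markup. "grid-row", "row-label" and "col-header" are
# placeholder class names, not taken from the real page.
grid_html = """
<div class="mainGrid">
  <div class="grid-row">
    <div class="col-header">2020-06-30</div>
    <div class="col-header">2020-03-31</div>
  </div>
  <div class="grid-row">
    <div class="row-label">Revenue</div>
    <div class="financials-value">1524485000</div>
    <div class="financials-value">1515107000</div>
  </div>
  <div class="grid-row">
    <div class="row-label">Gross Profit</div>
    <div class="financials-value">1339965000</div>
    <div class="financials-value">1352610000</div>
  </div>
</div>
"""

soup = BeautifulSoup(grid_html, "html.parser")

# Column names come from the header cells (the dates).
dates = [c.get_text(strip=True)
         for c in soup.find_all("div", {"class": "col-header"})]

labels, rows = [], []
for row in soup.find_all("div", {"class": "grid-row"}):
    label = row.find("div", {"class": "row-label"})
    if label is None:  # the header row has no label cell
        continue
    labels.append(label.get_text(strip=True))
    rows.append([float(v.get_text(strip=True))
                 for v in row.find_all("div", {"class": "financials-value"})])

# Rows are line items, columns are dates.
table = pd.DataFrame(rows, index=labels, columns=dates)
print(table)
```

With Selenium, `grid_html` would be the `innerHTML` of the `mainGrid` element instead of a literal string.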
    【Answer 2】:

    You can use their API to load the data:

    import requests
    import pandas as pd
    
    
    indicators_url = 'https://stockrow.com/api/indicators.json'
    data_url = 'https://stockrow.com/api/companies/VRTX/financials.json?ticker=VRTX&dimension=Q&section=Income+Statement'
    
    indicators = {i['id']: i for i in requests.get(indicators_url).json()}
    all_data = []
    for d in requests.get(data_url).json():
        d['id'] = indicators[d['id']]['name']
        all_data.append(d)
    
    df = pd.DataFrame(all_data)
    df.to_csv('data.csv')
    print(df)
    

    Prints:

                                         id    2020-06-30    2020-03-31    2019-12-31   2019-09-30   2019-06-30  ...   2011-12-31   2011-09-30    2011-06-30    2011-03-31    2010-12-31    2010-09-30
    0          Consolidated Net Income/Loss   837270000.0   602753000.0   583234100.0   57518000.0  267427000.0  ...  188141000.0  228452000.0  -199318000.0  -176096000.0  -180392000.0  -208957000.0
    1      EPS (Basic, from Continuous Ops)        3.2248        2.3199        2.2654       0.2239        1.044  ...       0.9374        1.109       -0.9751       -0.8703       -0.8966       -1.0402
    2                     Net Profit Margin        0.5492        0.3978        0.4127       0.0606       0.2841  ...       0.2816       0.3354       -1.5213       -2.3906       -2.7531       -8.7816
    3                          Gross Profit  1339965000.0  1352610000.0  1228253000.0  817914000.0  805553000.0  ...  533213000.0  620794000.0   105118000.0    70996000.0    62475000.0    20567000.0
    4                  Income Tax Provision   -12500000.0    54781000.0    93716000.0   13148000.0   59711000.0  ...   22660000.0  -27842000.0    24448000.0           0.0           NaN           0.0
    5                      Operating Income   718033000.0   720224100.0   551464400.0   99333000.0  269960000.0  ...  223901900.0  215707000.0  -165890000.0  -159899000.0  -166634000.0  -199588000.0
    6                                  EBIT   718033000.0   720224100.0   551464700.0   99333000.0  269960000.0  ...  223901900.0  215707000.0  -165890000.0  -159899000.0  -166634000.0  -199588000.0
    7         EPS (Diluted, from Cont. Ops)        3.1787        2.2874        2.2319       0.2208       1.0293  ...       1.0011       1.0415       -0.9751       -0.8703       -0.8966       -1.0402
    8                                EBITDA   744730000.0   747045000.0   577720400.0  125180000.0  297658000.0  ...  233625900.0  223457000.0  -157181000.0  -151041000.0  -158429000.0  -192830000.0
    9             EPS (Basic, Consolidated)        3.2248        2.3199        2.2654       0.2239        1.044  ...       0.9374        1.109       -0.9751       -0.8703       -0.8966       -1.0402
    10                                  EBT   824770000.0   657534000.0   676950000.0   70666000.0  327138000.0  ...  210801000.0  200610000.0  -174870000.0  -176096000.0  -180392000.0  -208957000.0
    11           Operating Cash Flow Margin        0.6812        0.5384        0.3156       0.3525       0.4927  ...       0.8941       0.0651       -1.8894       -2.5336        -2.535       -6.8918
    12                           EBT margin         0.541         0.434         0.479       0.0744       0.3475  ...       0.3742       0.3043       -1.5283       -2.3906       -2.7531       -8.7816
    13                          EBIT Margin         0.471        0.4754        0.3902       0.1046       0.2868  ...       0.3975       0.3272       -1.4498       -2.1707       -2.5431       -8.3878
    14    Income from Continuous Operations   837270000.0   602753000.0   583234000.0   57518000.0  267427000.0  ...  188141000.0  228452000.0  -199318000.0  -176096000.0  -180392000.0  -208957000.0
    15                         R&D Expenses   420928000.0   448528000.0   480011000.0  555948000.0  379091000.0  ...  186438000.0  189052000.0   173604000.0   158612000.0   168888000.0   170434000.0
    16      Non-operating Interest Expenses    13871000.0    14136000.0    14249000.0   14548000.0   14837000.0  ...   11659000.0    7059000.0     6962000.0    12001000.0     7686000.0     3951000.0
    17                        EBITDA Margin        0.4885        0.4931        0.4088       0.1318       0.3162  ...       0.4147        0.339       -1.3737       -2.0505       -2.4179       -8.1038
    18         Non-operating Income/Expense   106737000.0   -62690000.0   125485000.0  -28667000.0   57178000.0  ...  -13101000.0  -15097000.0    -8980000.0   -16197000.0   -13758000.0    -9369000.0
    19                          EPS (Basic)          3.22          2.32          2.26         0.22         1.04  ...         0.76         1.06         -0.85         -0.87          -0.9         -1.04
    20                         Gross Margin         0.879        0.8927        0.8691       0.8611       0.8558  ...       0.9465       0.9417        0.9187        0.9638        0.9535        0.8643
    21                              Revenue  1524485000.0  1515107000.0  1413265000.0  949828000.0  941293000.0  ...  563340000.0  659200000.0   114424000.0    73662000.0    65524000.0    23795000.0
    22            Shares (Diluted, Average)   263403000.0   263515000.0   262108000.0  260473000.0  259822000.0  ...  217602000.0  219349000.0   204413000.0   202329000.0   201355000.0   200887000.0
    23                      Cost of Revenue   184520000.0   162497000.0   185012000.0  131914000.0  135740000.0  ...   30127000.0   38406000.0     9306000.0     2666000.0     3049000.0     3228000.0
    24                        SG&A Expenses   191804000.0   182258000.0   195277000.0  159674000.0  156502000.0  ...  121881000.0  110654000.0    96663000.0    71523000.0    62478000.0    48855000.0
    25          EPS (Diluted, Consolidated)        3.1787        2.2874        2.2319       0.2208       1.0293  ...       1.0011       1.0415       -0.9751       -0.8703       -0.8966       -1.0402
    26                       Revenue Growth        0.6196         0.765        0.6242       0.2107       0.2515  ...       7.5975      26.7033        2.6185        2.2842        0.9335       -0.0466
    27             Shares (Basic, Weighted)   259637000.0   259815000.0   256728000.0  256946000.0  256154000.0  ...  204891000.0  206002000.0   204413000.0   202329000.0   200402000.0   200887000.0
    28                     Income after Tax   837270000.0   602753000.0   583234000.0   57518000.0  267427000.0  ...  188141000.0  228452000.0  -199318000.0  -176096000.0  -180392000.0  -208957000.0
    29                        EPS (Diluted)          3.18          2.29          2.23         0.22         1.03  ...         0.74         1.02         -0.85         -0.87          -0.9         -1.04
    30                    Net Income Common   837270000.0   602753000.0   583234100.0   57518000.0  267427000.0  ...  158629000.0  221110000.0  -174069000.0  -176096000.0  -180392000.0  -208957000.0
    31           Shares (Diluted, Weighted)   263403000.0   263515000.0   260673000.0  260473000.0  259822000.0  ...  208807000.0  219349000.0   204413000.0   202329000.0   200402000.0   200887000.0
    32             Non-Controlling Interest           NaN           NaN           NaN          NaN          NaN  ...   29512000.0    7342000.0   -25249000.0           0.0           NaN           0.0
    33                Dividends (Preferred)           NaN           NaN           NaN          NaN          NaN  ...          NaN          NaN           NaN           NaN           NaN           NaN
    34   EPS (Basic, from Discontinued Ops)           NaN           NaN           NaN          NaN          NaN  ...          NaN          NaN           NaN           NaN           NaN           NaN
    35        EPS (Diluted, from Disc. Ops)           NaN           NaN           NaN          NaN          NaN  ...          NaN          NaN           NaN           NaN           NaN           NaN
    36  Income from Discontinued Operations           NaN           NaN           NaN          NaN          NaN  ...          NaN          NaN           NaN           NaN           NaN           NaN
    
    [37 rows x 41 columns]
    

    and saves the result to data.csv.
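
The id-to-name mapping in the snippet above can be tried offline. The sketch below assumes payload shapes inferred from how that code indexes them (indicators with `id` and `name`, data rows with an `id` plus one key per quarter). The id strings and the helper name `build_frame` are made up for illustration.

```python
import pandas as pd

# Made-up payloads shaped the way the snippet above indexes them.
# The id strings are placeholders, not real API values.
sample_indicators = [
    {"id": "ind-1", "name": "Revenue"},
    {"id": "ind-2", "name": "Gross Profit"},
]
sample_rows = [
    {"id": "ind-2", "2020-06-30": 1339965000.0, "2020-03-31": 1352610000.0},
    {"id": "ind-1", "2020-06-30": 1524485000.0, "2020-03-31": 1515107000.0},
]


def build_frame(indicators, rows):
    """Replace each row's indicator id with its readable name."""
    names = {i["id"]: i["name"] for i in indicators}
    # .get() falls back to the raw id when an indicator is unknown,
    # instead of raising KeyError as a plain lookup would.
    renamed = [{**row, "id": names.get(row["id"], row["id"])} for row in rows]
    return pd.DataFrame(renamed)


frame = build_frame(sample_indicators, sample_rows)
print(frame)
```

With live data, the two arguments would be `requests.get(indicators_url).json()` and `requests.get(data_url).json()`.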


    Or download their XLSX from that page:

    url = 'https://stockrow.com/api/companies/VRTX/financials.xlsx?dimension=Q&section=Income%20Statement&sort=desc'
    
    df = pd.read_excel(url)
    pd.set_option('display.float_format', lambda x: '%.3f' % x)
    print(df)
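
The `set_option` call above changes float formatting globally for the session. As a small sketch of an alternative, `pd.option_context` applies the same format only inside a `with` block. The two values are copied from the output printed earlier in this answer.

```python
import pandas as pd

# Two values copied from the output printed earlier in this answer.
df = pd.DataFrame(
    {"2020-06-30": [1524485000.0, 0.5492]},
    index=["Revenue", "Net Profit Margin"],
)

# option_context applies the float format only inside the with-block,
# so the global display setting is unchanged afterwards.
with pd.option_context("display.float_format", lambda x: "%.3f" % x):
    formatted = df.to_string()

print(formatted)
```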
    

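    【Comments】:

    • Thank you very much, this all works great! By the way, do you know why the first method doesn't show some rows? It doesn't have the "Revenue" row, for example. Thanks again.
    • @Steve.Kim The rows are there, they are just not sorted. The Revenue row is row number 21.
    • How did you get data_url?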
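
Because the rows come back unsorted, it can be easier to look a line item up by name than by position. A minimal sketch, using three rows and two date columns copied from the printed output above:

```python
import pandas as pd

# Three rows copied from the printed output above (two date columns only).
df = pd.DataFrame(
    {
        "id": ["Gross Profit", "Revenue", "Cost of Revenue"],
        "2020-06-30": [1339965000.0, 1524485000.0, 184520000.0],
        "2020-03-31": [1352610000.0, 1515107000.0, 162497000.0],
    }
)

# Select a line item by name rather than by its position in the frame.
revenue = df.loc[df["id"] == "Revenue"]

# Or index by name and sort, so rows are easier to find by eye.
by_name = df.set_index("id").sort_index()

print(revenue)
print(by_name.loc["Revenue", "2020-06-30"])
```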