【Question title】: Scraping one column from a HTML table using BeautifulSoup
【Posted】: 2019-10-11 02:57:13
【Question description】:

I know there are plenty of questions about BeautifulSoup, but after trying several things I still can't work out how to parse the data I need from this HTML table.

My table looks like this:

<table class="W(100%) M(0)" data-test="historical-prices" data-reactid="33">
    <thead data-reactid="34">
        <tr class="C($tertiaryColor) Fz(xs) Ta(end)" data-reactid="35">
            <th class="Ta(start) W(100px) Fw(400) Py(6px)" data-reactid="36"><span data-reactid="37">Date</span></th>
            <th class="Fw(400) Py(6px)" data-reactid="38"><span data-reactid="39">Open</span></th>
            <th class="Fw(400) Py(6px)" data-reactid="40"><span data-reactid="41">High</span></th>
            <th class="Fw(400) Py(6px)" data-reactid="42"><span data-reactid="43">Low</span></th>
            <th class="Fw(400) Py(6px)" data-reactid="44"><span data-reactid="45">Close*</span></th>
            <th class="Fw(400) Py(6px)" data-reactid="46"><span data-reactid="47">Adj Close**</span></th>
            <th class="Fw(400) Py(6px)" data-reactid="48"><span data-reactid="49">Volume</span></th>
        </tr>
    </thead>
    <tbody data-reactid="50">
        <tr class="BdT Bdc($seperatorColor) Ta(end) Fz(s) Whs(nw)" data-reactid="51">
            <td class="Py(10px) Ta(start) Pend(10px)" data-reactid="52"><span data-reactid="53">Oct 10, 2019</span></td>
            <td class="Py(10px) Pstart(10px)" data-reactid="54"><span data-reactid="55">2,918.55</span></td>
            <td class="Py(10px) Pstart(10px)" data-reactid="56"><span data-reactid="57">2,948.46</span></td>
            <td class="Py(10px) Pstart(10px)" data-reactid="58"><span data-reactid="59">2,917.12</span></td>
            <td class="Py(10px) Pstart(10px)" data-reactid="60"><span data-reactid="61">2,938.13</span></td>
            <td class="Py(10px) Pstart(10px)" data-reactid="62"><span data-reactid="63">2,938.13</span></td>
            <td class="Py(10px) Pstart(10px)" data-reactid="64"><span data-reactid="65">3,217,250,000</span></td>
        </tr>
        <tr class="BdT Bdc($seperatorColor) Ta(end) Fz(s) Whs(nw)" data-reactid="66">
            <td class="Py(10px) Ta(start) Pend(10px)" data-reactid="67"><span data-reactid="68">Oct 09, 2019</span></td>
            <td class="Py(10px) Pstart(10px)" data-reactid="69"><span data-reactid="70">2,911.10</span></td>
            <td class="Py(10px) Pstart(10px)" data-reactid="71"><span data-reactid="72">2,929.32</span></td>
            <td class="Py(10px) Pstart(10px)" data-reactid="73"><span data-reactid="74">2,907.41</span></td>
            <td class="Py(10px) Pstart(10px)" data-reactid="75"><span data-reactid="76">2,919.40</span></td>
            <td class="Py(10px) Pstart(10px)" data-reactid="77"><span data-reactid="78">2,919.40</span></td>
            <td class="Py(10px) Pstart(10px)" data-reactid="79"><span data-reactid="80">2,726,820,000</span></td>
        </tr>
    </tbody>
</table>

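I want to extract the data from the "Adj Close" column. The problem I'm running into is that all the <td> class attributes have the same value.

How can I extract only the data in the "Adj Close" column?

Here is my code so far: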

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

# simple_get is my own helper (not shown here); it is assumed to return the page's HTML.
raw_html = simple_get('https://finance.yahoo.com/quote/%5EGSPC/history?p=%5EGSPC')
soup = BeautifulSoup(raw_html, 'html.parser')
table = soup.find("table", attrs={'class': 'W(100%) M(0)'})
stock_history_list = []

for row in table.find_all('tr'):
    cols = row.find_all('td')
    # Header rows contain only <th>, so they yield no <td> cells and are skipped.
    # Requiring at least six cells also avoids an IndexError on shorter rows.
    if len(cols) > 5:
        stock_history_list.append(cols[5].text.strip())

stock_history_array = np.asarray(stock_history_list)
df = pd.DataFrame(stock_history_array)

【Discussion】:

    Tags: python beautifulsoup


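    【Solution 1】:

    You can convert the HTML into a list of dictionaries keyed by the column headers, which makes it easy to look a column up by name: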

    # d is the parsed document, e.g. d = BeautifulSoup(raw_html, 'html.parser')
    header, *data = [[i.text for i in b.find_all('th' if not b.td else 'td')] for b in d.find_all('tr')]
    result = [dict(zip(header, i)) for i in data]
    vals = [i['Adj Close**'] for i in result]
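As a self-contained illustration of the header/zip approach above, here is a minimal sketch. The inline HTML is a shortened, hypothetical sample with the same shape as the table in the question, and `html.parser` is used only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical shortened sample; the values are copied from the question's table.
html = """
<table data-test="historical-prices">
  <thead>
    <tr><th><span>Date</span></th><th><span>Close*</span></th><th><span>Adj Close**</span></th></tr>
  </thead>
  <tbody>
    <tr><td><span>Oct 10, 2019</span></td><td><span>2,938.13</span></td><td><span>2,938.13</span></td></tr>
    <tr><td><span>Oct 09, 2019</span></td><td><span>2,919.40</span></td><td><span>2,919.40</span></td></tr>
  </tbody>
</table>
"""

d = BeautifulSoup(html, "html.parser")

# One list of cell texts per <tr>: the header row holds <th>, data rows hold <td>.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in d.find_all("tr")
]
header, *data = rows

# Pair every data row with the header names, then pick the column by name.
result = [dict(zip(header, row)) for row in data]
vals = [row["Adj Close**"] for row in result]
print(vals)  # ['2,938.13', '2,919.40']
```

Looking the column up by its header text means the code does not depend on the `<td>` class attributes, which are identical across columns.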
    

    Alternatively, use pandas:

    import pandas as pd
    df = pd.DataFrame(result)
    vals = df['Adj Close**']
    

    Output:

    0    2,938.13
    1    2,919.40
    Name: Adj Close**, dtype: object
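The `dtype: object` in the output above means the prices are still strings with thousands separators. If numbers are needed, one possible follow-up step (not part of the original answer) is to strip the commas and convert; the sketch below rebuilds the two values by hand so that it runs on its own.

```python
import pandas as pd

# The scraped values as strings, copied from the output shown above.
vals = pd.Series(["2,938.13", "2,919.40"], name="Adj Close**")

# Remove the thousands separators, then parse the strings as numbers.
adj_close = pd.to_numeric(vals.str.replace(",", "", regex=False))
print(adj_close.dtype)  # float64
```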
    

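    【Discussion】:

      【Solution 2】:

      You can use nth-of-type. If you already know the column's index you can pass it directly; otherwise, as shown here, derive it from the header row. This requires bs4 4.7.1+.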

      from bs4 import BeautifulSoup as bs
      
      html = '''<table class="W(100%) M(0)" data-test="historical-prices" data-reactid="33">
          <thead data-reactid="34">
              <tr class="C($tertiaryColor) Fz(xs) Ta(end)" data-reactid="35">
                  <th class="Ta(start) W(100px) Fw(400) Py(6px)" data-reactid="36"><span data-reactid="37">Date</span></th>
                  <th class="Fw(400) Py(6px)" data-reactid="38"><span data-reactid="39">Open</span></th>
                  <th class="Fw(400) Py(6px)" data-reactid="40"><span data-reactid="41">High</span></th>
                  <th class="Fw(400) Py(6px)" data-reactid="42"><span data-reactid="43">Low</span></th>
                  <th class="Fw(400) Py(6px)" data-reactid="44"><span data-reactid="45">Close*</span></th>
                  <th class="Fw(400) Py(6px)" data-reactid="46"><span data-reactid="47">Adj Close**</span></th>
                  <th class="Fw(400) Py(6px)" data-reactid="48"><span data-reactid="49">Volume</span></th>
              </tr>
          </thead>
          <tbody data-reactid="50">
              <tr class="BdT Bdc($seperatorColor) Ta(end) Fz(s) Whs(nw)" data-reactid="51">
                  <td class="Py(10px) Ta(start) Pend(10px)" data-reactid="52"><span data-reactid="53">Oct 10, 2019</span></td>
                  <td class="Py(10px) Pstart(10px)" data-reactid="54"><span data-reactid="55">2,918.55</span></td>
                  <td class="Py(10px) Pstart(10px)" data-reactid="56"><span data-reactid="57">2,948.46</span></td>
                  <td class="Py(10px) Pstart(10px)" data-reactid="58"><span data-reactid="59">2,917.12</span></td>
                  <td class="Py(10px) Pstart(10px)" data-reactid="60"><span data-reactid="61">2,938.13</span></td>
                  <td class="Py(10px) Pstart(10px)" data-reactid="62"><span data-reactid="63">2,938.13</span></td>
                  <td class="Py(10px) Pstart(10px)" data-reactid="64"><span data-reactid="65">3,217,250,000</span></td>
              </tr>
              <tr class="BdT Bdc($seperatorColor) Ta(end) Fz(s) Whs(nw)" data-reactid="66">
                  <td class="Py(10px) Ta(start) Pend(10px)" data-reactid="67"><span data-reactid="68">Oct 09, 2019</span></td>
                  <td class="Py(10px) Pstart(10px)" data-reactid="69"><span data-reactid="70">2,911.10</span></td>
                  <td class="Py(10px) Pstart(10px)" data-reactid="71"><span data-reactid="72">2,929.32</span></td>
                  <td class="Py(10px) Pstart(10px)" data-reactid="73"><span data-reactid="74">2,907.41</span></td>
                  <td class="Py(10px) Pstart(10px)" data-reactid="75"><span data-reactid="76">2,919.40</span></td>
                  <td class="Py(10px) Pstart(10px)" data-reactid="77"><span data-reactid="78">2,919.40</span></td>
                  <td class="Py(10px) Pstart(10px)" data-reactid="79"><span data-reactid="80">2,726,820,000</span></td>
              </tr>
          </tbody>
      </table>'''
      soup = bs(html, 'lxml')
      index = [th.text for th in soup.select('[data-test="historical-prices"] th')].index('Adj Close**') + 1
      data = [td.text for td in soup.select(f'[data-test="historical-prices"] td:nth-of-type({index})')]
      print(data)
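The same idea can be wrapped in a small reusable function. This is only a sketch: `column_by_header` and its `table_css` parameter are hypothetical names introduced for illustration, and the inline HTML is a shortened sample shaped like the table in the question. It assumes every row has the same number of cells (no `colspan`) and that data rows contain only `<td>` elements, since `nth-of-type` counts siblings of the same tag.

```python
from bs4 import BeautifulSoup


def column_by_header(soup, header, table_css='[data-test="historical-prices"]'):
    """Return the text of each <td> in the column whose <th> text equals `header`.

    Raises ValueError if no header cell matches.
    """
    headers = [th.get_text(strip=True) for th in soup.select(f"{table_css} th")]
    position = headers.index(header) + 1  # nth-of-type is 1-based
    cells = soup.select(f"{table_css} td:nth-of-type({position})")
    return [td.get_text(strip=True) for td in cells]


# Hypothetical shortened sample; the values are copied from the question's table.
sample = """
<table data-test="historical-prices">
  <thead><tr><th>Date</th><th>Adj Close**</th><th>Volume</th></tr></thead>
  <tbody>
    <tr><td>Oct 10, 2019</td><td>2,938.13</td><td>3,217,250,000</td></tr>
    <tr><td>Oct 09, 2019</td><td>2,919.40</td><td>2,726,820,000</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(sample, "html.parser")
adj_close = column_by_header(soup, "Adj Close**")
print(adj_close)  # ['2,938.13', '2,919.40']
```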
      

      【Discussion】:
