【Question title】:Extract HTML Tables With Similar Data from Different Sources with Different Formatting - Python
【Posted】:2020-07-14 12:32:42
【Question description】:

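I am trying to scrape HTML tables from two different HTML sources. The two are very similar: each table contains the same data, but the structure may differ, the column names may differ, and so on. One source may hold all the data in a single table, while the other may split it across two separate tables.

For example, consider the insider holders of the AAPL and MMM stocks.

Screenshots here - https://imgur.com/a/OihTSZR

Suppose the end goal is to extract the total number of shares held by insiders, as a single number. The structure of each table may differ, but what should be similar are keywords such as "Beneficially" or "Stock".

Any help would be greatly appreciated. In a previous post I was able to extract some of the data, but that approach cannot be looped or repeated when the structure differs.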

Extract HTML Table Based on Specific Column Headers - Python

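import pandas as pd

# read_html returns a list of DataFrames; keep only tables with this style that contain "Name/address"
df = pd.read_html("https://www.sec.gov/Archives/edgar/data/66740/000120677420000907/mmm3661701-def14a.htm", attrs={'style': 'border-collapse: collapse; width: 100%; font: 9pt Arial, Helvetica, Sans-Serif'}, match="Name/address")

df = df[0]
# Drop every column that contains at least one NaN
df = df.dropna(axis = 'columns')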

I also tried BeautifulSoup:


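import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/66740/000120677420000907/mmm3661701-def14a.htm'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
tables = soup.find_all('table')
# find_all returns a list of tables, so collect the rows of each table in turn
rows = [row for table in tables for row in table.find_all('tr')]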

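【Comments】:

  • @αԋɱҽԃαмєяιcαη This is really good. Is there any way to have the function return a single output, i.e. the sum of the shares held by insiders? This is where it can get tricky, because AAPL puts all insiders in one table while MMM puts them in two. The function runs fine on my machine, but with the CSV output I have to go in manually and total the shares.
  • If my answer helped you, please tick the check mark next to the answer.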

Tags: python html web-scraping beautifulsoup


【Solution 1】:

This is really complicated, but here we go :).

import re
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup


urls = ['https://www.sec.gov/Archives/edgar/data/320193/000119312520001450/d799303ddef14a.htm',
        'https://www.sec.gov/Archives/edgar/data/66740/000120677420000907/mmm3661701-def14a.htm']


def main(urls):
    with requests.Session() as req:
        for url in urls:
            r = req.get(url)
            soup = BeautifulSoup(r.content, 'html.parser')
            # Table-of-contents links whose text starts with "Security"
            for item in soup.find_all("a", string=re.compile("^Security")):
                # The href looks like "#<anchor>"; drop the leading "#"
                item = item.get("href")[1:]
                # First table that follows the matching <a name="..."> anchor
                catch = soup.find("a", {'name': item}).find_next("table")
                # read_html returns a list of DataFrames
                df = pd.read_html(StringIO(str(catch)))
                print(df)
                df[0].to_csv(f"{item}.csv", index=False, header=False)


main(urls)
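
To make the lookup in the loop above easier to follow, here is a minimal offline sketch of the same idea: find the table-of-contents links whose text starts with "Security", resolve each `href` to its named anchor, and take the first table after that anchor. `SAMPLE_HTML` and the helper name `tables_after_security_links` are made up for illustration and are not part of the original answer; the `None` checks are an added precaution, since the real filings may not always contain a matching anchor.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical miniature of a proxy statement: a table-of-contents link
# pointing at a named anchor, followed by the table we want.
SAMPLE_HTML = """
<a href="#toc_sec">Security Ownership of Certain Beneficial Owners</a>
<a href="#toc_other">Other Matters</a>
<table id="unrelated"><tr><td>not this one</td></tr></table>
<a name="toc_sec"></a>
<p>Some introductory text.</p>
<table id="ownership"><tr><td>Tim Cook</td><td>100</td></tr></table>
"""


def tables_after_security_links(html):
    """Return the first <table> after each anchor targeted by a 'Security...' link."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for link in soup.find_all("a", string=re.compile(r"^Security")):
        target = link.get("href", "")[1:]  # drop the leading '#'
        anchor = soup.find("a", {"name": target})
        if anchor is None:  # the link points at an anchor we cannot resolve
            continue
        table = anchor.find_next("table")
        if table is not None:
            found.append(table)
    return found


tables = tables_after_security_links(SAMPLE_HTML)
print([t.get("id") for t in tables])  # ['ownership']
```

The table that appears before the anchor is skipped because `find_next` only searches forward in the document from the anchor.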

Output of main(urls):

[                                                    0  ...    8
0                                                 NaN  ...  NaN
1                                                 NaN  ...  NaN
2                            Name of Beneficial Owner  ...  NaN
3                                                 NaN  ...  NaN
4                                  The Vanguard Group  ...    %
5                                                 NaN  ...  NaN
6                                     BlackRock, Inc.  ...    %
7                                                 NaN  ...  NaN
8         Berkshire Hathaway Inc. / Warren E. Buffett  ...    %
9                                                 NaN  ...  NaN
10                                         Kate Adams  ...  NaN
11                                                NaN  ...  NaN
12                                    Angela Ahrendts  ...  NaN
13                                                NaN  ...  NaN
14                                         James Bell  ...  NaN
15                                                NaN  ...  NaN
16                                           Tim Cook  ...  NaN
17                                                NaN  ...  NaN
18                                            Al Gore  ...  NaN
19                                                NaN  ...  NaN
20                                        Andrea Jung  ...  NaN
21                                                NaN  ...  NaN
22                                       Art Levinson  ...  NaN
23                                                NaN  ...  NaN
24                                       Luca Maestri  ...  NaN
25                                                NaN  ...  NaN
26                                    Deirdre O’Brien  ...  NaN
27                                                NaN  ...  NaN
28                                          Ron Sugar  ...  NaN
29                                                NaN  ...  NaN
30                                         Sue Wagner  ...  NaN
31                                                NaN  ...  NaN
32                                      Jeff Williams  ...  NaN
33                                                NaN  ...  NaN
34  All current executive officers and directors a...  ...  NaN

[35 rows x 9 columns]]
[                                                   0   1   ...                18  19 
0                        Name  and principal position NaN  ...  Percent of Class NaN  
1                    Thomas “Tony” K. Brown, Director NaN  ...               (5) NaN  
2                           Pamela J. Craig, Director NaN  ...               (5) NaN  
3                           David B. Dillon, Director NaN  ...               (5) NaN  
4                          Michael L. Eskew, Director NaN  ...               (5) NaN  
5                         Herbert L. Henkel, Director NaN  ...               (5) NaN  
6                               Amy E. Hood, Director NaN  ...               (5) NaN  
7                               Muhtar Kent, Director NaN  ...               (5) NaN  
8                           Edward M. Liddy, Director NaN  ...               (5) NaN  
9                           Dambisa F. Moyo, Director NaN  ...               (5) NaN  
10                          Gregory R. Page, Director NaN  ...               (5) NaN  
11                       Patricia A. Woertz, Director NaN  ...               (5) NaN  
12  Michael F. Roman, Chairman of the Board, Presi... NaN  ...               (5) NaN  
13  Inge G. Thulin, Former Executive Chairman of t... NaN  ...               (5) NaN  
14  Nicholas C. Gangestad, Senior Vice President a... NaN  ...               (5) NaN  
15  Ashish K. Khandpur, Executive Vice President, ... NaN  ...               (5) NaN  
16  Julie L. Bushman, Executive Vice President, In... NaN  ...               (5) NaN  
17  Joaquin Delgado, Former Executive Vice Preside... NaN  ...               (5) NaN  
18  Michael G. Vale, Executive Vice President, Saf... NaN  ...               (5) NaN  
19  All Directors and Executive Officers as a Grou... NaN  ...               (5) NaN  

[20 rows x 20 columns]]
[                                                   0   1  ...                  6   7 
0                                       Name/address NaN  ...  Percent  of Class NaN  
1  The Vanguard Group(1) 100 Vanguard Blvd. Malve... NaN  ...               8.78 NaN  
2  State Street Corporation(2) State Street Finan... NaN  ...               7.36 NaN  
3  BlackRock, Inc.(3) 55 East 52nd Street New Yor... NaN  ...               7.30 NaN  

[4 rows x 8 columns]]
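
As for getting a single number out of tables like these, one possible approach is sketched below. It assumes the frames look like the `read_html` output above (no header row, all-NaN spacer rows and columns), finds the first cell matching a keyword pattern such as "Beneficially" or "Stock", and sums the numbers underneath it. The function names, the "as a group" skip rule, and the sample figures are all hypothetical and only illustrate the idea; real filings will likely need per-source tuning, for example when a "Stock options" column appears before the one you want or when large outside holders share the table with insiders.

```python
import re

import pandas as pd


def to_int(cell):
    """Parse cells such as '1,234' or '2,500(1)'; return None if no leading number."""
    if isinstance(cell, str):
        m = re.match(r"\s*(\d[\d,]*)", cell)
        return int(m.group(1).replace(",", "")) if m else None
    if isinstance(cell, (int, float)) and not pd.isna(cell):
        return int(cell)
    return None


def sum_keyword_column(df, header_pattern, skip_pattern=r"as a group"):
    """Sum the numbers below the first cell whose text matches header_pattern."""
    # Spacer rows and columns come back from read_html as all-NaN; drop them.
    df = df.dropna(how="all").dropna(axis="columns", how="all")
    header_re = re.compile(header_pattern, re.IGNORECASE)
    skip_re = re.compile(skip_pattern, re.IGNORECASE)
    rows = df.values.tolist()
    for i, row in enumerate(rows):
        hits = [j for j, c in enumerate(row) if isinstance(c, str) and header_re.search(c)]
        if hits:
            header_idx, col = i, hits[0]
            break
    else:
        raise ValueError("no cell matches the header pattern")
    total = 0
    for row in rows[header_idx + 1:]:
        # Skip the "all directors ... as a group" row to avoid double counting.
        if isinstance(row[0], str) and skip_re.search(row[0]):
            continue
        value = to_int(row[col])
        if value is not None:
            total += value
    return total


nan = float("nan")
# Made-up data in two layouts: spacers plus a late header row, then a compact table.
table_a = pd.DataFrame([
    [nan, nan, nan],
    ["Name of Beneficial Owner", nan, "Shares of Common Stock Beneficially Owned"],
    [nan, nan, nan],
    ["Director A", nan, "1,000"],
    ["Director B", nan, "2,500(1)"],
    ["All current executive officers and directors as a group", nan, "3,500"],
])
table_b = pd.DataFrame([
    ["Name and principal position", "Total Stock", "Percent of Class"],
    ["Director C", "400", "(5)"],
    ["Officer D", "600", "(5)"],
])

pattern = r"beneficially|stock"
total = sum(sum_keyword_column(t, pattern) for t in (table_a, table_b))
print(total)  # 4500
```

Summing over a list of frames is what lets the same call cover a source that keeps everything in one table and a source that splits the insiders across two.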
