【Question title】: Trying to get the data from a site, which does not have an API, using BeautifulSoup
【Posted】: 2019-10-10 12:43:52
【Question description】:

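I am building a scraper that pulls tabular data from a website and then uploads it to an Azure database. I am trying to scrape the data with Beautiful Soup. The site is https://www.pgcb.org.bd/PGCB/?a=pages/hourly_generation_loadshed_display.php and the problem is that its HTML is messy: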

</div>
<!-- main container-->
<div class="grid_18" id="main_container">
<div style="padding-left: 10px; padding-top: 5px;"><img 
src="images/hgen&amp;loadshed.jpg"/></div>
<head>
<style>
        tr:nth-child(even){
            background-color: #ccc;
    }
    tr:hover
    {
        background: #f7dcdf;
    }
</style>
</head>
<table class="layout display responsive-table"><tr>
<th style="text-align: center;">Date</th>
<th style="text-align: center;">Time</th>
<th style="text-align: center;">Generation</th>
<th style="text-align: center;">Demand</th>
<th style="text-align: center;">Shortage</th>
<th style="text-align: center;">Loadshed</th>
<th style="text-align: center;">Remark</th>
</tr> <tr>
<td style="text-align: center;">10-10-2019</td>
<td style="text-align: center;">09:00:00</td>
<td style="text-align: center;">7600.4</td>
<td style="text-align: center;">7600</td>
<td style="text-align: center;">0</td>
<td style="text-align: center;">0</td>
<td style="text-align: center;"></td>
</tr>
<tr>
<td style="text-align: center;">10-10-2019</td>
<td style="text-align: center;">08:00:00</td>
<td style="text-align: center;">7165.2</td>
<td style="text-align: center;">7165</td>
<td style="text-align: center;">0</td>
<td style="text-align: center;">0</td>
<td style="text-align: center;"></td>
</tr>
<tr>

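So far I have tried the code below and got the result above, along with some other text that I can strip out later. However, I need the Date and Time values from cells like these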

<td style="text-align: center;">10-10-2019</td>
<td style="text-align: center;">09:00:00</td>

in tabular format, like:

Date | Time |

10-10-2019 | 9:00:00|

This is what I have done so far:

#import requests
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq # web client

# page to scrape
page_url = "https://www.pgcb.org.bd/PGCB/?a=pages/hourly_generation_loadshed_display.php"
uclient = uReq(page_url)

# parsing the html

page_soup = soup(uclient.read(), "html.parser")
uclient.close()

table1 = page_soup.findAll("table",{"class":"layout display responsive-table"})

Please tell me how to improve this and get the expected result.

【Comments】:

    Tags: python html python-3.x web-scraping beautifulsoup


    【Solution 1】:

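    BeautifulSoup is a great tool, and you could do this with it. But any time I see <table> tags, I just use pandas' .read_html() to do the work (it uses BeautifulSoup under the hood), and then I only need to tidy up the table. It returns a list of DataFrames, one per table tag. In this case there are 2 table tags, and the table you want is at index position 1: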

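    import pandas as pd
    
    url = 'https://www.pgcb.org.bd/PGCB/?a=pages/hourly_generation_loadshed_display.php'
    
    # read_html returns a list with one DataFrame per <table> tag on the page
    tables = pd.read_html(url)
    df = tables[1]
    
    # Drop the last row, then drop any columns that are entirely empty
    df = df[:-1]
    df = df.dropna(axis=1,how='all')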
    

    Output:

    print (df.to_string())
              Date      Time Generation Demand Shortage Loadshed        Remark
    0   10-10-2019  18:00:00       9182   9182        0        0           NaN
    1   10-10-2019  17:00:00     8091.3   8091        0        0           NaN
    2   10-10-2019  16:00:00     8277.7   8278        0        0           NaN
    3   10-10-2019  15:00:00     8465.8   8466        0        0           NaN
    4   10-10-2019  14:00:00     8394.7   8395        0        0           NaN
    5   10-10-2019  13:00:00     8553.4   8553        0        0           NaN
    6   10-10-2019  12:00:00       8376   8376        0        0      Day Peak
    7   10-10-2019  11:00:00     8169.9   8170        0        0           NaN
    8   10-10-2019  10:00:00     7900.9   7901        0        0           NaN
    9   10-10-2019  09:00:00     7600.4   7600        0        0           NaN
    10  10-10-2019  08:00:00     7165.2   7165        0        0           NaN
    11  10-10-2019  07:00:00     6980.4   6980        0        0           NaN
    12  10-10-2019  06:00:00     7017.1   7017        0        0           NaN
    13  10-10-2019  05:00:00       7328   7328        0        0           NaN
    14  10-10-2019  04:00:00       7504   7504        0        0           NaN
    15  10-10-2019  03:00:00       7877   7877        0        0           NaN
    16  10-10-2019  02:00:00       8071   8071        0        0           NaN
    17  10-10-2019  01:00:00       8400   8400        0        0           NaN
    18  09-10-2019  24:00:00       8847   8847        0        0           NaN
    19  09-10-2019  23:00:00       9093   9093        0        0           NaN
    20  09-10-2019  22:00:00       9483   9483        0        0           NaN
    21  09-10-2019  21:00:00       9852   9852        0        0           NaN
    22  09-10-2019  20:00:00      10284  10284        0        0  Evening Peak
    23  09-10-2019  19:30:00      10229  10229        0        0           NaN
    24  09-10-2019  19:00:00      10211  10211        0        0           NaN
    25  09-10-2019  18:00:00       9538   9538        0        0           NaN
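
    The question only asks for the Date and Time columns. Once the table is in a DataFrame, those can be picked out by name. The following is a minimal sketch, assuming the column names shown in the output above; the two rows are copied by hand from that output and stand in for the scraped table, so no network access is needed.

```python
import pandas as pd

# Two rows copied from the output above; this stands in for the scraped table.
df = pd.DataFrame(
    [
        ["10-10-2019", "09:00:00", "7600.4", "7600", "0", "0", None],
        ["10-10-2019", "08:00:00", "7165.2", "7165", "0", "0", None],
    ],
    columns=["Date", "Time", "Generation", "Demand", "Shortage", "Loadshed", "Remark"],
)

# Keep only the two columns asked for in the question.
date_time = df[["Date", "Time"]]

# index=False hides the row numbers so only the two columns are printed.
print(date_time.to_string(index=False))
```

    The same selection should work on the df produced by read_html, provided the header row was picked up as column names.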
    

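    Additional

    If you want to see how this would work with BeautifulSoup directly, here is how to iterate over the rows. QHarr's answer also offers another/better approach.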

    import pandas as pd
    from bs4 import BeautifulSoup
    import requests
    
    url = 'https://www.pgcb.org.bd/PGCB/?a=pages/hourly_generation_loadshed_display.php'
    
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # The page has two <table> tags; the data table is the second one
    tables = soup.find_all('table')
    table = tables[1]
    
    # Column names come from the <th> cells
    headers = table.find_all('th')
    columns = [ th.text for th in headers ]
    
    # Collect the cell texts of each row in a list and build the DataFrame once
    # (DataFrame.append was removed in pandas 2.0)
    records = []
    rows = table.find_all('tr')
    for row in rows:
        tds = row.find_all('td')
        data = [ td.text for td in tds ]
        if data:  # the header row has no <td> cells, so skip it
            records.append(data)
    
    df = pd.DataFrame(records)
    df = df.dropna(axis=1,how='all')
    df = df.dropna(axis=0,how='all')
    df.columns = columns
    df = df[:-1]
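
    The row iteration above can also be tried offline without pandas. The following is a minimal sketch, assuming markup shaped like the snippet in the question; the html string is a shortened hand-made copy of it (three columns only), and the names html and records are made up for this illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question; stands in for the downloaded page.
html = """
<table class="layout display responsive-table"><tr>
<th style="text-align: center;">Date</th>
<th style="text-align: center;">Time</th>
<th style="text-align: center;">Generation</th>
</tr> <tr>
<td style="text-align: center;">10-10-2019</td>
<td style="text-align: center;">09:00:00</td>
<td style="text-align: center;">7600.4</td>
</tr>
<tr>
<td style="text-align: center;">10-10-2019</td>
<td style="text-align: center;">08:00:00</td>
<td style="text-align: center;">7165.2</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Header texts become the keys of each record.
columns = [th.get_text(strip=True) for th in table.find_all("th")]

records = []
for row in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if cells:  # the header row has no <td> cells, so it is skipped
        records.append(dict(zip(columns, cells)))

# Print only the two fields asked for in the question.
for record in records:
    print(record["Date"], "|", record["Time"])
```

    A list of dicts like this can be passed to pd.DataFrame(records) if a DataFrame is needed later.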
    

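    【Discussion】:

    • Thank you. I am still new to this kind of framework, and your illustration really helped me understand the solution.
    【Solution 2】:

    I would target the single table rather than retrieving all of them, using a faster CSS class selector: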

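    import pandas as pd
    from bs4 import BeautifulSoup as bs
    from io import StringIO
    import requests
    
    r = requests.get('https://www.pgcb.org.bd/PGCB/?a=pages/hourly_generation_loadshed_display.php')
    soup = bs(r.text, 'html.parser')
    # Select the one table by its CSS class. pandas 2.1+ deprecates passing a
    # literal HTML string, so wrap it in StringIO.
    # read_html returns a list of DataFrames; take the first (only) one.
    df = pd.read_html(StringIO(str(soup.select_one('.responsive-table'))))[0]
    print(df)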
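
    To see how the class selector compares with picking the table by position, here is a minimal offline sketch. The html string is made up for illustration: the first table and its "menu" class are assumptions about what an unrelated table might look like, and only the second table's class list is taken from the question.

```python
from bs4 import BeautifulSoup

# Hand-made page with two tables; only the second one carries the data.
# The "menu" table is hypothetical and just plays the role of an unrelated table.
html = """
<table class="menu"><tr><td>navigation</td></tr></table>
<table class="layout display responsive-table">
<tr><th>Date</th><th>Time</th></tr>
<tr><td>10-10-2019</td><td>09:00:00</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Positional lookup: depends on how many tables come before the one you want.
by_index = soup.find_all("table")[1]

# Class lookup: ".responsive-table" matches any element whose class list
# contains that class, even though the tag has three classes.
by_class = soup.select_one(".responsive-table")

print(by_class["class"])
print(by_index == by_class)
```

    The class lookup keeps working if the site adds or removes other tables, as long as the class name stays the same; select_one returns None when nothing matches, so it is worth checking for that before parsing.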
    

    【Discussion】:

    • Thanks for your help!! You are awesome :)