Trying to get the data from a site, which does not have an API, using BeautifulSoup

【Question title】: Trying to get the data from a site, which does not have an API, using BeautifulSoup
【Posted】: 2019-10-10 12:43:52
【Question】:

So, I am building a scraper that scrapes tabular data from a website and then uploads it to an Azure database. I am trying to scrape the data with Beautiful Soup. The site is https://www.pgcb.org.bd/PGCB/?a=pages/hourly_generation_loadshed_display.php and the problem is that its HTML is messy:

</div>
<!-- main container-->
<div class="grid_18" id="main_container">
<div style="padding-left: 10px; padding-top: 5px;"><img 
src="images/hgen&amp;loadshed.jpg"/></div>
<head>
<style>
        tr:nth-child(even){
            background-color: #ccc;
    }
    tr:hover
    {
        background: #f7dcdf;
    }
</style>
</head>
<table class="layout display responsive-table"><tr>
<th style="text-align: center;">Date</th>
<th style="text-align: center;">Time</th>
<th style="text-align: center;">Generation</th>
<th style="text-align: center;">Demand</th>
<th style="text-align: center;">Shortage</th>
<th style="text-align: center;">Loadshed</th>
<th style="text-align: center;">Remark</th>
</tr> <tr>
<td style="text-align: center;">10-10-2019</td>
<td style="text-align: center;">09:00:00</td>
<td style="text-align: center;">7600.4</td>
<td style="text-align: center;">7600</td>
<td style="text-align: center;">0</td>
<td style="text-align: center;">0</td>
<td style="text-align: center;"></td>
</tr>
<tr>
<td style="text-align: center;">10-10-2019</td>
<td style="text-align: center;">08:00:00</td>
<td style="text-align: center;">7165.2</td>
<td style="text-align: center;">7165</td>
<td style="text-align: center;">0</td>
<td style="text-align: center;">0</td>
<td style="text-align: center;"></td>
</tr>
<tr>

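So far I have tried the code below and got the output above, along with some other text that I can strip out later. However, I need to turn the date and time cells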

<td style="text-align: center;">10-10-2019</td>
<td style="text-align: center;">09:00:00</td>

into a tabular format, like:

Date | Time |

10-10-2019 | 9:00:00 |

This is what I have done so far:

#import requests
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq # webclient

# page to scrape
page_url = "https://www.pgcb.org.bd/PGCB/?a=pages/hourly_generation_loadshed_display.php"
uclient = uReq (page_url)

#parsing the html

page_soup = soup (uclient.read(), "html.parser")
uclient.close()

table1 = page_soup.findAll("table",{"class":"layout display responsive-table"})

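Please tell me how to improve this and get the expected result.

【Comments】:

Tags: python html python-3.x web-scraping beautifulsoup

【Solution 1】:

BeautifulSoup is a great tool, and you can certainly use it here. But whenever I see <table> tags, I just use pandas' .read_html() to do the work (it can use BeautifulSoup under the hood, depending on the parser), and then only the table needs tidying up. It returns a list of DataFrames, one per <table> tag. In this case there are 2 table tags, and the table you want is at index position 1: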

import pandas as pd

url = 'https://www.pgcb.org.bd/PGCB/?a=pages/hourly_generation_loadshed_display.php'

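# read_html returns a list with one DataFrame per <table> on the page
tables = pd.read_html(url)
df = tables[1]

# Drop the last row, then drop any column that is entirely empty
df = df[:-1]
df = df.dropna(axis=1,how='all')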

Output:

print (df.to_string())
          Date      Time Generation Demand Shortage Loadshed        Remark
0   10-10-2019  18:00:00       9182   9182        0        0           NaN
1   10-10-2019  17:00:00     8091.3   8091        0        0           NaN
2   10-10-2019  16:00:00     8277.7   8278        0        0           NaN
3   10-10-2019  15:00:00     8465.8   8466        0        0           NaN
4   10-10-2019  14:00:00     8394.7   8395        0        0           NaN
5   10-10-2019  13:00:00     8553.4   8553        0        0           NaN
6   10-10-2019  12:00:00       8376   8376        0        0      Day Peak
7   10-10-2019  11:00:00     8169.9   8170        0        0           NaN
8   10-10-2019  10:00:00     7900.9   7901        0        0           NaN
9   10-10-2019  09:00:00     7600.4   7600        0        0           NaN
10  10-10-2019  08:00:00     7165.2   7165        0        0           NaN
11  10-10-2019  07:00:00     6980.4   6980        0        0           NaN
12  10-10-2019  06:00:00     7017.1   7017        0        0           NaN
13  10-10-2019  05:00:00       7328   7328        0        0           NaN
14  10-10-2019  04:00:00       7504   7504        0        0           NaN
15  10-10-2019  03:00:00       7877   7877        0        0           NaN
16  10-10-2019  02:00:00       8071   8071        0        0           NaN
17  10-10-2019  01:00:00       8400   8400        0        0           NaN
18  09-10-2019  24:00:00       8847   8847        0        0           NaN
19  09-10-2019  23:00:00       9093   9093        0        0           NaN
20  09-10-2019  22:00:00       9483   9483        0        0           NaN
21  09-10-2019  21:00:00       9852   9852        0        0           NaN
22  09-10-2019  20:00:00      10284  10284        0        0  Evening Peak
23  09-10-2019  19:30:00      10229  10229        0        0           NaN
24  09-10-2019  19:00:00      10211  10211        0        0           NaN
25  09-10-2019  18:00:00       9538   9538        0        0           NaN
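
The Date and Time columns in the output above are still plain text, and the row with 24:00:00 is not a valid clock time, so a strict %H:%M:%S parse would reject it. The following is a minimal sketch of one way to get numeric columns and a single timestamp. It assumes the dates are day-first (DD-MM-YYYY) and that 24:00:00 means midnight at the end of that day. The Timestamp column name is an illustrative choice, not part of the original answer:

```python
import pandas as pd

# A few rows copied from the output above, kept as strings to mimic
# scraped text (assumption: the real columns may already be numeric)
df = pd.DataFrame(
    {
        "Date": ["10-10-2019", "10-10-2019", "09-10-2019"],
        "Time": ["09:00:00", "01:00:00", "24:00:00"],
        "Generation": ["7600.4", "8400", "8847"],
        "Demand": ["7600", "8400", "8847"],
    }
)

# Convert the numeric columns; unparseable values become NaN instead of raising
for col in ["Generation", "Demand"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Treat the time as an offset from midnight of the day-first date, so that
# "24:00:00" rolls over to 00:00 on the following day
df["Timestamp"] = pd.to_datetime(df["Date"], format="%d-%m-%Y") + pd.to_timedelta(df["Time"])

print(df[["Timestamp", "Generation", "Demand"]])
```

Adding a timedelta rather than parsing a combined date-time string keeps the 24:00:00 row instead of dropping it or failing on it.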

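Additional

If you want to see how this would work with BeautifulSoup alone, here is how to iterate over the table. QHarr also offers an alternative (and better) approach.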

import pandas as pd
from bs4 import BeautifulSoup
import requests

url = 'https://www.pgcb.org.bd/PGCB/?a=pages/hourly_generation_loadshed_display.php'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

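# The page has two <table> tags; the data table is the second one
tables = soup.find_all('table')
table = tables[1]

# Column names come from the <th> cells of the header row
headers = table.find_all('th')
columns = [ th.text for th in headers ]

# Collect the cell text of each row; the header row has no <td> cells and is skipped
data = []
rows = table.find_all('tr')
for row in rows:
    tds = row.find_all('td')
    if tds:
        data.append([ td.text for td in tds ])

# DataFrame.append was removed in pandas 2.0, so build the frame in one step
df = pd.DataFrame(data)

df = df.dropna(axis=1,how='all')
df = df.dropna(axis=0,how='all')
df.columns = columns
df = df[:-1]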
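
The row iteration above can also be tried offline. The following is a minimal sketch that runs against two data rows copied from the markup in the question (inline styles omitted, closing tags added, so the snippet itself is an assumption about the page structure) and prints the Date and Time cells in the layout the question asks for:

```python
from bs4 import BeautifulSoup

# Two data rows copied from the markup in the question (styles omitted)
html = """
<table class="layout display responsive-table"><tr>
<th>Date</th><th>Time</th><th>Generation</th>
</tr> <tr>
<td>10-10-2019</td><td>09:00:00</td><td>7600.4</td>
</tr>
<tr>
<td>10-10-2019</td><td>08:00:00</td><td>7165.2</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# class_ matches any single class of a multi-class attribute
table = soup.find("table", class_="responsive-table")

columns = [th.get_text(strip=True) for th in table.find_all("th")]
rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has no <td> cells
        rows.append(cells)

# Print the first two columns in the "Date | Time |" layout
print(" | ".join(columns[:2]) + " |")
for row in rows:
    print(" | ".join(row[:2]) + " |")
```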

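【Comments】:

Thank you. I am still new to this kind of framework, and your illustration really helped me understand the solution.

【Solution 2】:

I would target the single table rather than retrieving all of them, using a faster CSS class selector: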

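from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://www.pgcb.org.bd/PGCB/?a=pages/hourly_generation_loadshed_display.php')
soup = bs(r.text, 'html.parser')
# read_html returns a list of DataFrames, so take the first one; the markup is
# wrapped in StringIO because recent pandas versions deprecate literal HTML strings
df = pd.read_html(StringIO(str(soup.select_one('.responsive-table'))))[0]
print(df)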
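
To illustrate why the class selector avoids indexing into a list of tables, here is a small offline sketch. The two-table markup is hypothetical and only mimics the situation described in the first answer, where the page has a table other than the one wanted:

```python
from bs4 import BeautifulSoup

# Hypothetical page with two tables; only the second carries the target class
html = """
<table><tr><td>navigation</td></tr></table>
<table class="layout display responsive-table">
<tr><th>Date</th><th>Time</th></tr>
<tr><td>10-10-2019</td><td>09:00:00</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match or None, so no list indexing is needed
table = soup.select_one(".responsive-table")
if table is None:
    raise ValueError("no table with class 'responsive-table' found")

# A CSS selector can also reach straight to the data cells of that table
cells = [td.get_text(strip=True) for td in table.select("tr td")]
print(cells)
```

Checking for None before use gives a clear error if the site ever renames or removes the class, rather than a failure further down in the parsing code.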

【Comments】:

Thanks for your help!! You're awesome :)