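The difficulty with web scraping is that every site structures its content differently.
To scrape the reports, you need to understand how the web application (in this case) loads them from the server, which you can see in your browser's Network tab.
I noticed that when you click View Report for rivers, this URL is called: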
https://www.wbiwd.gov.in/index.php/applications/rivergreport
For the rainfall report:
https://www.wbiwd.gov.in/index.php/applications/raingreport
For reservoirs:
https://www.wbiwd.gov.in/index.php/applications/reservoirgreport
Remarks:
https://www.wbiwd.gov.in/index.php/applications/remarksreport
Forecast:
https://www.wbiwd.gov.in/index.php/applications/weather_report
Special alert:
https://www.wbiwd.gov.in/index.php/applications/special_alert
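The endpoints listed above could be collected in one lookup table, so the same request code can be pointed at any report. This is only a sketch: the short keys (`river`, `rain`, and so on) and the `report_url` helper are names made up for illustration; only the URL paths come from the network tab.

```python
# Base path shared by all the report endpoints listed above.
BASE_URL = "https://www.wbiwd.gov.in/index.php/applications/"

# Hypothetical short names mapped to the endpoint paths seen in the network tab.
REPORT_ENDPOINTS = {
    "river": "rivergreport",
    "rain": "raingreport",
    "reservoir": "reservoirgreport",
    "remarks": "remarksreport",
    "forecast": "weather_report",
    "special_alert": "special_alert",
}


def report_url(report: str) -> str:
    """Return the full URL for a report key; raises KeyError if the key is unknown."""
    return BASE_URL + REPORT_ENDPOINTS[report]


if __name__ == "__main__":
    for name in REPORT_ENDPOINTS:
        print(name, report_url(name))
```

I have only looked at the river request in detail, so the other endpoints may expect different form fields.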
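In Chrome, right-click the request -> Copy -> Copy as cURL gives you the request as a curl command, which you can then port to whatever language you like. Here is an example for the river report in Python: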
import requests
from bs4 import BeautifulSoup

cookies = {
    'csrf_cookie_name': 'c00e8a9a3a049ed2d89b621b6c6912d1',  # I think this can be hardcoded
}
headers = {
    'Content-Type': 'application/x-www-form-urlencoded',
    # the User-Agent is not necessarily needed
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
}
data = {
    'river_date': '2019-08-02',  # this is the date you select from the calendar
    'submit': 'View Report',  # the report type, this is the first table
    'csrf_test_name': 'c00e8a9a3a049ed2d89b621b6c6912d1',  # same value as the cookie above
}

response = requests.post('https://www.wbiwd.gov.in/index.php/applications/rivergreport', headers=headers, cookies=cookies, data=data)
html = response.text  # the HTML is in fact the contents of the table displayed after you request the report
print(html)

soup = BeautifulSoup(response.content, 'html.parser')
all_tr = soup.find_all('tr')  # these are the table rows for the river data
for tr in all_tr:
    print(tr)
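Printing each `tr` gives you raw HTML. If you want the values themselves, something like the following could turn the rows into lists of cell text. It is a sketch, assuming the response is an ordinary HTML table with `th`/`td` cells; the `rows_to_cells` helper and the sample HTML are made up for illustration and are not the real output of the site.

```python
from bs4 import BeautifulSoup


def rows_to_cells(html):
    """Return one list of stripped cell strings per table row found in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        # Collect header and data cells in document order.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:  # skip rows that contain no cells at all
            rows.append(cells)
    return rows


if __name__ == "__main__":
    # Made-up sample; the real report columns may differ.
    sample = """
    <table>
      <tr><th>Station</th><th>Level</th></tr>
      <tr><td> Example A </td><td>1.23</td></tr>
    </table>
    """
    for row in rows_to_cells(sample):
        print(row)
```

With the real response you would call `rows_to_cells(response.text)` instead of passing the sample string.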
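For this script to work you need Python 3 with pip installed; there are plenty of good tutorials out there on how to set that up.
Then run this on the command line: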
pip install requests beautifulsoup4
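To request a different date without editing the dictionaries by hand, the cookie and form data could be built by a small helper. This is a sketch: `build_river_request` is a hypothetical name, and it assumes (as the hardcoded example suggests) that the server only needs the `csrf_test_name` field to match the `csrf_cookie_name` cookie. I have not verified that against the site.

```python
from datetime import date


def build_river_request(report_date, token):
    """Return (cookies, data) for the river report POST.

    Assumption: the CSRF check only needs the form field and the cookie
    to carry the same token, as in the hardcoded example.
    """
    cookies = {"csrf_cookie_name": token}
    data = {
        "river_date": report_date.isoformat(),  # formatted as YYYY-MM-DD
        "submit": "View Report",
        "csrf_test_name": token,
    }
    return cookies, data


if __name__ == "__main__":
    cookies, data = build_river_request(
        date(2019, 8, 2), "c00e8a9a3a049ed2d89b621b6c6912d1"
    )
    print(cookies)
    print(data)
```

The two return values can be passed straight to `requests.post(url, headers=headers, cookies=cookies, data=data)`.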