Web scraping using a date picker/calendar: answers

【Question title】: Web scraping using a date picker/calendar
【Posted】: 2019-08-02 08:43:35
【Question】:

I am fairly new to web scraping. I use Power Query in Excel to import data from the web. I have managed this for simple HTML pages, but I am finding it hard for complex pages like this one.

To get the data tables, a date has to be selected from the calendar on the web page, e.g. 31 July 2019. The URL stays the same.

When I set up the link in Power Query, I do not get any values or tables. Please see the snapshot below.

Snapshot - Blank table as shown in PowerQuery

Could you tell me what the solution is? I want all the tables for one date, and then to loop over multiple dates.

Thanks in advance!

【Question comments】:

Tags: excel web-scraping powerquery

【Answer 1】:

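The challenge with web scraping is that every page is different. To scrape these reports, use your browser's network tab to see how the web application loads them from the server.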

I noticed that clicking View Report for the rivers calls this URL:

https://www.wbiwd.gov.in/index.php/applications/rivergreport

For the rainfall report:

https://www.wbiwd.gov.in/index.php/applications/raingreport

For the reservoirs:

https://www.wbiwd.gov.in/index.php/applications/reservoirgreport

Remarks:

https://www.wbiwd.gov.in/index.php/applications/remarksreport

Forecast:

https://www.wbiwd.gov.in/index.php/applications/weather_report

Special alert:

https://www.wbiwd.gov.in/index.php/applications/special_alert
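
For later scripting, the endpoints listed above can be kept in one place. This is only a minimal sketch: the short keys (`river`, `rain`, and so on) and the helper name `report_url` are labels made up for illustration, not names used by the site.

```python
# Endpoints observed in the browser's network tab (copied from the list above).
BASE_URL = "https://www.wbiwd.gov.in/index.php/applications/"

# The dictionary keys are hypothetical short labels chosen for this sketch.
REPORT_PATHS = {
    "river": "rivergreport",
    "rain": "raingreport",
    "reservoir": "reservoirgreport",
    "remarks": "remarksreport",
    "forecast": "weather_report",
    "special_alert": "special_alert",
}


def report_url(name):
    """Return the full URL for one of the report labels above."""
    return BASE_URL + REPORT_PATHS[name]


if __name__ == "__main__":
    for name in REPORT_PATHS:
        print(name, report_url(name))
```

Only the river report's form fields appear in this answer. The other endpoints may expect different field names, which the network tab would show.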

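In Chrome, right-click the request -> Copy as cURL gives you the curl command, which you can then port to any language you like. Here is a Python example for the river report: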

import requests
from bs4 import BeautifulSoup

cookies = {
    'csrf_cookie_name': 'c00e8a9a3a049ed2d89b621b6c6912d1'  # I think this can be hardcoded
}

headers = {
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'  # User-Agent is not strictly needed
}

data = {
  'river_date': '2019-08-02', # this is the date you select from the calendar
  'submit': 'View Report', # the report type, this is the first table
  'csrf_test_name': 'c00e8a9a3a049ed2d89b621b6c6912d1'
}

response = requests.post('https://www.wbiwd.gov.in/index.php/applications/rivergreport', headers=headers, cookies=cookies, data=data)

html = response.text  # HTML that is in fact the contents of the table displayed after you request the report

print(html)

soup = BeautifulSoup(response.content, 'html.parser')

all_tr = soup.find_all('tr') # these are the table rows for the river data

for tr in all_tr:
    print(tr)
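
Printing each `tr` shows raw HTML. To get plain cell values instead, the rows can be reduced to lists of text. The sketch below uses only the standard library's `html.parser` rather than BeautifulSoup, so it runs without extra installs. The names `TableRowParser` and `extract_rows` and the sample HTML are made up for illustration and do not reflect the real report's columns. Nested tables are not handled.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def _end_cell(self):
        if self._cell is not None and self._row is not None:
            # Collapse whitespace so multi-line cells become one line.
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def end_row(self):
        self._end_cell()
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.end_row()  # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._end_cell()  # tolerate a missing </td>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag in ("tr", "table"):
            self.end_row()


def extract_rows(html):
    """Return the table rows found in `html` as lists of cell text."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    parser.end_row()  # flush a row left open at end of input
    return parser.rows


if __name__ == "__main__":
    # Made-up sample; the real report has different columns and values.
    sample = ("<table><tr><th>Station</th><th>Level</th></tr>"
              "<tr><td> A </td><td>1.5</td></tr></table>")
    for row in extract_rows(sample):
        print(row)
```

With the script above, `extract_rows(response.text)` would take the place of the `find_all('tr')` loop.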

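To get this script working, you need Python 3 installed, along with pip; there are plenty of good tutorials on how to set that up.

Then run this on the command line: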

pip install requests beautifulsoup4

【Comments】:

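It works fine on my machine: I copied the whole code into code.py, ran 'python3 code.py' in a terminal, and I can see the td's.
You should download an IDE such as PyCharm and use it.
Just download PyCharm and run the code. If you don't believe me, here is the generated HTML: pastebin.com/qjFYDjVm
Thanks for sharing the HTML. I am getting results now, the same as listed in the link above. I don't want to bother you with this, but how can I turn it into a table in PowerQuery, the same as on the source web page? I want to build a database in Excel or CSV for different dates. Could you help me adjust this code to take a date range and write the output to a txt file? I apologize for the request.
Just wrap it in a method that accepts the date as a parameter and call it once for each day in your period. To me, my post answers the original question; you are now facing different problems, which could be handled in separate questions or simply by searching Google.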
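
The suggestion in the last comment (wrap the request in a function that takes the date, then call it for each day) could look roughly like this. It is a sketch, not the answerer's code. `fetch_rows` is a hypothetical stand-in for the request-and-parse code from the answer, wrapped in a function that takes a `YYYY-MM-DD` string and returns a list of rows. `daily_dates` and `write_reports` are names invented here. The output is CSV so that Excel or Power Query can load it as a table.

```python
import csv
import sys
from datetime import date, timedelta


def daily_dates(start, end):
    """Yield every date from `start` to `end`, both inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def write_reports(start, end, fetch_rows, out_file):
    """Fetch the report for each day and write all rows as CSV.

    `fetch_rows` is a hypothetical callable: it takes a 'YYYY-MM-DD' string
    and returns a list of rows (each row a list of cell strings).
    Returns the number of rows written.
    """
    writer = csv.writer(out_file)
    written = 0
    for day in daily_dates(start, end):
        day_text = day.isoformat()  # same format as the 'river_date' field
        for row in fetch_rows(day_text):
            # Prefix each row with its date so the days stay distinguishable.
            writer.writerow([day_text, *row])
            written += 1
    return written


if __name__ == "__main__":
    # Stand-in fetcher with made-up data; replace with the real request code.
    def fake_fetch(day_text):
        return [["Station A", "1.5"], ["Station B", "2.0"]]

    write_reports(date(2019, 7, 29), date(2019, 7, 31), fake_fetch, sys.stdout)
```

To save to disk instead of printing, pass a file opened with `open("reports.csv", "w", newline="")` as `out_file`.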