Web scraping using Python - link is unchanged with form input

【Question title】: Webscraping using Python - Link is unchanged with form input
【Posted】: 2017-02-15 13:03:17
【Question description】:

I am planning to retrieve historical data from an openly available website. From the link:

https://www.entsoe.eu/db-query/consumption/mhlv-a-specific-country-for-a-specific-day

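Ideally, I am trying to use input from a Pandas DataFrame to change the country, day, month, and year, retrieve the result (the energy consumption shown on this web page), and store it back in Excel.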

I am trying different web scraping tools, but one thing looks suspicious to me.

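It is this: when I manually change the country, day, month, and year and retrieve the result, the web link stays the same. Is it possible to achieve my goal with this web link?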

Thank you for your time.

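【Comments】:

The data is submitted with a POST request to the site rather than a GET request, which is why the URL does not change. Several of the suggested frameworks can send POST requests or submit forms.
You can use the Developer Tools in Chrome/Firefox to see the data in the POST request.
Thanks a lot to @Rejected and furas for sharing your knowledge.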

Tags: python web-scraping beautifulsoup scrapy mechanize

【Solution 1】:

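First, you need to understand what happens when you click the "send" button. A POST request is sent to the same URL, with parameters corresponding to the values you selected in the form. You can see this request in the Network tab of your browser's developer tools. Next, you need to simulate this request in your code (I'll use the excellent requests package below).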
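To make the point about the unchanged URL concrete, here is a minimal sketch that builds such a request without sending it. It uses the standard library's urllib instead of requests so that it runs offline; that swap is my choice for illustration, not part of the answer. The field names are taken from the payload in the answer's code, and only a subset of them is shown.

```python
from urllib.parse import parse_qs, urlencode
from urllib.request import Request

url = "https://www.entsoe.eu/db-query/consumption/mhlv-a-specific-country-for-a-specific-day"

# Subset of the form fields used in the answer's payload.
form = {"opt_Country": "3", "opt_Day": "1", "opt_Month": "1", "opt_Year": "2009"}

# Encode the fields the way a browser encodes an HTML form body.
body = urlencode(form).encode("ascii")

# Passing data= makes urllib treat this as a POST. Nothing is sent here.
request = Request(url, data=body)

print(request.get_method())            # POST
print(request.full_url == url)         # the URL itself carries no form values
print(parse_qs(body.decode("ascii")))  # the selections travel in the request body
```

Because the selected values live in the request body, two submissions with different dates are indistinguishable by URL alone, which matches what the question describes.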

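Another problem is that if you inspect the response to that POST request, you won't find the table element with the desired data that you see in the browser. This is because the table is generated dynamically from the myData JavaScript variable that sits inside a script element. Since neither BeautifulSoup nor requests is a browser and neither can execute JavaScript, you need to extract the value of myData from the script yourself.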
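The extraction step can be tried in isolation on a made-up snippet. The HTML string below is an assumption for illustration only and is not the real page; the regular expression is the same one the answer's code uses.

```python
import re
from ast import literal_eval

# Hypothetical response fragment: the data sits in a JavaScript array literal.
html = """
<script>
var myData = [["AT", "2009-01-01", 6277, 6002]];
var other = 1;
</script>
"""

# Capture everything between "var myData = " and the first semicolon.
pattern = re.compile(r"var myData = (.*?);", re.MULTILINE | re.DOTALL)
match = pattern.search(html)

# The array contains only strings and numbers, so it is also a valid
# Python literal and can be parsed without executing any JavaScript.
rows = literal_eval(match.group(1).strip())
print(rows)
```

Note that the non-greedy match stops at the first semicolon, so this approach would break if the data itself ever contained one.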

Here is working code that gets you the desired data for January 1, 2009 in the "archive" range:

import re
from ast import literal_eval
from pprint import pprint

import requests
from bs4 import BeautifulSoup


url = "https://www.entsoe.eu/db-query/consumption/mhlv-a-specific-country-for-a-specific-day"
data = {
    "opt_period": "2",
    "opt_Country": "3",
    "opt_Day": "1",
    "opt_Month": "1",
    "opt_Year": "2009",
    "opt_Response": "1",
    "send": "send"
}
with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36'}
    # visit the page
    session.get(url)

    # make a POST request
    response = session.post(url, data=data)
    soup = BeautifulSoup(response.content, 'html.parser')

    # find the script element whose text defines the myData variable
    pattern = re.compile(r"var myData = (.*?);", re.MULTILINE | re.DOTALL)
    script = soup.find("script", string=pattern)

    # extract the array literal from the script and parse it into a Python list
    match = pattern.search(script.string)
    rows = literal_eval(match.group(1).strip())

    pprint(rows)

This prints a Python list:

[['AT',
  '2009-01-01',
  6277,
  6002,
  5649,
  5230,
  5034,
  5038,
  4858,
  5127,
  5342,
  5747,
  6100,
  6373,
  6325,
  6210,
  6129,
  6160,
  6588,
  7007,
  7058,
  6887,
  6586,
  6137,
  6494,
  5974]]
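To get from this single result toward the goal in the question (many countries and dates, stored in a spreadsheet), the payload and the returned row can each be wrapped in a small helper. This is only a sketch under assumptions: `build_payload` and `to_hourly_records` are hypothetical names, the `opt_period` and `opt_Response` values are simply copied from the code above, and the values after the date are assumed to be consecutive hours in order.

```python
from datetime import date, timedelta


def build_payload(country_code, day):
    """Form fields for one country and day, mirroring the payload above."""
    return {
        "opt_period": "2",  # copied from the answer; other values are untested
        "opt_Country": str(country_code),
        "opt_Day": str(day.day),
        "opt_Month": str(day.month),
        "opt_Year": str(day.year),
        "opt_Response": "1",
        "send": "send",
    }


def to_hourly_records(row):
    """Turn ['AT', '2009-01-01', v0, v1, ...] into one dict per value."""
    country, day, *values = row
    return [
        {"country": country, "date": day, "hour": hour, "value": value}
        for hour, value in enumerate(values)  # hour index is an assumption
    ]


# Payloads for three consecutive days, using the country code from the answer.
start = date(2009, 1, 1)
payloads = [build_payload("3", start + timedelta(days=n)) for n in range(3)]

# Shortened version of the row printed above.
records = to_hourly_records(["AT", "2009-01-01", 6277, 6002, 5649])
print(payloads[2]["opt_Day"], records[1])
```

Each payload could then be passed to `session.post(url, data=payload)` inside the session shown above, and the collected records handed to `pandas.DataFrame(records)` followed by `DataFrame.to_excel` to write the spreadsheet.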

【Comments】:

Thank you very much. I learned a lot from your answer. I really appreciate the time you took :)