【Posted】: 2020-05-20 16:05:42
【Question】:
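I am trying to get historical economic calendar data for a date range (February 1, 2020 to February 5, 2020) from this website: https://www.investing.com/economic-calendar/.
Today is February 4, 2020.
If I use the URL https://www.investing.com/economic-calendar/ as shown below, I can extract the table with BeautifulSoup, but I cannot select any day other than the current one. The table my Python script ends up with is the one for today (February 4, 2020).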
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Form fields for the request: country filter, date range, time zone, tab.
payload = {
    "country[]": ["25", "32", "6", "37", "72", "22", "17", "39", "14", "10",
                  "35", "43", "56", "36", "110", "11", "26", "12", "4", "5"],
    "dateFrom": "2020-02-01",
    "dateTo": "2020-02-05",
    "timeZone": "8",
    "timeFilter": "timeRemain",
    "currentTab": "custom",
    "limit_from": "0",
}
urlheader = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
}

# POST to the calendar page itself.
url = "https://www.investing.com/economic-calendar/"
req = requests.post(url, data=payload, headers=urlheader)
print(req)  # prints the response object, which shows the status code

# Look for the calendar table by its id in the returned HTML.
soup = BeautifulSoup(req.content, "lxml")
table = soup.find('table', id="economicCalendarData")
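I can see that whenever I change the date range or the filter settings, the page sends a POST request to "https://www.investing.com/economic-calendar/Service/getCalendarFilteredData".
Here is the request data I found.
Here is the POST link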
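For reference, the form body that a payload like this turns into can be previewed offline and compared with the form data shown in the browser's network tab. This is a minimal sketch using only the standard library; the country list is shortened for readability, and it assumes the server expects ordinary form encoding with a repeated country[] key, which is how requests encodes a list value passed via data=.

```python
from urllib.parse import urlencode, parse_qs

# Shortened copy of the payload from the question (only three country codes).
payload = {
    "country[]": ["25", "32", "6"],
    "dateFrom": "2020-02-01",
    "dateTo": "2020-02-05",
    "timeZone": "8",
    "timeFilter": "timeRemain",
    "currentTab": "custom",
    "limit_from": "0",
}

# doseq=True repeats the key once per list item instead of encoding the
# list's repr, which mirrors how a list value is sent in a form body.
body = urlencode(payload, doseq=True)
print(body)

# Round-trip the body to confirm every field survived the encoding.
decoded = parse_qs(body)
print(decoded["country[]"], decoded["dateFrom"])
```

If a field that appears in the browser's form data is absent from this body, that difference is a reasonable first thing to check.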
So I switched to the code below, because I want to choose the dates.
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Same form fields as before: country filter, date range, time zone, tab.
payload = {
    "country[]": ["25", "32", "6", "37", "72", "22", "17", "39", "14", "10",
                  "35", "43", "56", "36", "110", "11", "26", "12", "4", "5"],
    "dateFrom": "2020-02-01",
    "dateTo": "2020-02-05",
    "timeZone": "8",
    "timeFilter": "timeRemain",
    "currentTab": "custom",
    "limit_from": "0",
}
urlheader = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
}

# POST to the filter endpoint that the page calls when the dates change.
url = "https://www.investing.com/economic-calendar/Service/getCalendarFilteredData"
req = requests.post(url, data=payload, headers=urlheader)
print(req)  # prints the response object, which shows the status code

# Look for the calendar table by its id in the returned content.
soup = BeautifulSoup(req.content, "lxml")
table = soup.find('table', id="economicCalendarData")
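But this time there is no economicCalendarData element, so the table variable comes out empty. The soup variable contains data, but not the table data.
This is the table I want to save.
As I said before, if I use https://www.investing.com/economic-calendar/ as the URL, I only get the table for the current day (February 4, 2020), no matter what dates (dateFrom, dateTo) I put in the payload.
For some reason, when I POST to https://www.investing.com/economic-calendar/Service/getCalendarFilteredData instead, the table comes out empty; the soup variable does contain data, but not the data I requested. What am I doing wrong? How can I save the table for the dates I choose?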
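One way to narrow this down is to look at what the second URL actually returns before searching it for a table. The sketch below is only a diagnostic under assumptions: it guesses that an XHR endpoint like this may answer with JSON that wraps an HTML fragment of table rows (without the surrounding table element), which would explain why find('table', id=...) finds nothing. The sample body, its event names, and the field names "data" and "rows_num" are made up for illustration and are not taken from the real response.

```python
import json
from html.parser import HTMLParser


def html_fields(body_text):
    """Return top-level JSON string fields whose value contains table rows."""
    try:
        parsed = json.loads(body_text)
    except ValueError:
        return {}  # not JSON at all; the body is probably plain HTML
    if not isinstance(parsed, dict):
        return {}
    return {k: v for k, v in parsed.items() if isinstance(v, str) and "<tr" in v}


class RowCollector(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


# Hypothetical response body: JSON with bare <tr> rows in a "data" field.
sample_body = json.dumps({
    "data": "<tr><td>02:00</td><td>USD</td><td>Example event A</td></tr>"
            "<tr><td>05:30</td><td>AUD</td><td>Example event B</td></tr>",
    "rows_num": 2,
})

fields = html_fields(sample_body)
collector = RowCollector()
for fragment in fields.values():
    collector.feed(fragment)
print(list(fields), collector.rows)
```

With the real response, req.text would take the place of sample_body. If a field like this turns up, the fragment could be wrapped in a table element and passed to BeautifulSoup, then searched with find_all('tr') rather than find('table', ...).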
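【Comments】:
-
You should add (and look at) the full list of request headers sent by the browser
-
I only see the payload I mentioned above. Where can I find them?
-
I thought the scrollbar in your "Here is the POST link" screenshot was for more request headers, but it may be for the response headers, sorry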
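Following up on the request-header suggestion above: the headers the browser sends are listed under Request Headers for that request in the developer tools' network tab, and they can be compared with what the script sends. A minimal sketch, assuming the browser's headers have been copied by hand into a dict; the browser_headers values below are placeholders, not headers observed from the site.

```python
def missing_headers(sent, browser):
    """Return browser header names that are absent from the script's headers.

    HTTP header names are case-insensitive, so they are compared lowercased.
    """
    sent_names = {name.lower() for name in sent}
    return sorted(name for name in browser if name.lower() not in sent_names)


# Headers from the question's script.
script_headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
}

# Placeholder values standing in for whatever the browser actually sent.
browser_headers = {
    "user-agent": "placeholder",
    "x-requested-with": "XMLHttpRequest",
    "Referer": "placeholder",
    "Accept": "placeholder",
}

print(missing_headers(script_headers, browser_headers))
```

After a call made with requests, req.request.headers holds the headers that were actually sent, so it can be passed as the sent argument instead of the hand-written dict.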
Tags: post web-scraping html-table beautifulsoup python-requests