【Question title】: Using python requests to post - How do I get the correct table data I request?
【Posted】: 2020-05-20 16:05:42
【Question description】:

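I am trying to get historical economic calendar data for 1 Feb 2020 to 5 Feb 2020 from this website: https://www.investing.com/economic-calendar/

Today is 4 Feb 2020.

If I use the URL https://www.investing.com/economic-calendar/ as below, I can extract the table with BeautifulSoup, but I cannot select any day other than the current one. The table I end up with in my Python script is the one for today (4 Feb 2020).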

import requests
import pandas as pd
from bs4 import BeautifulSoup

payload = {"country[]":["25","32","6","37","72","22","17","39","14","10","35","43","56","36","110","11","26","12","4","5"],
                "dateFrom":"2020-02-01",
                "dateTo":"2020-02-05",
                "timeZone":"8",
                "timeFilter":"timeRemain",
                "currentTab":"custom",
                "limit_from":"0"}

urlheader = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"
}

url = "https://www.investing.com/economic-calendar/"

req = requests.post(url, data=payload, headers=urlheader)
print(req)
soup = BeautifulSoup(req.content, "lxml")
table = soup.find('table', id="economicCalendarData")

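The table variable looks like this

I can see that whenever I change the date range or the filter settings, the page sends a POST request to "https://www.investing.com/economic-calendar/Service/getCalendarFilteredData".

Here is the request data I found.

Here is the POST link

So I switched to the code below, because I want to choose the dates.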

import requests
import pandas as pd
from bs4 import BeautifulSoup

payload = {"country[]":["25","32","6","37","72","22","17","39","14","10","35","43","56","36","110","11","26","12","4","5"],
                "dateFrom":"2020-02-01",
                "dateTo":"2020-02-05",
                "timeZone":"8",
                "timeFilter":"timeRemain",
                "currentTab":"custom",
                "limit_from":"0"}

urlheader = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"
}

url = "https://www.investing.com/economic-calendar/Service/getCalendarFilteredData"

req = requests.post(url, data=payload, headers=urlheader)
print(req)
soup = BeautifulSoup(req.content, "lxml")
table = soup.find('table', id="economicCalendarData")

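But this time there is no economicCalendarData element, so the table variable comes out empty (soup.find returns None). The soup variable has data in it, but not the table data.

This is the table I want to save.

As I said before, if I use https://www.investing.com/economic-calendar/ as the URL, I only get the table data for the current day (4 Feb 2020), no matter what dates (dateFrom, dateTo) I put in the payload.

For some reason, when I POST to https://www.investing.com/economic-calendar/Service/getCalendarFilteredData, the table comes out empty, and although the soup variable contains data, it is not the data I requested. What am I doing wrong? How do I save the table for the dates I choose?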

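【Comments】:

  • You should add (and look at) the full list of request headers the browser sends
  • I only see the payload I mentioned above; where can I find them?
  • I thought the scrollbar in your "Here is the POST link" screenshot was for more request headers, but it may be for the response headers, sorry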
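
To compare what the script sends against the form data shown in the browser's network tab, it can help to look at the encoded request body directly. The sketch below uses only the standard library and assumes the payload is sent as an ordinary URL-encoded form, with list values expanded into repeated keys (which is how `requests` treats `data=payload`). The payload is the one from the question, shortened to three countries for readability.

```python
from urllib.parse import urlencode, parse_qs

# Same payload shape as in the question, shortened to three countries.
payload = {
    "country[]": ["25", "32", "6"],
    "dateFrom": "2020-02-01",
    "dateTo": "2020-02-05",
    "timeZone": "8",
    "timeFilter": "timeRemain",
    "currentTab": "custom",
    "limit_from": "0",
}

# doseq=True repeats the key once per list element, the usual form
# encoding for a field such as country[].
body = urlencode(payload, doseq=True)
print(body)

# Decoding the body again recovers the original fields, which makes it
# easy to check them one by one against the browser's form data.
decoded = parse_qs(body)
print(decoded["country[]"])
```

If the decoded fields match what the browser shows, the remaining differences are more likely in the request headers or in how the response is parsed than in the payload.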

Tags: post web-scraping html-table beautifulsoup python-requests


【Solution 1】:

You were really close. If I understand your requirement correctly, the following should help:

import requests
from bs4 import BeautifulSoup

# The filter endpoint the page itself posts to when the date range changes
url = "https://www.investing.com/economic-calendar/Service/getCalendarFilteredData"

payload = {"country[]":["25","32","6","37","72","22","17","39","14","10","35","43","56","36","110","11","26","12","4","5"],
                "dateFrom":"2020-02-01",
                "dateTo":"2020-02-05",
                "timeZone":"8",
                "timeFilter":"timeRemain",
                "currentTab":"custom",
                "limit_from":"0"}

req = requests.post(url, data=payload, headers={
    "User-Agent":"Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest"
    })
# The response is JSON; the table rows are an HTML string under the 'data' key
soup = BeautifulSoup(req.json()['data'],"lxml")
for items in soup.select("tr"):
    # One list of cell texts per row (header and data cells alike)
    data = [item.get_text(strip=True) for item in items.select("th,td")]
    print(data)
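
Since the question also asks how to save the table, here is a minimal sketch of one way to do it, assuming the `data` lists from the loop above have been collected into a list of rows. The helper `write_rows` and the sample rows are hypothetical and only illustrate the shape of the data; they are not taken from the site.

```python
import csv
import io


def write_rows(rows, handle):
    """Write a list of row lists (one list of cell texts per row) as CSV."""
    writer = csv.writer(handle)
    for row in rows:
        if row:  # skip rows that produced no cells
            writer.writerow(row)


# Hypothetical rows standing in for the `data` lists printed above.
sample_rows = [
    ["Saturday, February 1, 2020"],
    ["09:00", "CNY", "", "Example event", "50.0", "50.1", "50.2"],
    [],
]

# An in-memory buffer keeps the sketch free of side effects.
buffer = io.StringIO()
write_rows(sample_rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead of a buffer, the same helper could be called with a handle opened as `open("calendar.csv", "w", newline="", encoding="utf-8")`, where the file name is just a placeholder.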

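【Comments】:

  • Hi, if you don't mind, could you explain the req.json()['data'] part?
  • The request gives you JSON content. The table data you are interested in, however, sits under the data key as HTML, which you need to process with BeautifulSoup.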
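
The point made in the comments can be illustrated offline. The sketch below uses a made-up response body (the real endpoint's exact fields are not shown in this thread, only that a `data` key holds HTML rows) and swaps BeautifulSoup for the standard library's `html.parser`, so it runs without network access or third-party packages. `RowCollector` is a hypothetical helper name. It also suggests why `soup.find('table', id="economicCalendarData")` would find nothing: in this sample the fragment consists of bare `<tr>` rows with no surrounding `<table>`.

```python
import json
from html.parser import HTMLParser

# Made-up stand-in for the response body; only the "data" key is taken
# from the thread, everything else here is illustrative.
sample_body = json.dumps({
    "data": "<tr><th>Time</th><th>Event</th></tr>"
            "<tr><td>09:00</td><td>Example event</td></tr>"
})


class RowCollector(HTMLParser):
    """Collect the text of th/td cells, grouped by tr."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td") and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


parsed = json.loads(sample_body)      # the equivalent of req.json()
fragment = parsed["data"]             # a str of <tr> rows in this sample
has_table_tag = "<table" in fragment  # False here: no <table> wrapper

collector = RowCollector()
collector.feed(fragment)
collector.close()
rows = collector.rows
print(has_table_tag, rows)
```

With the real response, checking `type(req.json()['data'])` and printing its first few hundred characters is a quick way to confirm whether it has the same shape as this sample.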