Can't get the full table from requests in Python (answers)

【Question title】: Can't get full table from requests python
【Posted】: 2019-03-06 01:47:21
【Question description】:

I am trying to get the entire table from this site: https://br.investing.com/commodities/aluminum-historical-data

But when I send this code:

import requests
from bs4 import BeautifulSoup

with requests.Session() as s:
    r = s.post('https://br.investing.com/commodities/aluminum-historical-data',
               headers={"curr_id": "49768", "smlID": "300586", "header": "Alumínio Futuros Dados Históricos",
                        'User-Agent': 'Mozilla/5.0', 'st_date': '01/01/2017', 'end_date': '29/09/2018',
                        'interval_sec': 'Daily', 'sort_col': 'date', 'sort_ord': 'DESC', 'action': 'historical_data'})

bs2 = BeautifulSoup(r.text, 'lxml')
tb = bs2.find('table', {"id": "curr_table"})

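It only returns part of the table, not the whole date range I just filtered.

I did look at the POST request the page makes, shown in the screenshot I posted.

Can anyone help me get the whole table for the range I just filtered?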

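【Question discussion】:

Why are you using POST if you aren't actually submitting any POST data? Also, why use a session if you only perform one operation?
Because I tried without a session and got the same result. Then I started trying everything.
Where did you get those header values from?
From the picture I posted.
Those aren't headers, they are form data. See how it says "Form Data"?

Tags: python post web-scraping beautifulsoup python-requests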

【Solution 1】:

The problem is that you are passing the form data as headers.

You have to send it with the data keyword argument of requests.Session.post:

import requests
from bs4 import BeautifulSoup

with requests.Session() as session:

    url = 'https://br.investing.com/commodities/aluminum-historical-data'

    # Form fields go in the request body, not in the headers.
    data = {
        "curr_id": "49768",
        "smlID": "300586",
        "header": "Alumínio Futuros Dados Históricos",
        'st_date': '01/01/2017',
        'end_date': '29/09/2018',
        'interval_sec': 'Daily',
        'sort_col': 'date',
        'sort_ord': 'DESC',
        'action': 'historical_data',
        }

    # User-Agent is a real HTTP header, so it belongs here rather than in data.
    your_headers = {'User-Agent': 'Mozilla/5.0'}  # add your other headers here

    response = session.post(url, data=data, headers=your_headers)

bs2 = BeautifulSoup(response.text, 'lxml')
tb = bs2.find('table', {"id": "curr_table"})

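I also recommend including your headers (especially the User-Agent) in the POST request, because the site does not allow bots. If you do, it will be harder for the site to detect that the request comes from a bot.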
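To see where each value ends up without contacting the site, the request can be built and inspected offline. The sketch below is only an illustration of body versus headers: it uses the standard library's urllib rather than requests, and only a subset of the question's form fields. requests form-encodes a dict passed as data= in the same way.

```python
from urllib.parse import urlencode
from urllib.request import Request

url = "https://br.investing.com/commodities/aluminum-historical-data"

# Form fields copied from the question; they belong in the request body.
form_fields = {
    "curr_id": "49768",
    "st_date": "01/01/2017",
    "end_date": "29/09/2018",
    "action": "historical_data",
}

# Real HTTP headers describe the request itself.
headers = {"User-Agent": "Mozilla/5.0"}

# Build the request object without sending it (no network access needed).
body = urlencode(form_fields).encode("ascii")
request = Request(url, data=body, headers=headers, method="POST")

print(request.get_method())          # HTTP method
print(request.data.decode("ascii"))  # form-encoded body
print(request.header_items())        # headers only, no form fields
```

If the form fields were passed as headers instead, the body would be empty and the server would have no date filter to apply.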

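【Discussion】:

That didn't work either. It still only returns data back to 09/03/2018 (MMDDYYYY).
Can you tell me the response status code? Maybe BeautifulSoup isn't parsing the page correctly. Try looking for the data in tb.prettify().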
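One way to follow the suggestion above is to count how many rows of the curr_table table actually came back. The sketch below is a rough check using only the standard library's html.parser; the TableRowCounter class, the count_rows helper, and the sample HTML are made up for illustration and are not part of the thread.

```python
from html.parser import HTMLParser


class TableRowCounter(HTMLParser):
    """Count <tr> elements inside the table with a given id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.depth = 0  # nesting depth of <table> tags inside the target table
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # Enter the target table, or track a table nested inside it.
            if self.depth or dict(attrs).get("id") == self.table_id:
                self.depth += 1
        elif tag == "tr" and self.depth:
            self.rows += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1


def count_rows(html, table_id="curr_table"):
    """Return the number of rows found in the table with id table_id."""
    parser = TableRowCounter(table_id)
    parser.feed(html)
    return parser.rows


# Made-up sample markup; a real page would be passed in as response.text.
sample_html = """
<table id="other"><tr><td>ignored</td></tr></table>
<table id="curr_table">
  <tr><th>header row</th></tr>
  <tr><td>first data row</td></tr>
  <tr><td>second data row</td></tr>
</table>
"""
print(count_rows(sample_html))
```

With a real response, printing response.status_code next to count_rows(response.text) would show whether the request succeeded and how much of the table the server returned.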

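【Solution 2】:

Your code makes two mistakes.

The first is the URL. You need to request the data from investing.com using the correct URL. Your current URL is 'https://br.investing.com/commodities/aluminum-historical-data'

However, if you open the browser's Inspect tool and click the 'Network' tab, the Request URL is https://br.investing.com/instruments/HistoricalDataAjax.

The second mistake is in s.post(blah). As Federico Rubbi mentioned above, the values you assigned to headers must be assigned to data instead.

That fixes both mistakes. You need just one more step: add the dictionary {'X-Requested-With': 'XMLHttpRequest'} to your_headers. Judging from your code, you have already looked at the Network tab in the browser's inspector, so you can probably see why {'X-Requested-With': 'XMLHttpRequest'} is needed: the browser sends that header with the Ajax request.

So the whole code should look like this.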

import requests
import bs4 as bs  # for parsing response.text afterwards

with requests.Session() as s:
    url = 'https://br.investing.com/instruments/HistoricalDataAjax'  # Fixes the first mistake.
    your_headers = {'User-Agent': 'Mozilla/5.0'}

    # Initial GET so the server can set its cookies on the session.
    s.get(url, headers=your_headers)
    c_list = s.cookies.get_dict().items()
    cookie_list = [key + '=' + value for key, value in c_list]
    # Pairs in a Cookie header are separated by '; ', not ','.
    cookie = '; '.join(cookie_list)

    your_headers = {'X-Requested-With': 'XMLHttpRequest', **your_headers}
    your_headers['Cookie'] = cookie

    data = {}  # Your form data. Fixes the second mistake.

    response = s.post(url, data=data, headers=your_headers)
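As a rough sketch of how the pieces of this answer fit together, the form fields from the question can be combined with the Ajax URL and the extra header in one place. The build_historical_request helper is hypothetical, and whether the endpoint still accepts exactly these fields is an assumption. The sketch only builds the arguments and sends nothing.

```python
AJAX_URL = "https://br.investing.com/instruments/HistoricalDataAjax"


def build_historical_request(st_date, end_date):
    """Return (url, data, headers) for the historical-data POST."""
    # Form fields taken from the question; sent in the body via data=.
    data = {
        "curr_id": "49768",
        "smlID": "300586",
        "header": "Alumínio Futuros Dados Históricos",
        "st_date": st_date,
        "end_date": end_date,
        "interval_sec": "Daily",
        "sort_col": "date",
        "sort_ord": "DESC",
        "action": "historical_data",
    }
    # Real HTTP headers, including the Ajax marker from this answer.
    headers = {
        "User-Agent": "Mozilla/5.0",
        "X-Requested-With": "XMLHttpRequest",
    }
    return AJAX_URL, data, headers


url, data, headers = build_historical_request("01/01/2017", "29/09/2018")
# With requests installed, the call would then look like:
#     response = session.post(url, data=data, headers=headers)
print(url)
print(sorted(headers))
```

Keeping the body fields and the headers in separate dictionaries makes it harder to repeat the original mix-up. A requests.Session also stores and resends cookies on its own, so the Cookie header is left out of this sketch.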

【Discussion】: