Scraping a table on the Barchart website using Python

【Question title】: Scraping a table on the Barchart website using Python
【Posted】: 2021-12-05 02:06:45
【Question description】:

Scraping an AJAX web page using python and requests

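I used the script from the link above to fetch a table from the Barchart website, but it recently stopped working and now returns the error {'error': {'message': 'The payload is invalid.', 'code': 400}}. I guess some field names have changed, but I am new to web scraping and do not know how to fix it. Any suggestions?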

import requests

geturl=r'https://www.barchart.com/futures/quotes/CLJ19/all-futures'
apiurl=r'https://www.barchart.com/proxies/core-api/v1/quotes/get'


getheaders={

    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'max-age=0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
    }

getpay={
    'page': 'all'
}

s=requests.Session()
r=s.get(geturl,params=getpay, headers=getheaders)



headers={
    'accept': 'application/json',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'referer': 'https://www.barchart.com/futures/quotes/CLJ19/all-futures?page=all',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
    'x-xsrf-token': s.cookies.get_dict()['XSRF-TOKEN']

}
payload={
    'fields': 'symbol,contractSymbol,lastPrice,priceChange,openPrice,highPrice,lowPrice,previousPrice,volume,openInterest,tradeTime,symbolCode,symbolType,hasOptions',
    'list': 'futures.contractInRoot',
    'root': 'CL',
    'meta': 'field.shortName,field.type,field.description',
    'hasOptions': 'true',
    'raw': '1'

}


r=s.get(apiurl,params=payload,headers=headers)
j=r.json()
print(j)

OUT: {'error': {'message': 'payload is invalid.', 'code': 400}}

【Question discussion】:

Tags: python

【Solution 1】:

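This happened to me too. The website loads the table from an internal API, and the XSRF-TOKEN cookie value has to be URL-decoded before it is sent back in the x-xsrf-token header; otherwise the API returns this error.

Try this solution:

1- Import the unquote function at the top of your code: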

from urllib.parse import unquote

2- Change this line:

'x-xsrf-token': s.cookies.get_dict()['XSRF-TOKEN']

to this:

'x-xsrf-token': unquote(unquote(s.cookies.get_dict()['XSRF-TOKEN']))
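
To see what the double unquote does without hitting the site, here is a minimal sketch. The helper name decode_cookie_value and the sample string are made up for illustration (it is not a real Barchart token); the only library call is urllib.parse.unquote, and the default of two passes simply mirrors the line above.

```python
from urllib.parse import unquote


def decode_cookie_value(raw, passes=2):
    """Percent-decode a cookie value a fixed number of times.

    Hypothetical helper: passes=2 mirrors unquote(unquote(...)) from the
    answer above; it is an assumption that two passes are what the site needs.
    """
    for _ in range(passes):
        raw = unquote(raw)
    return raw


# Made-up sample: the text "abc=" percent-encoded twice ("=" -> "%3D" -> "%253D").
sample = "abc%253D"

print(decode_cookie_value(sample, passes=1))  # one pass leaves "%3D" behind
print(decode_cookie_value(sample))            # two passes recover "abc="
```

In the question's script this would be used as 'x-xsrf-token': decode_cookie_value(s.cookies.get_dict()['XSRF-TOKEN']). An extra unquote pass is harmless only if the decoded token contains no literal % sequences, so a fixed number of passes is safer than decoding in a loop.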

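【Discussion】:

Thanks, this was helpful. Could you share how you debugged the problem and worked out that the 'x-xsrf-token' needed to be decoded? Thanks! @艾哈迈德萨布里

【Solution 2】:

Hello, this script retrieves the site data correctly, but I do not know how to read the value 30.67 from the target element, whose markup ends in highlightValue('lastPrice')">30.67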

#! /usr/bin/env python3
#-*- coding:Utf8 -*-

from lxml import html
import urllib3
import time

url = "https://www.barchart.com/stocks/quotes/$VIX"
http = urllib3.PoolManager()
# Adjacent string literals are concatenated, so the User-agent stays one value.
r = http.request('GET', url, headers={
    'User-agent': 'Mozilla/5.0 (Windows NT 5.1) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.16 '
                  'Safari/537.36',
    'Cookie': 'cookie_name=cookie_value'})
time.sleep(0.5)
page = r.data
data_string = page.decode('utf-8', errors='ignore')
tree = html.fromstring(data_string)
x_pat= "//*[@id='main-content-column']/div/div[1]/div[3]/span[1]"
# target element: <span class="last-change" data-ng-class="highlightValue('lastPrice')">30.67</span>
valtab = tree.xpath(x_pat)
print(valtab) # [<Element span at 0x38573b0>]
for line in valtab:
    l=line.items()
    print(l) #[('class', 'bold')]

【Discussion】:

Hello, this script retrieves the site data fine, but I do not know how to read the value ('lastPrice')">30.67
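
On the question in Solution 2: items() returns an element's attributes, which is why the loop prints [('class', 'bold')] instead of the price. The text between the tags is available through .text or text_content(). Below is a minimal sketch on a made-up HTML fragment modelled on the target element quoted in the script's comment. The helper name read_last_price and the class-based XPath are assumptions, and the HTML that Barchart actually serves may differ from what the browser shows if the value is filled in by JavaScript.

```python
from lxml import html

# Made-up fragment modelled on the target element quoted in the post;
# it is not a copy of the real Barchart page.
SAMPLE_HTML = """
<div id="main-content-column">
  <span class="last-change"
        data-ng-class="highlightValue('lastPrice')">30.67</span>
</div>
"""


def read_last_price(page_source):
    """Return the text of the first span.last-change as a float, or None.

    Hypothetical helper: selecting by class is an assumption, chosen because
    it is less brittle than a long positional XPath.
    """
    tree = html.fromstring(page_source)
    matches = tree.xpath(".//span[@class='last-change']")
    if not matches:
        return None
    # text_content() returns the text inside the element, including children.
    return float(matches[0].text_content().strip())


print(read_last_price(SAMPLE_HTML))
```

If the function returns None for the downloaded page, the XPath did not match the served HTML, and the value most likely has to be fetched from the internal API as in Solution 1.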