Web scraping CME: answers

【Question title】: Webscraping CME
【Posted】: 2021-05-10 21:48:21
【Question】:

I am trying to scrape the following page:

https://www.cmegroup.com/trading/interest-rates/us-treasury/10-year-us-treasury-note_quotes_volume_voi.html#tradeDate=20210507

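In particular, I want the totals from the first table (block trades, EFP, EFR, etc.).

When I inspect the page in the browser, what I see differs from what I get when I actually scrape it and look at the "page source". That makes it hard for me to find the data (I am new to this).

After some digging, I found the data at https://www.cmegroup.com/CmeWS/exp/voiProductDetailsViewExport.ctl?media=json&tradeDate=20210507&reportType=F&productId=316, which is an Excel file.

My code so far is: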

from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

# Note: this dict is defined but never passed to Request below.
header = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
  'AppleWebKit/537.11 (KHTML, like Gecko) '
  'Chrome/23.0.1271.64 Safari/537.11',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
  'Accept-Encoding': 'none',
  'Accept-Language': 'en-US,en;q=0.8',
  'Connection': 'keep-alive'}

url = "https://www.cmegroup.com/content/cmegroup/en/trading/interest-rates/us-treasury/10-year-us-treasury-note_quotes_volume_voi.html"
r = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
response = urlopen(r)
# Parses only the static HTML returned by the server.
soup = BeautifulSoup(response, 'lxml')

In short, can anyone recommend a better approach than wrangling the Excel file? Thanks!

【Question comments】:

Tags: python web-scraping beautifulsoup

【Solution 1】:

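When you visit this page in a browser, the DOM is populated asynchronously using JavaScript. You should not expect BeautifulSoup to work on a page like this, because BeautifulSoup only sees what is baked directly into the document at the moment the server serves it to you. It does not execute scripts, so anything added afterwards never reaches it.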
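To make the point concrete, here is a minimal sketch using only the standard library's `html.parser` rather than BeautifulSoup. The `STATIC_HTML` string is a hypothetical stand-in for what a server might send (an empty table shell plus a script), not the actual CME markup. Any parser that does not run JavaScript finds no table cells in it.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for a server response: an empty table shell plus a
# script that a browser would run later to fill the table in.
STATIC_HTML = """
<html><body>
  <table id="totals"><tbody></tbody></table>
  <script>/* a browser would fetch JSON here and add rows */</script>
</body></html>
"""


class CellCounter(HTMLParser):
    """Counts the <td> elements present in the markup it is fed."""

    def __init__(self):
        super().__init__()
        self.cells = 0

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.cells += 1


def count_cells(html: str) -> int:
    """Return how many <td> start tags appear in the given HTML."""
    parser = CellCounter()
    parser.feed(html)
    return parser.cells


# The static document has no cells, because the script is never executed.
print(count_cells(STATIC_HTML))
```

The same count on markup that already contains rows would be non-zero, which is the difference between "page source" and what the browser inspector shows after scripts have run.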

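Logging my network traffic showed that the browser makes multiple XHR (XMLHttpRequest) HTTP GET requests to REST APIs. One of them returns JSON containing the information you are looking for. All you have to do is imitate that HTTP GET request: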

def main():

    import requests

    url = "https://www.cmegroup.com/CmeWS/mvc/Volume/Details/F/316/20210507/F"

    params = {
        "tradeDate": "20210507",
        "pageSize": "50",
        "_": "1620683546888"
    }

    headers = {
        "Accept": "application/json",
        "Accept-Encoding": "gzip, deflate",
        "User-Agent": "Mozilla/5.0"
    }

    response = requests.get(url, params=params, headers=headers)
    response.raise_for_status()

    data = response.json()

    print("Block trades (total volume): {}".format(data["totals"]["blockVolume"]))
    print("EFP (total volume): {}".format(data["totals"]["efpVol"]))
    print("EFR (total volume): {}".format(data["totals"]["efrVol"]))

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

Block trades (total volume): 7,500
EFP (total volume): 23,958
EFR (total volume): 34,486
>>>
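The totals in the output above are printed with thousands separators, which suggests the JSON delivers them as strings rather than numbers. If you want to do arithmetic on them, a small conversion step helps. The sketch below runs offline on a hand-written `sample` dict that copies the three values shown above; `to_int` is a hypothetical helper, not part of any API, and the real response may contain other fields or formats.

```python
def to_int(value: str) -> int:
    """Convert a string such as "7,500" to the integer 7500."""
    return int(value.replace(",", ""))


# Hand-written sample shaped like the fields the answer reads; the values are
# copied from the output above, not fetched from the live endpoint.
sample = {
    "totals": {
        "blockVolume": "7,500",
        "efpVol": "23,958",
        "efrVol": "34,486",
    }
}

totals = {key: to_int(val) for key, val in sample["totals"].items()}
print(totals)
print(sum(totals.values()))
```

With a live response you would pass `data["totals"]` through the same dict comprehension, limited to the keys you care about.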

Take a look at this other answer I posted, where I go into more depth on how to log your network traffic, find REST API endpoints, and imitate the requests.

【Comments】:
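If you need other trade dates, the request URL can be built from its parts. The sketch below is based on assumptions, not on documented behavior: it guesses that `316` in the path is the product ID and `20210507` the trade date (the same values appear as `productId` and `tradeDate` in the export URL from the question), and that the `_` parameter is a millisecond timestamp used as a cache-buster. The meaning of the trailing `F` is unknown and is kept unchanged. `build_volume_url` is a hypothetical helper, and the endpoint itself may change or disappear.

```python
import time
from datetime import datetime
from urllib.parse import urlencode

# Path prefix copied from the answer's URL.
BASE = "https://www.cmegroup.com/CmeWS/mvc/Volume/Details/F"


def build_volume_url(product_id: int, trade_date: str, page_size: int = 50) -> str:
    """Build a URL shaped like the one in the answer (assumed pattern)."""
    # Require exactly eight digits, then check it is a real calendar date.
    if len(trade_date) != 8 or not trade_date.isdigit():
        raise ValueError("trade_date must look like YYYYMMDD")
    datetime.strptime(trade_date, "%Y%m%d")

    query = urlencode({
        "tradeDate": trade_date,
        "pageSize": page_size,
        "_": int(time.time() * 1000),  # assumed cache-busting timestamp
    })
    return f"{BASE}/{product_id}/{trade_date}/F?{query}"


print(build_volume_url(316, "20210507"))
```

The resulting string can be passed straight to `requests.get` together with the headers from the answer, in place of the separate `url` and `params` arguments.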