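【Question title】: python request to get html page is missing the article element contents
【Posted】: 2021-12-23 15:34:42
【Question description】:

I am trying to use Python to log data from this website, but the response I get is missing the "dashboard" data. The article element that should contain it comes back empty, like this: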

<article id="dashboard"></article>

I make the request like this:

import requests
page = requests.get(url)

How can I get the data from the website consistently and quickly?

【Question comments】:

    Tags: javascript python html web-scraping


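    【Solution 1】:

    What you are seeing here is a web app driven by JavaScript and a WebSocket.

    When you connect to the page, the JavaScript opens a WebSocket connection to the server, and the server keeps pushing data over it to the JavaScript client. The client then unpacks that data into the HTML content you see. A plain requests.get() only fetches the initial HTML and never runs that JavaScript, which is why the article element is empty.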
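
A quick way to confirm this diagnosis is to check the static HTML for text inside the element. The following is a minimal sketch using only the standard library. `STATIC_HTML` is a stand-in for `page.text` from the question, and `ArticleTextCollector` and `dashboard_text` are hypothetical helper names, not part of any library.

```python
from html.parser import HTMLParser


class ArticleTextCollector(HTMLParser):
    """Collects non-blank text found inside <article id="dashboard">."""

    def __init__(self):
        super().__init__()
        self.inside = False   # True while between the article's tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article" and dict(attrs).get("id") == "dashboard":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "article":
            self.inside = False

    def handle_data(self, data):
        if self.inside and data.strip():
            self.chunks.append(data.strip())


def dashboard_text(html):
    """Return the text chunks inside the dashboard article, if any."""
    parser = ArticleTextCollector()
    parser.feed(html)
    return parser.chunks


# Stand-in for page.text: the element exists but holds no content.
STATIC_HTML = '<html><body><article id="dashboard"></article></body></html>'
print(dashboard_text(STATIC_HTML))
```

An empty result here means the content is filled in later by the client, so no amount of retrying the same HTTP request will produce it.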

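    If you fire up a web inspector, you can see this WebSocket connection being opened:

    Now how do we replicate this in a scraper?

    We need a WebSocket client, and we need to send the same outgoing messages (the green ones in the inspector) to start receiving the data you want. For example, with the websocket-client package we can do: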

    from websocket import create_connection
    ws = create_connection("wss://bitcoin.clarkmoody.com/dashboard/ws")
    
    # replicate the green messages
    ws.send("""{"op":"c","ch":"","pl":{"c":"4de43be4236035c5","s":"9f6e08f07c263998"}}""")
    ws.send("""{"op":"sub","ch":"mod"}""")
    ws.send("""{"op":"sub","ch":"sta"}""")
    ws.send("""{"op":"sub","ch":"sys"}""")
    ws.send("""{"op":"sub","ch":"upd"}""")
    
    # then you can start receiving the data
    
    while True:
        print(ws.recv())
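
The subscription frames above could also be built from a list of channel names instead of being written out by hand. This is a small sketch, assuming the four channel names captured above are the ones you need. `subscribe_message` is a hypothetical helper, and the compact separators are used only to match the captured frames exactly. Whether the server requires that exact form is not known.

```python
import json


def subscribe_message(channel):
    """Build a subscribe frame shaped like the ones captured in the inspector."""
    return json.dumps({"op": "sub", "ch": channel}, separators=(",", ":"))


# Channel names taken from the captured frames above.
CHANNELS = ["mod", "sta", "sys", "upd"]
messages = [subscribe_message(channel) for channel in CHANNELS]

for message in messages:
    print(message)  # with a live connection: ws.send(message)
```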
    

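    Now it is up to you to reverse engineer the rest. The initial messages appear to be subscriptions of some kind (op: sub, ch: upd probably means "operation: subscribe, channel: UPD"). Either way, the script above should print the following as the first response message and then continue with price updates: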

    {"op":"dat","ch":"mod","pl":[{"rows":[{"slug":"p-row","cells":[{"type":"label","slug":"p-label","quiet":true,"label":"Price"},{"type":"price","slug":"p","def":"Market price of Bitcoin","sep":true,"unit":"$","prefix":true,"places":2}]},{"slug":"sd-row","cells":[{"type":"label","slug":"sd-label","quiet":true,"label":"Sats per Dollar"},{"type":"integer","slug":"sd","def":"Value of one US Dollar, expressed in Satoshis","sep":true,"places":0}]},{"slug":"c-row","cells":[{"type":"label","slug":"c-label","quiet":true,"label":"Market Capitalization"},{"type":"price","slug":"c","def":"Product of market price times total mined supply","sep":true,"unit":"$","prefix":true,"places":2}]}],"slug":"markets","name":"Markets","order":10,"help":"Bitcoin spot price and futures information","feature":false,"headless":false},{"rows":[{"slug":"links-row","noInfo":true,"cells":<..TOO LONG FOR SO..>
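
Each frame is a JSON string, so it can be decoded with `json.loads` and inspected by its `op` and `ch` fields. The sketch below maps value slugs to their human-readable labels from a "mod" message. It assumes every row pairs one label cell with one or more value cells, as in the part of the output visible above. `SAMPLE` is a shortened copy of that output, and `field_labels` is a hypothetical helper name.

```python
import json

# Shortened copy of the first response shown above (one module, one row).
SAMPLE = (
    '{"op":"dat","ch":"mod","pl":[{"rows":[{"slug":"p-row","cells":['
    '{"type":"label","slug":"p-label","quiet":true,"label":"Price"},'
    '{"type":"price","slug":"p","def":"Market price of Bitcoin","sep":true,'
    '"unit":"$","prefix":true,"places":2}]}],'
    '"slug":"markets","name":"Markets","order":10}]}'
)


def field_labels(raw_message):
    """Map each value cell's slug to the label of the row it sits in."""
    message = json.loads(raw_message)
    if message.get("op") != "dat" or message.get("ch") != "mod":
        return {}
    labels = {}
    for module in message.get("pl", []):
        for row in module.get("rows", []):
            cells = row.get("cells", [])
            # Assumption: at most one label cell per row.
            label = next(
                (c.get("label") for c in cells if c.get("type") == "label"),
                None,
            )
            for cell in cells:
                slug = cell.get("slug")
                if cell.get("type") != "label" and slug is not None:
                    labels[slug] = label
    return labels


print(field_labels(SAMPLE))
```

A mapping like this may help when matching the short keys in later messages to the names shown on the page, though the meaning of the other channels still has to be worked out by inspection.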
    

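    【Comments】:

    • @Granitosaurust Thank you very much. I found that the 4th recv() returns a dictionary containing most of the values I need, and I found the matching keys by inspecting the page.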
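
Relying on the 4th `recv()` depends on the order in which the server happens to send frames. A possibly more robust variant is to read until a frame with the wanted `op` and `ch` arrives. This is only a sketch: `recv_until` and `FakeConnection` are hypothetical names, the frames below are placeholders with empty payloads, and which channel actually carries the values is something to confirm by inspecting the live messages.

```python
import json


def recv_until(ws, op, channel, max_messages=20):
    """Read frames from ws until one matches op and channel, or give up."""
    for _ in range(max_messages):
        message = json.loads(ws.recv())
        if message.get("op") == op and message.get("ch") == channel:
            return message
    return None


class FakeConnection:
    """Offline stand-in for the object returned by create_connection()."""

    def __init__(self, frames):
        self._frames = iter(frames)

    def recv(self):
        return next(self._frames)


# Placeholder frames; real payloads are much larger.
frames = ['{"op":"dat","ch":"mod","pl":[]}', '{"op":"dat","ch":"sta","pl":null}']
found = recv_until(FakeConnection(frames), "dat", "sta")
print(found["ch"])
```

With a real connection, `ws` would be the object from `create_connection(...)` after the subscribe messages have been sent.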