【Question title】: Random 3-4 character long strings in an HTTP response (Python)
【Posted】: 2021-03-03 02:37:58
【Question】:

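I am trying to make a request using the socket module in Python. It successfully makes the request, receives the response, and decodes it. When I look at the HTML document, everything is correct except that 3-4 seemingly random strings, each a few characters long, are scattered through it. I think my code is correct, but I am not 100% sure. Here is my code: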

import select
import socket
import ssl

from bs4 import BeautifulSoup

# Not defined in the original snippet; assumed to be the HTTP line terminator
newline = "\r\n"

def recive_data(get, timeout):
  # Wait up to `timeout` seconds for the socket to become readable
  ready = select.select([get], [], [], timeout)
  if ready[0]:
    return get.recv(4096)
  return b""

def get_file(website, port, file, https=False):
  data = []

  if https:
    get = ssl.create_default_context().wrap_socket(socket.socket(socket.AF_INET, socket.SOCK_STREAM), server_hostname=website)
  else:
    get = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  get.connect((website, port))
  get.sendall(f"GET {file} HTTP/1.1\r\nHost: {website}:{port}\r\n\r\n".encode())
  while True:
    new_data = recive_data(get, 5)
    if new_data:
      data.append(new_data)
    else:
      break

  # Decode once after joining, so a multi-byte character split across
  # two recv() calls does not raise UnicodeDecodeError
  data = b"".join(data).decode(errors="replace")
  header = data[0:data.find(newline+newline)]
  data = data[data.find(newline+newline):data.rfind(f"{newline}0{newline}{newline}")]

  data = BeautifulSoup(data, 'html.parser').prettify()

  get.close()
  return (header, data)

If I enter https://stackoverflow.com, it outputs:

30d
<!DOCTYPE html>
<html class="html__responsive html__unpinned-leftnav">
 <head>
  <title>
   Stack Overflow - Where Developers Learn, Share, &amp; Build Careers
  </title>
  <link href="https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico?v=ec617d715196" rel="shortcut icon"/>
  <link href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a" rel="apple-touch-icon"/>
  <link href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a" rel="image_src"/>
  <link href="/opensearch.xml" rel="search" title="Stack Overflow" type="application/opensearchdescription+xml"/>
  <meta content="Stack Overflow is the largest, most trusted online communi
20d0
ty for developers to learn, share​ ​their programming ​knowledge, and build their careers." name="description"/>
  <meta content="width=device-width, height=device-height, initial-scale=1.0, minimum-scale=1.0" name="viewport"/>
  <meta content="website" property="og:type">

And so on... However, some websites have more of these strings than others, and I can't figure out why. Any help is greatly appreciated!

【Comments】:

    Tags: python html python-3.x https get


    【Solution 1】:

    The last line of the response headers gives you a clue:

    HTTP/1.1 200 OK
    Connection: keep-alive
    cache-control: private
    ...
    transfer-encoding: chunked
    

    The transfer-encoding header means that what follows the headers is not plain HTML. From the spec (RFC 2616, section 3.6.1):

       The chunked encoding modifies the body of a message in order to
       transfer it as a series of chunks, each with its own size indicator,
       followed by an OPTIONAL trailer containing entity-header fields
    ...
       The chunk-size field is a string of hex digits indicating the size of
       the chunk. The chunked encoding is ended by any chunk whose size is
       zero, followed by the trailer, which is terminated by an empty line.
    

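    In other words, what you are seeing is a hexadecimal number giving the number of bytes in the next chunk. There may be more than one chunk. You need to check for that HTTP header and, if it is present, find all the chunks and join them together before parsing the page as HTML.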
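    As a rough illustration of that step, here is a minimal sketch that works on the raw response bytes. The names `decode_chunked` and `split_response` are hypothetical helpers, not part of the question's code, and the sketch assumes the whole response has already been received; it ignores trailer headers and any other transfer codings. In the output shown in the question, `30d` would be a chunk of 781 bytes and `20d0` a chunk of 8400 bytes.

```python
def decode_chunked(body: bytes) -> bytes:
    """Join the chunks of a chunked HTTP body into the original payload."""
    decoded = bytearray()
    pos = 0
    while True:
        # Each chunk starts with a size line terminated by CRLF
        line_end = body.find(b"\r\n", pos)
        if line_end == -1:
            raise ValueError("incomplete chunk-size line")
        # Drop optional chunk extensions that may follow a ';'
        size_field = body[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_field, 16)  # the size is written in hexadecimal
        if size == 0:
            break  # last chunk; trailer headers (if any) are ignored here
        start = line_end + 2
        end = start + size
        if end > len(body):
            raise ValueError("incomplete chunk data")
        decoded += body[start:end]
        pos = end + 2  # skip the CRLF that follows the chunk data
    return bytes(decoded)


def split_response(raw: bytes):
    """Split a raw HTTP response into (header text, decoded body bytes)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    header_text = head.decode("iso-8859-1")
    chunked = False
    # Skip the status line; header names are case-insensitive
    for line in header_text.split("\r\n")[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "transfer-encoding" and "chunked" in value.lower():
            chunked = True
    if chunked:
        body = decode_chunked(body)
    return header_text, body


# Small hand-made response used only to exercise the sketch
raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"transfer-encoding: chunked\r\n"
    b"\r\n"
    b"5\r\nHello\r\n"
    b"7\r\n, world\r\n"
    b"0\r\n\r\n"
)
header, body = split_response(raw)
print(body.decode())  # Hello, world
```

    Working on bytes rather than on the decoded string matters here, because chunk sizes count bytes, not characters. The decoded body can then be passed to BeautifulSoup as before.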

    【Comments】:
