How do I parse raw HTTP bytes and get the HTTP content in Python?

【Question Title】: How to parse HTTP raw bytes and get the HTTP content in python?
【Posted】: 2017-12-07 02:53:12
【Question】:

I am sniffing packets with scapy and have captured some HTTP response packets as raw bytes that I cannot parse. For example:

  b'HTTP/1.1 200 OK\r\nDate: Thu, 07 Dec 2017 02:44:18 GMT\r\nServer:Apache/2.4.18 (Ubuntu)\r\nLast-Modified: Tue, 14 Nov 2017 05:51:36 GMT\r\nETag: "2c39-55deafadf0ac0-gzip"\r\nAccept-Ranges: bytes\r\nVary: Accept-Encoding\r\nContent-Encoding: gzip\r\nContent-Length: 3186\r\nConnection: close\r\nContent-Type: text/html\r\n\r\n\x1f\x8b'

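Is there a way to get the content part of this byte string so that I can decode it with the gzip library? I don't want to use requests to fetch the HTTP response, because I only want to process the raw packets I already have.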

【Comments】:

Tags: python http

【Solution 1】:

There is no built-in way to parse a raw HTTP response like this and handle the compression correctly. I would use urllib3:

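# Note: HTTPResponse.from_httplib() is part of urllib3 1.x; it is not available in urllib3 2.x.
import urllib3

from io import BytesIO
from http.client import HTTPResponse

class BytesIOSocket:
    """Fake socket that lets http.client read from an in-memory buffer."""

    def __init__(self, content):
        self.handle = BytesIO(content)

    def makefile(self, mode):
        # http.client only calls makefile('rb') on the socket it is given
        return self.handle

def response_from_bytes(data):
    sock = BytesIOSocket(data)

    response = HTTPResponse(sock)
    response.begin()  # parse the status line and headers

    # urllib3 wraps the response and decodes the gzip body when .data is read
    return urllib3.HTTPResponse.from_httplib(response)

if __name__ == '__main__':
    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect(('httpbin.org', 80))
    sock.send(b'GET /gzip HTTP/1.1\r\nHost: httpbin.org\r\n\r\n')

    # A single recv() may return only part of a larger response
    raw_response = sock.recv(8192)

    response = response_from_bytes(raw_response)
    print(response.headers)
    print(response.data)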
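
If installing urllib3 is not an option, roughly the same result can be had with the standard library alone. The following is a minimal sketch, not part of the original answer: `FakeSocket` and `parse_response` are hypothetical names, the sample response is built in memory rather than sniffed, and only gzip is handled as a content encoding.

```python
import gzip
from http.client import HTTPResponse
from io import BytesIO


class FakeSocket:
    """Minimal socket stand-in: http.client only needs makefile()."""

    def __init__(self, data):
        self._file = BytesIO(data)

    def makefile(self, *args, **kwargs):
        return self._file


def parse_response(raw):
    """Return (status, headers, body) for one complete raw HTTP response."""
    response = HTTPResponse(FakeSocket(raw))
    response.begin()        # parses the status line and headers
    body = response.read()  # honours Content-Length and chunked encoding
    if response.getheader("Content-Encoding", "").lower() == "gzip":
        body = gzip.decompress(body)
    return response.status, dict(response.getheaders()), body


# Build a small gzip-compressed response in memory (illustrative data only).
payload = gzip.compress(b"<html>hello</html>")
raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Encoding: gzip\r\n"
    b"Content-Length: " + str(len(payload)).encode("ascii") + b"\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
) + payload

status, headers, body = parse_response(raw)
print(status, headers["Content-Type"], body)
```

Note that the bytes passed in must hold the whole response. If sniffed TCP segments have not been reassembled and the body is shorter than Content-Length, `response.read()` raises `http.client.IncompleteRead`.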

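【Comments】:

Thank you very much! This is exactly what I needed!
Hi, I have another question. How can I parse the raw bytes of an HTTP request?
@user6456568: What do you mean? In my example code, raw_response is a raw HTTP response with a gzip-compressed body.
I have some raw bytes that may be either HTTP requests or responses, and I want to parse both kinds.
@user6456568: Parsing HTTP requests is a different question: stackoverflow.com/questions/39090366/…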
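
For the request side, one commonly used approach is to reuse the standard library's `http.server.BaseHTTPRequestHandler` without a real socket. The sketch below is an illustration under assumptions, not taken from the linked question: `ParsedRequest` and `is_response` are hypothetical names, and telling requests from responses by the leading `HTTP/` is only a heuristic.

```python
from http.server import BaseHTTPRequestHandler
from io import BytesIO


class ParsedRequest(BaseHTTPRequestHandler):
    """Parse raw HTTP request bytes without opening a socket."""

    def __init__(self, raw):
        # Deliberately skip the parent constructor, which expects a live connection.
        self.rfile = BytesIO(raw)
        self.raw_requestline = self.rfile.readline()
        self.error_code = self.error_message = None
        self.parse_request()  # fills in command, path, request_version, headers

    def send_error(self, code, message=None, explain=None):
        # Record the parse error instead of writing a reply to a socket.
        self.error_code = code
        self.error_message = message


def is_response(raw):
    """Heuristic: a response starts with the HTTP version, a request with a method."""
    return raw.startswith(b"HTTP/")


raw_request = b"GET /gzip HTTP/1.1\r\nHost: httpbin.org\r\n\r\n"
if not is_response(raw_request):
    request = ParsedRequest(raw_request)
    print(request.command, request.path, request.headers["Host"])
```

Any request body remains unread in `request.rfile`; it can be read using the length given by the Content-Length header.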

【Solution 2】:

You can decode the bytes to get their string value:

response_bytes.decode('utf-8')

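You can then use Beautiful Soup to parse the returned text into whatever parts you want.

【Comments】:

Thanks. Why do I get this error? UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 302: invalid start byte
@user6456568 - Sorry, I'm not the best person to help with decoding problems. Sorry...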
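
The error is consistent with the sample in the question: the headers are plain text, but the body is gzip-compressed binary, and 0x8b is the second byte of the gzip magic number `\x1f\x8b` that starts it. Decoding the whole byte string as UTF-8 therefore fails at the body. A minimal sketch of splitting first and decoding afterwards follows; `split_response` is a hypothetical helper, the sample bytes are built in memory, the page is assumed to be UTF-8, and chunked transfer encoding is not handled.

```python
import gzip


def split_response(raw):
    """Split a raw HTTP response into (status_line, headers, body_bytes)."""
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no blank line separating headers from body")
    # Header bytes are decoded as ISO-8859-1, which never fails on any byte.
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


# Illustrative response with a gzip-compressed UTF-8 body.
payload = gzip.compress("<p>héllo</p>".encode("utf-8"))
raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Encoding: gzip\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
) + payload

status_line, headers, body = split_response(raw)
if headers.get("content-encoding") == "gzip":
    body = gzip.decompress(body)  # decompress before decoding to text
text = body.decode("utf-8")       # assumption: the page is UTF-8
print(status_line)
print(text)
```

Only after this step is the text suitable for an HTML parser such as Beautiful Soup.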