How to parse a raw HTTP request in Python 3? Answers

【Question title】: How to parse raw HTTP request in Python 3?
【Posted】: 2016-12-29 15:27:54
【Question description】:

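I am looking for a native way to parse an HTTP request in Python 3.

This question shows a way to do it in Python 2, but it relies on now-deprecated modules (and on Python 2), and I am looking for a way to do it in Python 3.

I would mainly like to find out which resource was requested and parse the headers of a simple request, for example: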

GET /index.html HTTP/1.1
Host: localhost
Connection: keep-alive
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8

Could someone show me a basic way to parse this request?

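【Comments】:

Your first sentence suggests you know you should just use a library (e.g. urllib3, requests). Then you say you are trying to do this in Python 3 and don't know how. Why not just use requests?
@JonathonReinhart I am working in an environment where third-party libraries are not allowed.
urllib is not third-party
And this class in the standard library will do what you want. docs.python.org/3/library/…
@cricket_007 He didn't mention urllib. He mentioned the third-party urllib3.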
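
For readers who cannot install third-party packages, here is a minimal sketch of the standard-library route hinted at in the comments above. The link in that comment is truncated, so the choice of `http.server.BaseHTTPRequestHandler` is an assumption, not a confirmed reading of what the commenter meant, and the `HTTPRequest` wrapper name is hypothetical. The sketch feeds the raw bytes to the handler's own `parse_request()` method instead of reading from a socket.

```python
from http.server import BaseHTTPRequestHandler
from io import BytesIO


class HTTPRequest(BaseHTTPRequestHandler):
    """Hypothetical wrapper: parse raw request bytes without a socket."""

    def __init__(self, request_bytes):
        # Deliberately skip the parent __init__, which expects a live
        # socket; parse_request() only needs rfile and raw_requestline.
        self.rfile = BytesIO(request_bytes)
        self.raw_requestline = self.rfile.readline()
        self.error_code = self.error_message = None
        self.parse_request()

    def send_error(self, code, message=None, explain=None):
        # Record parse errors instead of writing a response to a client.
        self.error_code = code
        self.error_message = message


raw = (
    b"GET /index.html HTTP/1.1\r\n"
    b"Host: localhost\r\n"
    b"Connection: keep-alive\r\n"
    b"\r\n"
)
request = HTTPRequest(raw)
print(request.command, request.path, request.request_version)
print(request.headers["Host"])
```

After parsing, `command`, `path`, and `headers` hold the method, the requested resource, and the header fields; `error_code` stays `None` unless the request could not be parsed.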

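Tags: python python-3.x http

【Solution 1】:

Each header field is terminated by a carriage return followed by a line feed, and within each field the name and the value are separated by a colon. So, assuming you already have the raw request as a string (called resp here), it should be as simple as: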

fields = resp.split("\r\n")
fields = fields[1:]  # ignore the request line (GET / HTTP/1.1)
output = {}
for field in fields:
    # split each line into header name and value
    # (assumes the line is non-empty and the value itself contains no ':')
    key, value = field.split(':')
    output[key] = value

Update 4/13

Using the example HTTP request from the linked post:

resp = 'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\nHost: www.google.com\r\nConnection: keep-alive\r\nAccept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\r\nUser-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.45 Safari/534.13\r\nAccept-Encoding: gzip,deflate,sdch\r\nAvail-Dictionary: GeNLY2f-\r\nAccept-Language: en-US,en;q=0.8\r\n'


fields = resp.split("\r\n")
fields = fields[1:]  # ignore the request line (GET ... HTTP/1.1)
output = {}
for field in fields:
    if not field:
        continue  # skip the empty string left by the trailing \r\n
    key, value = field.split(':')
    output[key] = value
print(output)

An extra check is needed to make sure field is not empty. Output:

{'Host': ' www.google.com', 'Connection': ' keep-alive', 'Accept': ' application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5', 'User-Agent': ' Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.45 Safari/534.13', 'Accept-Encoding': ' gzip,deflate,sdch', 'Avail-Dictionary': ' GeNLY2f-', 'Accept-Language': ' en-US,en;q=0.8'}

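【Comments】:

That code doesn't work. Patching it by adding maxsplit=1 to split() would actually make it better. You may also want to split on \n instead of \r\n so that it is more general, and then not forget to strip the trailing \r (if there is one).
You may want to consider a dedicated library such as kiss-headers to handle headers properly.
@Ousret - Updated the post to show that the code works even on the example request from the post. I did need a quick check for empty fields, but for the sake of the example the code holds up. As for using a library, that is a good default choice.
Look at this header: User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:50.0) Gecko/20100101 Firefox/50.0 The code will fail because of it. ;)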
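
The fixes suggested in the comments above can be sketched as follows: split on `\n` and trim any trailing `\r`, stop at the blank line that ends the header section, and pass `maxsplit=1` so that colons inside a value (as in `rv:50.0`) survive. The function name `parse_raw_request` and the decision to skip lines without a colon are illustrative assumptions, not part of the original answer. Repeated header names simply overwrite each other here, which a real parser would need to handle.

```python
def parse_raw_request(raw_request):
    """Return (request_line, headers) from a raw HTTP request string."""
    lines = raw_request.split("\n")
    request_line = lines[0].rstrip("\r")
    headers = {}
    for line in lines[1:]:
        line = line.rstrip("\r")  # tolerate both \r\n and bare \n endings
        if not line:
            break  # a blank line marks the end of the headers
        if ":" not in line:
            continue  # assumption: silently skip malformed lines
        # maxsplit=1 keeps colons that appear inside the value
        name, value = line.split(":", 1)
        headers[name.strip()] = value.strip()
    return request_line, headers


raw = (
    "GET /index.html HTTP/1.1\r\n"
    "Host: localhost:8080\r\n"
    "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:50.0) "
    "Gecko/20100101 Firefox/50.0\r\n"
    "\r\n"
)
request_line, headers = parse_raw_request(raw)
print(request_line)
print(headers)
```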

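【Solution 2】:

You can use the email.message.Message class from the email module in the standard library.

Adapting the answer from the question you linked, here is a Python 3 example that parses the HTTP headers.

Suppose you want to build a dictionary containing all the header fields: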

import email
import pprint
from io import StringIO

request_string = 'GET / HTTP/1.1\r\nHost: localhost\r\nConnection: keep-alive\r\nCache-Control: max-age=0\r\nUpgrade-Insecure-Requests: 1\r\nUser-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8\r\nAccept-Encoding: gzip, deflate, sdch\r\nAccept-Language: en-US,en;q=0.8'

# pop the first line so we only process headers
_, headers = request_string.split('\r\n', 1)

# construct a message from the request string
message = email.message_from_file(StringIO(headers))

# construct a dictionary containing the headers
headers = dict(message.items())

# pretty-print the dictionary of headers
pprint.pprint(headers, width=160)

If you run this at a Python prompt, the result looks like this:

{'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
 'Accept-Encoding': 'gzip, deflate, sdch',
 'Accept-Language': 'en-US,en;q=0.8',
 'Cache-Control': 'max-age=0',
 'Connection': 'keep-alive',
 'Host': 'localhost',
 'Upgrade-Insecure-Requests': '1',
 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}

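【Comments】:

This is great. And yes, sorry that my original request was badly formatted. But where do I get the resource from (i.e. the actual resource that was requested)? Since we pop that line off, how do I know what was actually requested?
@Startec It will be on the first line, along with the request method and the protocol version.
So I have to do some string splitting on the first line?
Yes, you probably just need to split the first line on whitespace to get the resource name.
Thanks for the excellent answer. Could you describe what the StringIO call is doing here?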
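
To tie the last few comments together, here is a small sketch that splits the request line on whitespace to recover the method, the resource, and the protocol version, and then parses the remaining header block. On the StringIO question: `email.message_from_file` expects a file-like object, so `StringIO` wraps the string in an in-memory text file; `email.message_from_string` accepts the string directly, which is what this sketch uses instead. The variable names are illustrative.

```python
import email

request_string = (
    "GET /index.html HTTP/1.1\r\n"
    "Host: localhost\r\n"
    "Connection: keep-alive\r\n"
    "\r\n"
)

# Separate the request line from the header block.
request_line, header_block = request_string.split("\r\n", 1)

# A well-formed request line has exactly three whitespace-separated parts;
# this unpacking raises ValueError otherwise.
method, resource, version = request_line.split()

# message_from_string takes the text directly, so no StringIO is needed.
message = email.message_from_string(header_block)
headers = dict(message.items())

print(method, resource, version)
print(headers)
```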