How do you get default headers in a urllib2 Request? (Answers)

【Question title】: How do you get default headers in a urllib2 Request?
【Posted】: 2009-03-02 20:24:41
【Question description】:

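I have a Python web client that uses urllib2. Adding HTTP headers to my outgoing requests is easy enough: I create a dictionary of the headers I want to add and pass it to the Request initializer.

However, other "standard" HTTP headers get added to the request alongside the custom ones I add explicitly. When I sniff the request with Wireshark, I see headers besides the ones I added myself. My question is: how do I get access to these headers? I want to log every request (including the full set of HTTP headers), and I can't figure out how.

Any pointers?

In a nutshell: how do I get all the outgoing headers from an HTTP request created by urllib2?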

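【Question discussion】:

Tags: python urllib2

【Solution 1】:

If you want to see the literal HTTP request that is sent out, and therefore see every last header exactly as it is represented on the wire, you can tell urllib2 to use your own version of HTTPHandler that prints out (or saves, or whatever) the outgoing HTTP request.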

import httplib, urllib2

class MyHTTPConnection(httplib.HTTPConnection):
    def send(self, s):
        print s  # or save them, or whatever!
        httplib.HTTPConnection.send(self, s)

class MyHTTPHandler(urllib2.HTTPHandler):
    def http_open(self, req):
        return self.do_open(MyHTTPConnection, req)

opener = urllib2.build_opener(MyHTTPHandler)
response = opener.open('http://www.google.com/')

The result of running this code is:

GET / HTTP/1.1
Accept-Encoding: identity
Host: www.google.com
Connection: close
User-Agent: Python-urllib/2.6

【Discussion】:

If you are connecting over SSL, use urllib2.HTTPSHandler (overriding https_open()) and httplib.HTTPSConnection instead.
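For reference, here is a minimal sketch of that HTTPS variant. It is written for Python 3, where urllib2 became urllib.request and httplib became http.client. The class names LoggingHTTPSConnection and LoggingHTTPSHandler are made up for this illustration, and the handler assumes the default TLS settings are acceptable.

```python
import http.client
import urllib.request


class LoggingHTTPSConnection(http.client.HTTPSConnection):
    """HTTPSConnection that prints every chunk handed to send()."""

    def send(self, data):
        print(data)  # or save it, or whatever
        super().send(data)


class LoggingHTTPSHandler(urllib.request.HTTPSHandler):
    def https_open(self, req):
        # Assumption: default TLS settings are fine. An SSL context passed to
        # this handler's constructor is not forwarded to the connection here.
        return self.do_open(LoggingHTTPSConnection, req)


# Passing the subclass replaces the stock HTTPSHandler in the opener chain.
opener = urllib.request.build_opener(LoggingHTTPSHandler)
# opener.open("https://www.example.com/")  # needs network access
```

send() receives the request bytes before they are encrypted, so the headers are still readable at this point.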

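【Solution 2】:

The urllib2 library uses OpenerDirector objects to handle the actual opening. Fortunately, the Python library provides defaults so you don't have to set one up yourself. It is, however, these OpenerDirector objects that add the extra headers.

To see what they are after the request has been sent (so that you can log it, for example):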

req = urllib2.Request(url='http://google.com')
response = urllib2.urlopen(req)
print req.unredirected_hdrs

(produces {'Host': 'google.com', 'User-agent': 'Python-urllib/2.5'} etc)

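unredirected_hdrs is where the OpenerDirectors dump their extra headers. Looking at req.headers alone shows only your own headers; the library leaves those untouched for you.

If you need to see the headers before the request is sent, you will need to subclass OpenerDirector in order to intercept the transmission.

Hope that helps.

EDIT: I forgot to mention that, once the request has been sent, req.header_items() gives you a list of tuples of all the headers, both your own and the ones added by the OpenerDirector. I should have mentioned this first, since it is the most straightforward :-) Sorry.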
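A rough Python 3 sketch of the three views described above (urllib2 became urllib.request in Python 3). The throwaway local server, the OkHandler class, the fetch_and_inspect function and the X-Custom header are all made up for this illustration, so that the example runs without internet access.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class OkHandler(BaseHTTPRequestHandler):
    """Throwaway local server that answers every GET with an empty 200."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_and_inspect():
    """Send one request to the local server and return the Request object."""
    server = HTTPServer(("127.0.0.1", 0), OkHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = "http://127.0.0.1:%d/" % server.server_port
        req = urllib.request.Request(url, headers={"X-Custom": "1"})
        # ProxyHandler({}) ignores any proxy settings in the environment.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        opener.open(req).close()
        return req
    finally:
        server.shutdown()
        server.server_close()


req = fetch_and_inspect()
print(req.headers)            # only the headers passed to Request
print(req.unredirected_hdrs)  # headers added while the request was prepared
print(req.header_items())     # both sets merged into a list of tuples
```

Note that Request normalizes header names with str.capitalize(), so X-Custom is stored as X-custom. None of the three views contains a Connection entry, which matches the comments further down.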

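EDIT 2: After your question about an example of defining your own handler, here is the sample I came up with. The concern with any tampering with the request chain is that we need to make sure the handler is safe for multiple requests, which is why I am uncomfortable simply replacing the definition of putheader on the HTTPConnection class directly.

Sadly, because the internals of HTTPConnection and AbstractHTTPHandler are very internal, we have to reproduce much of the code from the Python library to inject our custom behaviour. Assuming I have not goofed below and this works as well as it did in my 5 minutes of testing, be careful to revisit this override if you update your Python version, even by a revision number (i.e. 2.5.x to 2.5.y, or 2.5 to 2.6, etc.).

I should therefore mention that I am on Python 2.5.1. If you have 2.6 or, in particular, 3.0, you may need to adjust this accordingly.

Please let me know if this doesn't work. I am having waaaay too much fun with this question: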

import urllib2
import httplib
import socket


class CustomHTTPConnection(httplib.HTTPConnection):

    def __init__(self, *args, **kwargs):
        httplib.HTTPConnection.__init__(self, *args, **kwargs)
        self.stored_headers = []

    def putheader(self, header, value):
        self.stored_headers.append((header, value))
        httplib.HTTPConnection.putheader(self, header, value)


class HTTPCaptureHeaderHandler(urllib2.AbstractHTTPHandler):

    def http_open(self, req):
        return self.do_open(CustomHTTPConnection, req)

    http_request = urllib2.AbstractHTTPHandler.do_request_

    def do_open(self, http_class, req):
        # All code here lifted directly from the python library
        host = req.get_host()
        if not host:
            raise urllib2.URLError('no host given')

        h = http_class(host) # will parse host:port
        h.set_debuglevel(self._debuglevel)

        headers = dict(req.headers)
        headers.update(req.unredirected_hdrs)
        headers["Connection"] = "close"
        headers = dict(
            (name.title(), val) for name, val in headers.items())
        try:
            h.request(req.get_method(), req.get_selector(), req.data, headers)
            r = h.getresponse()
        except socket.error, err: # XXX what error?
            raise urllib2.URLError(err)
        r.recv = r.read
        fp = socket._fileobject(r, close=True)

        resp = urllib2.addinfourl(fp, r.msg, req.get_full_url())
        resp.code = r.status
        resp.msg = r.reason

        # This is the line we're adding
        req.all_sent_headers = h.stored_headers
        return resp

my_handler = HTTPCaptureHeaderHandler()
opener = urllib2.OpenerDirector()
opener.add_handler(my_handler)
req = urllib2.Request(url='http://www.google.com')

resp = opener.open(req)

print req.all_sent_headers

shows: [('Accept-Encoding', 'identity'), ('Host', 'www.google.com'), ('Connection', 'close'), ('User-Agent', 'Python-urllib/2.5')]
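As the version caveat above suggests, this code needs changes for Python 3, where urllib2 became urllib.request and httplib became http.client. The following is one possible port, not the answer author's code: instead of copying do_open(), it passes do_open() a small factory function in place of the connection class. RecordingConnection, CaptureHeaderHandler, QuietHandler and run_capture_demo are names invented for this sketch, and a throwaway local server stands in for www.google.com so that it runs offline.

```python
import http.client
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class RecordingConnection(http.client.HTTPConnection):
    """HTTPConnection that appends every putheader() call to a shared list."""

    def __init__(self, sink, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.stored_headers = sink

    def putheader(self, header, *values):
        self.stored_headers.append((header, values))
        super().putheader(header, *values)


class CaptureHeaderHandler(urllib.request.HTTPHandler):
    def http_open(self, req):
        captured = []  # fresh list per request, so repeated requests don't mix

        def factory(*args, **kwargs):
            # do_open() calls this where it would normally call the class.
            return RecordingConnection(captured, *args, **kwargs)

        resp = self.do_open(factory, req)
        req.all_sent_headers = captured
        return resp


class QuietHandler(BaseHTTPRequestHandler):
    """Throwaway local server so the sketch runs without internet access."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass


def run_capture_demo():
    server = HTTPServer(("127.0.0.1", 0), QuietHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # ProxyHandler({}) ignores any proxy settings in the environment.
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({}), CaptureHeaderHandler)
        req = urllib.request.Request("http://127.0.0.1:%d/" % server.server_port)
        opener.open(req).close()
        return req
    finally:
        server.shutdown()
        server.server_close()


req = run_capture_demo()
print(req.all_sent_headers)
```

Because the capture happens at the connection level, the list includes Connection, which req.header_items() does not report.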

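【Discussion】:

This is helpful. However, I am still not seeing all the headers (for example, Connection: close).
Hmm.... would you mind posting how you are constructing your request and how you are opening the connection?
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cookie_jar)) request = urllib2.Request(url, None, headers)
I don't think req.header_items() will include headers sent by the underlying HTTPConnection.
Justus is right, particularly about "Connection: close"... Openers have a method called "do_open", and that is where it gets added. It is added to a local variable in that function, which builds a completely separate set of headers; that local copy is discarded at the end of the function scope.

【Solution 3】:

How about something like this: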

import urllib2
import httplib

old_putheader = httplib.HTTPConnection.putheader
def putheader(self, header, value):
    print header, value
    old_putheader(self, header, value)
httplib.HTTPConnection.putheader = putheader

urllib2.urlopen('http://www.google.com')

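【Discussion】:

This is very close to what I need. The only problem is that when I call it in a loop, it keeps appending duplicate headers.
JUSTUS, this is so close.. if you have any other ideas, can you update your answer?
I don't understand what you mean by "loop". But given that this takes so much hackery, I wonder why you need this much logging. You would be better off using an HTTP proxy, letting it do all the logging, and talking to it with urllib.
Well.. I have a load testing tool that sends HTTP requests repeatedly. It has a logging/debug mode in which I want to log the complete HTTP request and response... including headers.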
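The comment does not show the loop, so the cause of the duplicates is a guess. One way they could arise is if the patch is installed again on each iteration, so that the wrappers stack up. Under that assumption, a context manager that installs the patch once per block and always restores the original is one way to keep repeated calls clean. This sketch uses Python 3 (http.client instead of httplib); capture_putheader and the host example.invalid are hypothetical names, and the patch is global, so it is not safe across threads.

```python
import contextlib
import http.client


@contextlib.contextmanager
def capture_putheader():
    """Temporarily record every HTTPConnection.putheader() call."""
    captured = []
    original = http.client.HTTPConnection.putheader

    def recording_putheader(self, header, *values):
        captured.append((header, values))
        return original(self, header, *values)

    http.client.HTTPConnection.putheader = recording_putheader
    try:
        yield captured
    finally:
        # Always restore, so patches never stack up across iterations.
        http.client.HTTPConnection.putheader = original


for attempt in range(2):
    with capture_putheader() as seen:
        conn = http.client.HTTPConnection("example.invalid")
        conn.putrequest("GET", "/")  # only buffers the request; nothing is sent
        conn.putheader("X-Attempt", str(attempt))
    print(seen)
```

Each pass through the loop gets its own list, so headers from one request do not leak into the next.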

【Solution 4】:

Low-level solution:

import httplib

class HTTPConnection2(httplib.HTTPConnection):
    def __init__(self, *args, **kwargs):
        httplib.HTTPConnection.__init__(self, *args, **kwargs)
        self._request_headers = []
        self._request_header = None

    def putheader(self, header, value):
        self._request_headers.append((header, value))
        httplib.HTTPConnection.putheader(self, header, value)

    def send(self, s):
        self._request_header = s
        httplib.HTTPConnection.send(self, s)

    def getresponse(self, *args, **kwargs):
        response = httplib.HTTPConnection.getresponse(self, *args, **kwargs)
        response.request_headers = self._request_headers
        response.request_header = self._request_header
        return response

Example:

conn = HTTPConnection2("www.python.org")
conn.request("GET", "/index.html", headers={
    "User-agent": "test",
    "Referer": "/",
})
response = conn.getresponse()

response.status, response.reason:

1: 200 OK

response.request_headers:

[('Host', 'www.python.org'), ('Accept-Encoding', 'identity'), ('Referer', '/'), ('User-agent', 'test')]

response.request_header:

GET /index.html HTTP/1.1
Host: www.python.org
Accept-Encoding: identity
Referer: /
User-agent: test

【Discussion】:

【Solution 5】:

Another solution, which uses the idea from How do you get default headers in a urllib2 Request? but does not copy code from the std-lib:

import httplib
import urllib2

class HTTPConnection2(httplib.HTTPConnection):
    """
    Like httplib.HTTPConnection but stores the request headers.
    Used in HTTPConnection3(), see below.
    """
    def __init__(self, *args, **kwargs):
        httplib.HTTPConnection.__init__(self, *args, **kwargs)
        self.request_headers = []
        self.request_header = ""

    def putheader(self, header, value):
        self.request_headers.append((header, value))
        httplib.HTTPConnection.putheader(self, header, value)

    def send(self, s):
        self.request_header = s
        httplib.HTTPConnection.send(self, s)


class HTTPConnection3(object):
    """
    Wrapper around HTTPConnection2
    Used in HTTPHandler2(), see below.
    """
    def __call__(self, *args, **kwargs):
        """
        instance made in urllib2.HTTPHandler.do_open()
        """
        self._conn = HTTPConnection2(*args, **kwargs)
        self.request_headers = self._conn.request_headers
        self.request_header = self._conn.request_header
        return self

    def __getattribute__(self, name):
        """
        Redirect attribute access to the local HTTPConnection() instance.
        """
        if name == "_conn":
            return object.__getattribute__(self, name)
        else:
            return getattr(self._conn, name)


class HTTPHandler2(urllib2.HTTPHandler):
    """
    A HTTPHandler which stores the request headers.
    Used HTTPConnection3, see above.

    >>> opener = urllib2.build_opener(HTTPHandler2)
    >>> opener.addheaders = [("User-agent", "Python test")]
    >>> response = opener.open('http://www.python.org/')

    Get the request headers as a list build with HTTPConnection.putheader():
    >>> response.request_headers
    [('Accept-Encoding', 'identity'), ('Host', 'www.python.org'), ('Connection', 'close'), ('User-Agent', 'Python test')]

    >>> response.request_header
    'GET / HTTP/1.1\\r\\nAccept-Encoding: identity\\r\\nHost: www.python.org\\r\\nConnection: close\\r\\nUser-Agent: Python test\\r\\n\\r\\n'
    """
    def http_open(self, req):
        conn_instance = HTTPConnection3()
        response = self.do_open(conn_instance, req)
        response.request_headers = conn_instance.request_headers
        response.request_header = conn_instance.request_header
        return response

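EDIT: updated the source code

【Discussion】:

【Solution 6】:

See urllib2.py:do_request (line 1044 (1067)) and urllib2.py:do_open (line 1073); (line 293) self.addheaders = [('User-agent', client_version)] (only 'User-agent' is added)

【Discussion】:

【Solution 7】:

It looks to me like you are looking for the headers of the response object, which include Connection: close and so on. These headers live in the object returned by urlopen. Getting at them is easy enough: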

from urllib2 import urlopen
req = urlopen("http://www.google.com")
print req.headers.headers

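req.headers is an instance of httplib.HTTPMessage

【Discussion】:

Nope.. I am looking for the request headers, not the response headers.
Ah, then you either need to create your own handler for the HTTP request and dump it as in the example above or, if you are willing to tweak the stdlib, just drop a logging line into AbstractHTTPHandler.do_open that dumps the headers.
The variable should be spelled rep, since it is the reply and not the request, and you should use the documented .info() method rather than the undocumented headers attribute.

【Solution 8】:

It should send the default HTTP headers (as specified by w3.org) alongside the ones you specify. If you would like to see them in their entirety, you can use a tool like WireShark.

Edit:

If you would like to log them, you can use WinPcap to capture the packets sent by a specific application (in your case, python). You can also specify the type of packets and many other details.

-John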

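【Discussion】:

I need to log them from within my Python program, so WinPcap won't help me. Thanks though.
Yes, it will. Did you read what it is or how to use it? It is used alongside the wireshark program itself, which shows the analyzed output of the packets and is able to log them.
The packets contain the headers; I thought that was obvious. You can call/incorporate winpcap from within your application.
winpcap is for Windows. My application runs on all platforms. It is also far too much overhead. Thanks for the suggestion though.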