Adding a timestamp to a file downloaded with urllib.urlretrieve

【Question title】: Adding timestamp to file downloaded with urllib.urlretrieve
【Posted】: 2013-05-20 13:26:37
【Question】:

I am using urllib.urlretrieve to download files, and I would like to add a check for changes before downloading. I already have something like the following:

import urllib

urllib.urlretrieve("http://www.site1.com/file.txt", r"output/file1.txt")
urllib.urlretrieve("http://www.site2.com/file.txt", r"output/file2.txt")

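Ideally, I would like the script to check for changes (compare the last-modified timestamps?), skip the download if the file is unchanged, and download it if it is newer. I also need the script to add a timestamp to the filename.

Can anyone help?

I am new to programming (Python is my first language), so any criticism is welcome!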

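【Comments】:

Tags: python urllib

【Solution 1】:

Unfortunately, this seems hard to do in Python, because you have to do everything yourself. The interface of urlretrieve is not very good either.

The following code should perform the necessary steps (add an "If-Modified-Since" header if the file exists, and adjust the timestamp of the downloaded file):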

import calendar
import os
import time
import urllib.error
import urllib.request

def download_file(url, local_filename, status_callback=None):
    opener = urllib.request.build_opener()
    if os.path.isfile(local_filename):
        # Ask the server to send the body only if it is newer than our copy.
        timestamp = os.path.getmtime(local_filename)
        timestr = time.strftime('%a, %d %b %Y %H:%M:%S GMT', time.gmtime(timestamp))
        opener.addheaders.append(("If-Modified-Since", timestr))
    urllib.request.install_opener(opener)
    try:
        local_filename, headers = urllib.request.urlretrieve(url, local_filename, reporthook=status_callback)
        if 'Last-Modified' in headers:
            # Give the local file the server's modification time.
            mtime = calendar.timegm(time.strptime(headers['Last-Modified'], '%a, %d %b %Y %H:%M:%S GMT'))
            os.utime(local_filename, (mtime, mtime))
    except urllib.error.HTTPError as e:
        if e.code != 304:  # 304 Not Modified: keep the existing file
            raise
    finally:
        urllib.request.install_opener(urllib.request.build_opener())  # Reset opener
    return local_filename
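
The strftime/strptime format strings in the code above use %a and %b, which follow the current locale, so on a non-English locale the date header could come out malformed. A minimal sketch of a locale-independent alternative, assuming Python 3 and the standard-library email.utils module (the helper names to_http_date and from_http_date are made up for illustration and are not part of the answer):

```python
import email.utils


def to_http_date(timestamp):
    """Format a POSIX timestamp as an HTTP date string in GMT."""
    return email.utils.formatdate(timestamp, usegmt=True)


def from_http_date(value):
    """Parse an HTTP date string back into a POSIX timestamp."""
    return email.utils.parsedate_to_datetime(value).timestamp()
```

to_http_date could stand in for the time.strftime call that builds the If-Modified-Since value, and from_http_date for the calendar.timegm/time.strptime pair that reads Last-Modified.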

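【Discussion】:

Thanks, this is exactly what I was looking for. It is worth noting that you need to import urllib.request, time and calendar (all in the standard library), and possibly implement a status_callback function or remove that argument, for this code to work.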
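
As a rough illustration of the status_callback mentioned in the comment above, here is one possible sketch, assuming Python 3, where urlretrieve calls the hook with a block count, a block size and the total size. The helper format_progress is a hypothetical name introduced here so the message can be built separately from printing it:

```python
def format_progress(block_num, block_size, total_size):
    """Build a one-line progress message from urlretrieve's hook arguments."""
    received = block_num * block_size
    if total_size <= 0:
        # The server did not report a size (urlretrieve passes -1).
        return "%d bytes read" % received
    # The last block may be partial, so cap the count at the total.
    received = min(received, total_size)
    return "%d/%d bytes (%d%%)" % (received, total_size, received * 100 // total_size)


def status_callback(block_num, block_size, total_size):
    """Hypothetical reporthook: print the progress message."""
    print(format_progress(block_num, block_size, total_size))
```

It would be passed as reporthook=status_callback in the urlretrieve call.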

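【Solution 2】:

urllib.urlretrieve() already does this for you. If the output filename exists, it performs all the necessary checks to avoid downloading again.

But this only works if the server supports it. So you may want to print the HTTP headers (the second result of the function call) to see whether caching is possible.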
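
One way to read those headers, sketched under assumptions: a server has to send a validator (Last-Modified or ETag) before a client can make a conditional request at all. The function name can_revalidate is hypothetical, the code targets Python 3, and the Message object below is only a stand-in for the headers object that urlretrieve returns:

```python
from email.message import Message


def can_revalidate(headers):
    """Return True if the response carries a validator for conditional requests."""
    # Last-Modified pairs with If-Modified-Since; ETag pairs with If-None-Match.
    return headers.get("Last-Modified") is not None or headers.get("ETag") is not None


# Stand-in for the headers returned as the second result of urlretrieve.
sample = Message()
sample["Last-Modified"] = "Mon, 20 May 2013 13:26:37 GMT"
print(can_revalidate(sample))
```

If this reports False for a given server, no client-side check based on these headers can tell whether the file has changed.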

This article may also help: http://pymotw.com/2/urllib/

The code near its end:

import urllib
import os

def reporthook(blocks_read, block_size, total_size):
    if not blocks_read:
        print 'Connection opened'
        return
    if total_size < 0:
        # Unknown size
        print 'Read %d blocks' % blocks_read
    else:
        amount_read = blocks_read * block_size
        print 'Read %d blocks, or %d/%d' % (blocks_read, amount_read, total_size)
    return

try:
    filename, msg = urllib.urlretrieve('http://blog.doughellmann.com/', reporthook=reporthook)
    print
    print 'File:', filename
    print 'Headers:'
    print msg
    print 'File exists before cleanup:', os.path.exists(filename)

finally:
    urllib.urlcleanup()

    print 'File still exists:', os.path.exists(filename)

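This downloads a file, shows progress, and prints the headers. Use it to debug your scenario and find out why caching does not work as expected.

【Discussion】:

Hi Aaron, my urllib.urlretrieve call keeps overwriting the file even though the filename is the same. Is there something I need to do to enable this behaviour?
When you say "overwriting", do you see it downloading blocks?
Do you have any evidence that urlretrieve does this? The retrieve function in my /usr/lib/python2.7/urllib.py definitely does not. It never looks at the Last-Modified header and never stats the file to get its time, so it cannot send an If-Modified-Since header and cannot take advantage of a 304 response. The only header it uses is Content-Length, to confirm that the download matches the expected size. It just blindly opens the URL and writes to the file, regardless of whether it already exists. Confirmed by checking the web server logs as well as the code (in my case I control both sides).
My evidence is the documentation: "If the URL points to a local file [...] the object is not copied." (docs.python.org/2/library/urllib.html#urllib.urlretrieve) Maybe the documentation is wrong?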

【Solution 3】:

The simplest way to get a timestamp into the filename is:

import time
'output/file_%d.txt' % time.time()

Human-readable, this way:

from datetime import datetime
n = datetime.now()
n.strftime('output/file_%Y%m%d_%H%M%S.txt')
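
Since the question uses fixed names such as output/file1.txt, the stamp may need to go in front of the extension rather than into a hard-coded pattern. A small sketch of that, assuming Python 3; the function name timestamped_name is invented for illustration:

```python
import os
from datetime import datetime


def timestamped_name(path, when):
    """Insert a YYYYmmdd_HHMMSS stamp between the file stem and its extension."""
    stem, ext = os.path.splitext(path)
    return "%s_%s%s" % (stem, when.strftime("%Y%m%d_%H%M%S"), ext)


print(timestamped_name("output/file1.txt", datetime(2013, 5, 20, 13, 26, 37)))
# output/file1_20130520_132637.txt
```

Passing datetime.now() stamps the download time; a time parsed from the server's Last-Modified header could be passed instead if the stamp should reflect when the resource changed.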

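【Discussion】:

-1. The question is how to tell whether the resource on the server has changed.
My question was not entirely clear, but it did mention a timestamp in the filename.
This outputs epoch time; any idea how to make it a standard (human-readable) time/date?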