Here is a Python script that saves a given URL to a file and uses multiple threads to download it:
#!/usr/bin/env python
import sys
from functools import partial
from itertools import count, izip
from multiprocessing.dummy import Pool  # use threads
from urllib2 import HTTPError, Request, urlopen


def download_chunk(url, byterange):
    # byterange is a (first_byte, last_byte) tuple; both ends are inclusive
    req = Request(url, headers=dict(Range='bytes=%d-%d' % byterange))
    try:
        return urlopen(req).read()
    except HTTPError as e:
        return b'' if e.code == 416 else None  # treat range error as EOF
    except EnvironmentError:
        return None


def main():
    url, filename = sys.argv[1:]
    pool = Pool(4)  # define number of concurrent connections
    chunksize = 1 << 16
    # endless stream of (start, end) pairs: (0, 65535), (65536, 131071), ...
    ranges = izip(count(0, chunksize), count(chunksize - 1, chunksize))
    with open(filename, 'wb') as file:
        # imap yields results in request order, so chunks are written in sequence
        for s in pool.imap(partial(download_chunk, url), ranges):
            if not s:
                break  # error or EOF
            file.write(s)
            if len(s) != chunksize:
                break  # EOF (servers with no Range support end up here)


if __name__ == "__main__":
    main()
The end of the file is detected when the server returns an empty body, a 416 HTTP status code, or a response whose size is not exactly chunksize.
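To see the end-of-file rule in isolation, here is a rough sketch that replaces the network with an in-memory bytes object. It is written for Python 3 (so zip stands in for izip), and fake_download_chunk and assemble are made-up names for illustration, not part of the script. An empty slice plays the role of both the empty body and the 416 response.

```python
from itertools import count


def fake_download_chunk(data, byterange):
    """Hypothetical stand-in for download_chunk(): slice in-memory bytes."""
    start, end = byterange
    return data[start:end + 1]  # HTTP byte ranges are inclusive at both ends


def assemble(data, chunksize):
    """Rebuild data chunk by chunk, using the same stop conditions as main()."""
    ranges = zip(count(0, chunksize), count(chunksize - 1, chunksize))
    out = bytearray()
    for byterange in ranges:
        s = fake_download_chunk(data, byterange)
        if not s:
            break  # empty chunk: the range starts at or past the end of the data
        out += s
        if len(s) != chunksize:
            break  # short chunk: this was the last piece
    return bytes(out)
```

When the data length is an exact multiple of chunksize, the loop ends on the empty-chunk branch; otherwise it ends on the short-chunk branch.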
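It supports servers that do not understand the Range header (in that case everything is downloaded in a single request; to support large files, change download_chunk() to save to a temporary file and return the filename to be read in the main thread instead of the file content itself).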
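One possible shape for that temporary-file variant is sketched below. It is only an illustration under a few assumptions: it targets Python 3 (urllib.request instead of urllib2), the name download_chunk_to_file is invented here, and it signals a 416 response by returning the path of an empty file rather than b''.

```python
import os
import shutil
import tempfile
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def download_chunk_to_file(url, byterange):
    """Fetch one byte range into a temporary file and return its path.

    Returns None on error. A 416 response leaves the file empty, which
    mirrors the b'' convention of download_chunk().
    """
    req = Request(url, headers={'Range': 'bytes=%d-%d' % byterange})
    fd, path = tempfile.mkstemp(suffix='.part')
    try:
        with os.fdopen(fd, 'wb') as tmp:
            try:
                with urlopen(req) as response:
                    # stream to disk instead of holding the body in memory
                    shutil.copyfileobj(response, tmp)
            except HTTPError as e:
                if e.code != 416:
                    raise  # handled below as a generic error
        return path
    except OSError:  # HTTPError and URLError are OSError subclasses
        os.remove(path)
        return None
```

With this variant the main thread would compare os.path.getsize(path) against chunksize, copy the file into the output, and delete it afterwards.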
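It lets you change the number of concurrent connections (the pool size) and the number of bytes requested in a single HTTP request independently of each other.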
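The chunk size only affects the stream of byte ranges, which is independent of the pool. As a small Python 3 sketch (zip instead of izip; first_ranges is a hypothetical helper, not part of the script), this is what the first few ranges look like:

```python
from itertools import count, islice


def first_ranges(chunksize, n):
    """Return the first n inclusive (start, end) byte ranges for a chunk size."""
    ranges = zip(count(0, chunksize), count(chunksize - 1, chunksize))
    return list(islice(ranges, n))


print(first_ranges(1 << 16, 3))
# [(0, 65535), (65536, 131071), (131072, 196607)]
```

Changing chunksize changes these pairs, while the argument to Pool() only controls how many of them are in flight at once.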
To use multiple processes instead of threads, change the import:
from multiprocessing.pool import Pool # use processes (other code unchanged)