Run a function in parallel on Python Tornado: answers

【Question title】: run function parallel on python tornado
【Posted】: 2017-12-29 22:42:24
【Question description】:

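I am currently developing with python3 on the tornado framework (still a beginner), and I have a function that I would like to run in the background. More precisely, the function's task is to download a large file (chunk by chunk) and possibly do some more work after each chunk has been downloaded. The calling function should not wait for the download function to finish; it should continue executing.

Here is some example code: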

@gen.coroutine
def dosomethingfunc(self, env):
    print("Do something")

    self.downloadfunc(file_url, target_path) #I don't want to wait here

    print("Do something else")


@gen.coroutine
def downloadfunc(self, file_url, target_path):

    response = urllib.request.urlopen(file_url)
    CHUNK = 16 * 1024

    with open(target_path, 'wb') as f:
        while True:
            chunk = response.read(CHUNK)
            if not chunk:
                break
            f.write(chunk)
            time.sleep(0.1) #do something after a chunk is downloaded - sleep only as example

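I have read this answer on stackoverflow, https://stackoverflow.com/a/25083098/2492068, and tried to use it.

I thought that if I used @gen.coroutine but no yield, dosomethingfunc would continue without waiting for downloadfunc to finish. In practice the behavior is the same with or without yield: "Do something else" is only printed after downloadfunc has finished the download.

What am I missing here?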

【Question discussion】:

Tags: python parallel-processing tornado coroutine

【Solution 1】:

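To benefit from Tornado's asynchrony, a non-blocking function must be yielded at some point. Since all the code in downloadfunc is blocking, dosomethingfunc does not get control back until the called function has finished.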
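The difference can be seen in a small sketch. The names blocking_worker, cooperative_worker, caller and events are made up for this illustration and the sleeps stand in for real work. It records the order in which things happen when a coroutine is called without yield:

```python
import time

from tornado import gen
from tornado.ioloop import IOLoop

events = []

@gen.coroutine
def blocking_worker():
    # time.sleep never hands control back to the IOLoop,
    # so the whole body runs before the caller gets control again
    time.sleep(0.05)
    events.append("blocking worker done")

@gen.coroutine
def cooperative_worker():
    # gen.sleep returns a Future; yielding it suspends this coroutine
    # and lets the caller continue in the meantime
    yield gen.sleep(0.05)
    events.append("cooperative worker done")

@gen.coroutine
def caller(worker):
    future = worker()              # started, deliberately not yielded
    events.append("caller continues")
    yield future                   # demo only: wait so the loop can exit cleanly

def run(worker):
    events.clear()
    IOLoop.current().run_sync(lambda: caller(worker))
    return list(events)

blocking_order = run(blocking_worker)
cooperative_order = run(cooperative_worker)
print(blocking_order)
print(cooperative_order)
```

With the blocking worker, "caller continues" is recorded last; with the cooperative worker it is recorded first, which is the behavior the question asks for.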

Your code has a couple of problems:

time.sleep is blocking; use tornado.gen.sleep instead,
urllib's urlopen is blocking; use tornado.httpclient.AsyncHTTPClient

So downloadfunc could look like this:

import tornado.httpclient
from tornado import gen

@gen.coroutine
def downloadfunc(self, file_url, target_path):

    client = tornado.httpclient.AsyncHTTPClient()

    # the code below starts the download and gives control
    # back to the ioloop while waiting for data
    res = yield client.fetch(file_url)

    with open(target_path, 'wb') as f:
        # fetch() resolves to an HTTPResponse; the payload is in res.body
        f.write(res.body)
        yield gen.sleep(0.1)

To implement it with streaming (chunk by chunk) support, you may want to do it this way:

import functools

import tornado.httpclient
from tornado import gen

# for large files you must increase max_body_size,
# because the default body limit in Tornado is set to 100MB
tornado.httpclient.AsyncHTTPClient.configure(None, max_body_size=2*1024**3)

@gen.coroutine
def downloadfunc(self, file_url, target_path):

    client = tornado.httpclient.AsyncHTTPClient()

    # the streaming_callback is called with each received portion of data;
    # partial() binds target_path so the callback knows where to write
    yield client.fetch(
        file_url,
        streaming_callback=functools.partial(write_chunk, target_path))

def write_chunk(target_path, chunk):
    # note the "a" mode, to append to the file
    with open(target_path, 'ab') as f:
        print('chunk %s' % len(chunk))
        f.write(chunk)

Now you can call it in dosomethingfunc without yield, and the rest of the function will proceed.
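A minimal sketch of such a fire-and-forget call follows. It assumes a stand-in downloadfunc that only simulates three chunks, and the URL, the target path and the log list are placeholders for this illustration. It uses IOLoop.spawn_callback instead of a bare call, so that the IOLoop takes responsibility for the coroutine and logs an exception if it fails:

```python
from tornado import gen
from tornado.ioloop import IOLoop

log = []

@gen.coroutine
def downloadfunc(file_url, target_path):
    # Stand-in for the real download: three simulated chunks
    for i in range(3):
        yield gen.sleep(0.01)
        log.append("chunk %d" % i)

@gen.coroutine
def dosomethingfunc():
    log.append("Do something")
    # Fire and forget: schedule the coroutine on the IOLoop, do not wait for it
    IOLoop.current().spawn_callback(
        downloadfunc, "http://example.com/file", "/tmp/file")
    log.append("Do something else")
    # Demo only: keep the loop running long enough for the download to finish
    yield gen.sleep(0.2)

IOLoop.current().run_sync(dosomethingfunc)
print(log)
```

In a real handler the final gen.sleep is not needed, because the application's IOLoop keeps running after the handler returns.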

EDIT

Modifying the chunk size is not (publicly) supported, on either the server or the client side. You can also have a look at https://groups.google.com/forum/#!topic/python-tornado/K8zerl1JB5o
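If fixed-size pieces are needed anyway, one possible workaround is to regroup the data on the receiving side. The sketch below is not part of Tornado: Rechunker is a hypothetical helper whose feed method could be passed as streaming_callback, with flush called once after fetch() has completed.

```python
class Rechunker:
    """Regroup arbitrarily sized pieces into fixed-size chunks (hypothetical helper)."""

    def __init__(self, chunk_size, on_chunk):
        self.chunk_size = chunk_size
        self.on_chunk = on_chunk      # called with each full chunk
        self.buffer = bytearray()

    def feed(self, data):
        # Collect incoming data and emit every complete chunk
        self.buffer.extend(data)
        while len(self.buffer) >= self.chunk_size:
            self.on_chunk(bytes(self.buffer[:self.chunk_size]))
            del self.buffer[:self.chunk_size]

    def flush(self):
        # Emit whatever is left once the download has finished
        if self.buffer:
            self.on_chunk(bytes(self.buffer))
            self.buffer.clear()

chunks = []
rechunker = Rechunker(4, chunks.append)
for piece in (b"ab", b"cdefg", b"hij"):   # simulated network pieces
    rechunker.feed(piece)
rechunker.flush()
print(chunks)
```

This only changes how the received data is grouped before it is handled; the size of the pieces Tornado reads from the socket stays the same.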

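【Discussion】:

Thanks a lot for your example. I was not aware that the whole function was blocking, but it certainly makes sense. Unfortunately, I tried to implement your solution but I get an exception saying the "Content-Length" is too long (my file is about 1.5GB). With a chunked download that should not be a problem, though. Any idea why I get this message?
Tornado's body limit is set to 100MB; you can override it with max_body_size, e.g. AsyncHTTPClient(max_body_size=2000000000)
I tried using AsyncHTTPClient(max_body_size=2000000000) but it had no effect for me; the body size stayed the same. If I try the same with SimpleAsyncHTTPClient it works, but how do I define the chunk size? I can see in my network monitor that it starts downloading, but after 20 seconds (probably the default timeout) it just times out without writing a chunk. Is there any way to define the chunk size?
OK, I found out that the @gen.coroutine annotation is necessary for write_chunk. Not feasible :-) ... It would still be interesting to know whether the chunk size can be defined. Also, I cannot limit the download/write speed with gen.sleep, because it has no effect on the asynchronous call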