aiohttp.TCPConnector (with limit argument) vs asyncio.Semaphore for limiting the number of concurrent connections

【Question title】: aiohttp.TCPConnector (with limit argument) vs asyncio.Semaphore for limiting the number of concurrent connections
【Posted】: 2017-08-18 13:03:48
【Question】:

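I would like to familiarize myself with the new Python async/await syntax, and more specifically the asyncio module, by writing a simple script that downloads multiple resources at once.

But now I'm stuck.

While researching, I came across two options for limiting the number of concurrent requests:

1. Passing an aiohttp.TCPConnector (with the limit argument) to an aiohttp.ClientSession, or
2. Using an asyncio.Semaphore.

Is there a preferred option, or can they be used interchangeably if all you want is to limit the number of concurrent connections? Are they (roughly) equal in terms of performance?

Also, both seem to have a default value of 100 concurrent connections/operations. If I use only a Semaphore with a limit of, say, 500, will the aiohttp internals implicitly lock me down to 100 concurrent connections?

This is all very new and unclear to me. Please feel free to point out any misunderstandings on my part or flaws in my code.

Here is my current code, containing both options (which one should I remove?):

Bonus questions:

1. How do I handle (preferably retry x times) coros that raised an error?
2. What is the best way to save the returned data (inform my DataHandler) as soon as a coro is finished? I don't want everything to be saved at the end, because I could start working with the results as soon as they arrive.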

import asyncio
from tqdm import tqdm
import uvloop
from aiohttp import ClientSession, TCPConnector, BasicAuth

# You can ignore this class
class DummyDataHandler(DataHandler):
    """Takes data and stores it somewhere"""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def take(self, origin_url, data):
        return True

    def done(self):
        return None

class AsyncDownloader(object):
    def __init__(self, concurrent_connections=100, silent=False, data_handler=None, loop_policy=None):

        self.concurrent_connections = concurrent_connections
        self.silent = silent

        self.data_handler = data_handler or DummyDataHandler()

        self.sending_bar = None
        self.receiving_bar = None

        asyncio.set_event_loop_policy(loop_policy or uvloop.EventLoopPolicy())
        self.loop = asyncio.get_event_loop()
        self.semaphore = asyncio.Semaphore(concurrent_connections)

    async def fetch(self, session, url):
        # This is option 1: The semaphore, limiting the number of concurrent coros,
        # thereby limiting the number of concurrent requests.
        async with self.semaphore:
            async with session.get(url) as response:
                # Bonus Question 1: What is the best way to retry a request that failed?
                resp_task = asyncio.ensure_future(response.read())
                self.sending_bar.update(1)
                resp = await resp_task

                # No explicit release needed: leaving the `async with` block releases the connection.
                if not self.silent:
                    self.receiving_bar.update(1)
                return resp

    async def batch_download(self, urls, auth=None):
        # This is option 2: Limiting the number of open connections directly via the TCPConnector
        conn = TCPConnector(limit=self.concurrent_connections, keepalive_timeout=60)
        async with ClientSession(connector=conn, auth=auth) as session:
            await asyncio.gather(*[asyncio.ensure_future(self.download_and_save(session, url)) for url in urls])

    async def download_and_save(self, session, url):
        content_task = asyncio.ensure_future(self.fetch(session, url))
        content = await content_task
        # Bonus Question 2: This is blocking, I know. Should this be wrapped in another coro
        # or should I use something like asyncio.as_completed in the download function?
        self.data_handler.take(origin_url=url, data=content)

    def download(self, urls, auth=None):
        if isinstance(auth, tuple):
            auth = BasicAuth(*auth)
        print('Running on concurrency level {}'.format(self.concurrent_connections))
        self.sending_bar = tqdm(urls, total=len(urls), desc='Sent    ', unit='requests')
        self.sending_bar.update(0)

        self.receiving_bar = tqdm(urls, total=len(urls), desc='Received', unit='requests')
        self.receiving_bar.update(0)

        tasks = self.batch_download(urls, auth)
        self.loop.run_until_complete(tasks)
        return self.data_handler.done()


### call like so ###

URL_PATTERN = 'https://www.example.com/{}.html'

def gen_url(lower=0, upper=None):
    for i in range(lower, upper):
        yield URL_PATTERN.format(i)   

ad = AsyncDownloader(concurrent_connections=30)
data = ad.download([g for g in gen_url(upper=1000)])

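【Comments】:

I have the same question; it seems they can be used interchangeably: stackoverflow.com/questions/35196974/…
The internal counter of the asyncio.Semaphore class defaults to only 1 (see the asyncio Synchronisation Primitives documentation). It can be raised to a higher value as needed; however, your operating system still limits the number of files that can be open at the same time (TCP connections are files on *nix-like systems, including macOS).
For bonus question 2, look into the producer-consumer design pattern in software architecture.
Generally I prefer to see the minimum code needed to describe the problem, but I just discovered tqdm here. No more need for my hand-rolled ASCII spinner, thanks!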
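
The producer-consumer suggestion for bonus question 2 could look roughly like the sketch below. It is only an illustration under stated assumptions: `fetch` is a stand-in that fakes a download, and `producer`, `consumer`, `main`, and the `handled` list are hypothetical names, not part of the question's code. Each download puts its result on an asyncio.Queue the moment it finishes, and a single consumer processes results while the remaining downloads are still in flight.

```python
import asyncio


async def fetch(url):
    # Stand-in for the real download (hypothetical); returns fake bytes.
    await asyncio.sleep(0.01)
    return url.encode()


async def producer(url, queue):
    # Download one URL and hand the result over as soon as it is ready.
    data = await fetch(url)
    await queue.put((url, data))


async def consumer(queue, handled):
    # Process results one by one while other downloads are still running.
    while True:
        url, data = await queue.get()
        handled.append((url, len(data)))  # a real handler call would go here
        queue.task_done()


async def main(urls):
    queue = asyncio.Queue()
    handled = []
    worker = asyncio.create_task(consumer(queue, handled))
    await asyncio.gather(*(producer(u, queue) for u in urls))
    await queue.join()  # wait until every queued result has been processed
    worker.cancel()     # the consumer loops forever, so stop it explicitly
    await asyncio.gather(worker, return_exceptions=True)
    return handled
```

In the question's code, the line that appends to `handled` is where `self.data_handler.take(origin_url=url, data=data)` would be called.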

Tags: python async-await python-3.5 python-asyncio aiohttp

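【Solution 1】:

Is there a preferred option?

Yes, see below:

Will the aiohttp internals implicitly lock me down to 100 concurrent connections?

Yes, the default limit of 100 will lock you down unless you specify another limit on the connector (for example TCPConnector(limit=500)). You can see it in the source code here: https://github.com/aio-libs/aiohttp/blob/master/aiohttp/connector.py#L1084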
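
The effect of stacking two limits can be checked without any network access. The sketch below is a simulation, not aiohttp itself: a second asyncio.Semaphore (`pool`) stands in for the connector's connection limit, and `request`, `peak_concurrency`, and `stats` are hypothetical names. It records how many simulated requests are in flight at once.

```python
import asyncio


async def request(outer, pool, stats):
    # `outer` plays the role of the user's semaphore, `pool` the connector limit.
    async with outer:
        async with pool:
            stats["active"] += 1
            stats["peak"] = max(stats["peak"], stats["active"])
            await asyncio.sleep(0.01)  # simulated network round trip
            stats["active"] -= 1


async def peak_concurrency(outer_limit, pool_limit, n_requests):
    outer = asyncio.Semaphore(outer_limit)
    pool = asyncio.Semaphore(pool_limit)
    stats = {"active": 0, "peak": 0}
    await asyncio.gather(*(request(outer, pool, stats) for _ in range(n_requests)))
    return stats["peak"]
```

With an outer limit of 50 and an inner limit of 10, `asyncio.run(peak_concurrency(50, 10, 200))` evaluates to 10: the smaller of the two limits is the one that takes effect.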

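Are they (roughly) equal in terms of performance?

No, although the difference should be negligible. aiohttp.TCPConnector checks for an available connection anyway, whether or not the call is wrapped in a Semaphore, so using a Semaphore here is just unnecessary overhead.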
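
A minimal sketch of the connector-only approach, with the Semaphore dropped, might look like this. The function names `fetch` and `download_all` and the default of 30 are assumptions chosen for illustration; the aiohttp calls themselves (TCPConnector(limit=...), ClientSession(connector=...), session.get, response.read) are the ones already used in the question.

```python
import asyncio

import aiohttp


async def fetch(session, url):
    # The connector hands out at most `limit` connections; extra requests wait here.
    async with session.get(url) as response:
        return await response.read()


async def download_all(urls, limit=30):
    # The connector is the only concurrency limit in this sketch.
    connector = aiohttp.TCPConnector(limit=limit)
    async with aiohttp.ClientSession(connector=connector) as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))
```

It would be called as `asyncio.run(download_all(urls, limit=30))`, which returns the response bodies in the same order as `urls`.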

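How do I handle (preferably retry x times) coros that raised an error?

I don't believe there is a standard way to do this, but one solution would be to wrap your calls in a function like this: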

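import aiohttp

# `session` and `url` stand in for the arguments elided in the original sketch.
async def retry_requests(session, url, tries=5):
    # Try the request up to `tries` times and return the first response obtained.
    for _ in range(tries):
        try:
            return await session.get(url)
        except aiohttp.ClientResponseError:
            # HTTP error statuses only raise this if raise_for_status is enabled.
            pass
    # Every attempt failed: fall through and return None.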

【Comments】:

【Solution 2】:

How do I handle (preferably retry x times) coros that raised an error?

I created a Python decorator to handle it:

    import logging
    import time
    from functools import wraps

    def retry(exceptions, tries=3, delay=2, backoff=2):
        """
        Retry calling the decorated function using an exponential backoff. This
        is required in case of requesting Braze API produces any exceptions.

        Args:
            exceptions: The exception to check. may be a tuple of
                exceptions to check.
            tries: Number of times to try (not retry) before giving up.
            delay: Initial delay between retries in seconds.
            backoff: Backoff multiplier (e.g. value of 2 will double the delay
                each retry).
        """

        def deco_retry(func):
            @wraps(func)
            def f_retry(*args, **kwargs):
                mtries, mdelay = tries, delay
                while mtries > 1:
                    try:
                        return func(*args, **kwargs)
                    except exceptions as e:
                        msg = '{}, Retrying in {} seconds...'.format(e, mdelay)
                        if logging:
                            logging.warning(msg)
                        else:
                            print(msg)
                        time.sleep(mdelay)
                        mtries -= 1
                        mdelay *= backoff
                return func(*args, **kwargs)

            return f_retry

        return deco_retry
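
Note that the decorator above wraps a regular function and pauses with time.sleep. Applied to an `async def` function, `func(*args, **kwargs)` only creates a coroutine object, so exceptions raised while it is awaited are not caught, and time.sleep would block the event loop. A coroutine-friendly variant could be sketched as follows; `async_retry` is a hypothetical name and this is an adaptation, not the answer author's code.

```python
import asyncio
import logging
from functools import wraps


def async_retry(exceptions, tries=3, delay=2, backoff=2):
    """Retry an `async def` function with exponential backoff (hypothetical helper)."""

    def deco_retry(func):
        @wraps(func)
        async def f_retry(*args, **kwargs):
            mdelay = delay
            for _ in range(tries - 1):
                try:
                    return await func(*args, **kwargs)
                except exceptions as e:
                    logging.warning('%s, retrying in %s seconds...', e, mdelay)
                    await asyncio.sleep(mdelay)  # yields to the event loop
                    mdelay *= backoff
            # Last attempt: let any exception propagate to the caller.
            return await func(*args, **kwargs)

        return f_retry

    return deco_retry
```

A coroutine such as the question's `fetch` could then be decorated with, for example, `@async_retry(aiohttp.ClientError, tries=3)`, so that a failed request is awaited again after the delay without stalling the other downloads.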

【Comments】: