asyncio.gather not waiting long enough for all tasks to complete

【Question title】: asyncio.gather not waiting long enough for all tasks to complete
【Posted】: 2020-08-25 01:14:56
【Question】:

I am writing code that uses asyncio, aiohttp and BeautifulSoup to collect image links from a list of input URLs.

Here is a snippet of the relevant code:

import asyncio

import aiohttp
import bs4


def async_get_jpg_links(links):
    def extractLinks(ep_num, html):
        # Parse only the <article> element and collect every image's data-src.
        soup = bs4.BeautifulSoup(html, 'lxml',
            parse_only = bs4.SoupStrainer('article'))
        main = soup.findChildren('img')
        return ep_num, [img_link.get('data-src') for img_link in main]

    async def get_htmllinks(session, ep_num, ep_link):
        # Download one page; the HTTP status is not checked here.
        async with session.get(ep_link) as response:
            html_txt = await response.text()
        return extractLinks(ep_num, html_txt)

    async def get_jpg_links(ep_links):
        # One shared session; all pages are requested concurrently.
        async with aiohttp.ClientSession() as session:
            tasks = [get_htmllinks(session, num, link)
                    for num, link in enumerate(ep_links, 1)]
            return await asyncio.gather(*tasks)

    loop = asyncio.get_event_loop()
    return loop.run_until_complete(get_jpg_links(links))

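Later I call jpgs_links = dict(async_get_jpg_links(hrefs)), where hrefs is a list of links (about 170 of them).

jpgs_links should be a dictionary with numeric keys and lists as values. Some of the values come back as empty lists when they should contain data. When I reduce the number of links in hrefs, more of the lists come back full.

I re-ran the same code about a minute apart, and as the screenshots showed, each run returned a different set of empty lists and a different set of full ones.

Could it be that asyncio.gather is not waiting for all the tasks to complete?

How can I stop asyncio from returning empty lists while keeping the number of links in hrefs high?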

【Question comments】:

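Are you sure the URLs are supplied in the same order every time? How is hrefs populated?
Have you tried adding prints to check which lists are empty, and what text contains in those cases? Maybe you are getting incomplete HTML, or HTML that BeautifulSoup cannot handle for some reason. It is unlikely that gather is failing to wait for all the tasks, and if it were, you would not get an empty list; you would get None or an exception. Barring a bug in asyncio, the only reason gather would not wait for all the tasks is that one of them raised an exception, but that exception would then propagate through get_jpg_links to the run_until_complete call.
I added if not main: print(f'{ep_num} has empty main.') to extractLinks, and a different ep_num is printed on every run. So I think soup.find_all('img') is what returns the empty list. Do you know how to fix this?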
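The point made in the comments about gather can be checked in isolation. Below is a minimal sketch, not taken from the question: ok and boom are hypothetical coroutines that stand in for the real requests, and asyncio.run is used only to keep the example short.

```python
import asyncio


async def ok(n):
    # Hypothetical task that finishes normally after a short delay.
    await asyncio.sleep(0.01 * n)
    return n


async def boom():
    # Hypothetical task that fails, e.g. a request that raises.
    await asyncio.sleep(0.01)
    raise RuntimeError("simulated failure")


async def all_succeed():
    # gather returns only after every awaitable has finished,
    # and the results keep the order of the arguments.
    return await asyncio.gather(ok(3), ok(1), ok(2))


async def one_fails():
    # With the default return_exceptions=False, the first exception
    # is raised out of the await instead of yielding partial results.
    try:
        await asyncio.gather(ok(1), boom())
    except RuntimeError as exc:
        return f"raised: {exc}"
    return "no exception"


results = asyncio.run(all_succeed())
outcome = asyncio.run(one_fails())
print(results)   # [3, 1, 2]
print(outcome)   # raised: simulated failure
```

In other words, a task that silently returns an empty list has completed as far as gather is concerned; gather has no way to know the list should have had content.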

Tags: python web-scraping beautifulsoup python-asyncio

【Solution 1】:

So, it turned out that some of the URLs I was requesting were raising an error:

raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 504, message='Gateway Time-out',...

So I changed

async def get_htmllinks(session, ep_num, ep_link):
    async with session.get(ep_link) as response:
        html_txt = await response.text()
    return extractLinks(ep_num, html_txt)

to

async def get_htmllinks(session, ep_num, ep_link):
    html_txt = None
    while not html_txt:
        try:
            async with session.get(ep_link) as response:
                response.raise_for_status()
                html_txt = await response.text()
        except aiohttp.ClientResponseError:
            await asyncio.sleep(1)
    return extractLinks(ep_num, html_txt)

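What this does is retry the request after sleeping for one second (that is the job of await asyncio.sleep(1)), and it keeps retrying until the response body is non-empty.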
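The loop above retries forever, so a URL that never recovers would keep the whole gather call waiting. One possible variation, sketched here under assumptions, caps the number of attempts and lengthens the pause each time. The names retry, TransientError and flaky_fetch are hypothetical and not part of the answer; TransientError stands in for aiohttp.ClientResponseError so the sketch runs without a network.

```python
import asyncio


class TransientError(Exception):
    """Hypothetical stand-in for aiohttp.ClientResponseError."""


async def retry(make_call, attempts=5, base_delay=0.01):
    # Try make_call() up to `attempts` times, doubling the pause each time.
    for attempt in range(attempts):
        try:
            return await make_call()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            await asyncio.sleep(base_delay * 2 ** attempt)


calls = {"count": 0}


async def flaky_fetch():
    # Hypothetical fetch that fails twice, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("504 Gateway Time-out")
    return "<article><img data-src='a.jpg'></article>"


html = asyncio.run(retry(flaky_fetch))
print(calls["count"], html)
```

In the answer's code, the callable passed to such a helper would be the session.get / raise_for_status / response.text() sequence, with aiohttp.ClientResponseError as the exception to catch and a longer base delay.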

So the problem apparently had nothing to do with asyncio or BeautifulSoup.

【Comments】:

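I don't understand how you got the dictionary at all, though. The exception should have propagated and been raised from run_until_complete.
Oops. I forgot to mention that the error was only raised once I added the line response.raise_for_status() to the code. That is why the dict could be created before.
That makes sense. So the HTML you were getting before adding raise_for_status was probably just an error template in which BeautifulSoup found nothing, and you ended up with an empty list.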
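The explanation in the last comment can be illustrated with a small sketch. It deliberately uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages; ArticleImages, extract and both sample pages are made up for illustration and only mimic what extractLinks does.

```python
from html.parser import HTMLParser


class ArticleImages(HTMLParser):
    """Collects data-src of <img> tags found inside <article>."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <article> elements we are inside
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1
        elif tag == "img" and self.depth > 0:
            self.links.append(dict(attrs).get("data-src"))

    def handle_endtag(self, tag):
        if tag == "article" and self.depth > 0:
            self.depth -= 1


def extract(html):
    parser = ArticleImages()
    parser.feed(html)
    return parser.links


# Made-up sample pages: a normal page and a gateway error template.
good_page = '<article><img data-src="1.jpg"><img data-src="2.jpg"></article>'
error_page = "<html><body><h1>504 Gateway Time-out</h1></body></html>"

good_links = extract(good_page)
error_links = extract(error_page)
print(good_links)    # ['1.jpg', '2.jpg']
print(error_links)   # []
```

An error page parses without complaint and simply contains no matching images, which is why the failure showed up as empty lists rather than as an exception until raise_for_status was added.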