Script performs very slowly even when it runs asynchronously

【Question title】: Script performs very slowly even when it runs asynchronously
【Posted】: 2018-12-11 07:00:34
【Question】:

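I've written a script using asyncio together with the aiohttp library to parse the content of a website asynchronously. I've tried to apply the logic in the following script the way it is usually applied in scrapy.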

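However, when I execute my script, it behaves the way synchronous libraries such as requests or urllib.request do. As a result it is very slow and doesn't serve the purpose.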

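I know I can get around this by defining all the next-page links within the link variable. But am I not already doing the task the right way with my existing script?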

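Within the script, the processing_docs() function collects all the links to the different posts and passes the refined links to the fetch_again() function, which fetches the title from each target page. processing_docs() also contains logic that collects the next_page link and supplies it to the fetch() function to repeat the same steps. This next_page call is making the script slower, whereas we usually do the same in scrapy and get the expected performance.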

My question is: how can I achieve the same result while keeping the existing logic intact?

import aiohttp
import asyncio
from lxml.html import fromstring
from urllib.parse import urljoin

link = "https://stackoverflow.com/questions/tagged/web-scraping"

async def fetch(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            text = await response.text()
            result = await processing_docs(session, text)
        return result

async def processing_docs(session, html):
        tree = fromstring(html)
        titles = [urljoin(link,title.attrib['href']) for title in tree.cssselect(".summary .question-hyperlink")]
        for title in titles:
            await fetch_again(session,title)

        next_page = tree.cssselect("div.pager a[rel='next']")
        if next_page:
            page_link = urljoin(link,next_page[0].attrib['href'])
            await fetch(page_link)

async def fetch_again(session,url):
    async with session.get(url) as response:
        text = await response.text()
        tree = fromstring(text)
        title = tree.cssselect("h1[itemprop='name'] a")[0].text
        print(title)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.gather(*(fetch(url) for url in [link])))
    loop.close()

【Question comments】:

Tags: python python-3.x web-scraping python-asyncio aiohttp

【Solution 1】:

The whole point of using asyncio is that you can run multiple fetches concurrently (overlapping with each other). Let's look at your code:

for title in titles:
    await fetch_again(session,title)

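This part means that each new fetch_again starts only after the previous one has been awaited (finished). If you do things this way then, yes, there is no difference from using a synchronous approach.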

To get the full benefit of asyncio, start multiple fetches concurrently using asyncio.gather:

await asyncio.gather(*[
    fetch_again(session,title)
    for title 
    in titles
])

You'll see a significant speedup.
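
To see the difference without touching the network, here is a minimal sketch that replaces the HTTP request with asyncio.sleep. The names fake_fetch, run_sequential and run_concurrent and the 0.1-second delay are illustrative assumptions, not part of the original script.

```python
import asyncio
import time

DELAY = 0.1  # assumed stand-in for the latency of one HTTP request


async def fake_fetch(n):
    # Hypothetical replacement for fetch_again(): sleeps instead of doing I/O.
    await asyncio.sleep(DELAY)
    return n


async def run_sequential(count):
    # Each await finishes before the next coroutine is even created.
    start = time.perf_counter()
    results = [await fake_fetch(n) for n in range(count)]
    return results, time.perf_counter() - start


async def run_concurrent(count):
    # gather() schedules all coroutines at once and waits for all of them.
    start = time.perf_counter()
    results = await asyncio.gather(*(fake_fetch(n) for n in range(count)))
    return list(results), time.perf_counter() - start


async def main():
    seq_results, seq_time = await run_sequential(5)
    con_results, con_time = await run_concurrent(5)
    print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
    return seq_results, seq_time, con_results, con_time


if __name__ == "__main__":
    asyncio.run(main())
```

With five simulated requests, the sequential version should take roughly five times as long as the concurrent one, since the sleeps overlap only in the second case.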

You can go further and start the fetch for the next page concurrently with the fetch_again calls for the titles:

async def processing_docs(session, html):
        coros = []

        tree = fromstring(html)

        # titles:
        titles = [
            urljoin(link,title.attrib['href']) 
            for title 
            in tree.cssselect(".summary .question-hyperlink")
        ]

        for title in titles:
            coros.append(
                fetch_again(session,title)
            )

        # next_page:
        next_page = tree.cssselect("div.pager a[rel='next']")
        if next_page:
            page_link = urljoin(link,next_page[0].attrib['href'])

            coros.append(
                fetch(page_link)
            )

        # await:
        await asyncio.gather(*coros)
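
The same shape can be tried out offline. The sketch below is an assumption-laden stand-in: PAGES is a made-up in-memory site, and fetch_post, fetch_page and process_page are hypothetical counterparts of fetch_again, fetch and processing_docs that sleep instead of making requests.

```python
import asyncio

# Hypothetical in-memory "site": each page lists post ids and the next page.
PAGES = {
    "page1": {"posts": ["a", "b"], "next": "page2"},
    "page2": {"posts": ["c", "d"], "next": "page3"},
    "page3": {"posts": ["e"], "next": None},
}


async def fetch_post(post):
    # Stand-in for fetch_again(): pretend to download one post.
    await asyncio.sleep(0.01)
    return [post]


async def fetch_page(name):
    # Stand-in for fetch(): pretend to download a listing page, then process it.
    await asyncio.sleep(0.01)
    return await process_page(PAGES[name])


async def process_page(page):
    # Same shape as processing_docs(): posts and the next page run together.
    coros = [fetch_post(post) for post in page["posts"]]
    if page["next"]:
        coros.append(fetch_page(page["next"]))
    # gather() returns results in the order the coroutines were passed in.
    nested = await asyncio.gather(*coros)
    return [item for sub in nested for item in sub]


if __name__ == "__main__":
    print(asyncio.run(fetch_page("page1")))
```

Because the next page is gathered alongside the posts of the current page, the crawl of page2 begins while the posts from page1 are still being fetched.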

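Important note

While this approach lets you get the job done much faster, you may want to limit the number of concurrent requests at any one time, to avoid heavy resource usage on both your machine and the server.

You can use asyncio.Semaphore for this purpose: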

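semaphore = asyncio.Semaphore(10)

async def fetch(url):
    async with aiohttp.ClientSession() as session:
        # Hold the semaphore only for the request itself. Holding it while
        # processing_docs() recurses into fetch() for the next page would
        # deadlock once ten nested fetches are waiting on each other.
        async with semaphore:
            async with session.get(url) as response:
                text = await response.text()
        # fetch_again() needs the same `async with semaphore:` around its
        # request if the title fetches should be limited too.
        result = await processing_docs(session, text)
        return result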

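【Comments】:

Perfect!!! Thanks a lot for your invaluable suggestion and solution, @Mikhail Gerasimov.
@asmitu You're welcome! Just don't forget to use the semaphore, otherwise the site may eventually ban your crawler :)
When I try the script with the semaphore defined, it gets stuck somewhere in the process and stays there. I also tried BoundedSemaphore but got the same behavior. Could you tell me what the issue might be, @Mikhail Gerasimov? Thanks again.
@asmitu Could you change the code to something like session.get(url, timeout=5) (in both places) and check whether you get timeout errors? It may be related not to the semaphore but to SO's throttling of crawlers. I guess that at some point the site simply starts delivering pages with a greater delay. If you want to get SO data, consider using their API to avoid such situations.
@asmitu It seems fine to me :) One more thing should probably be added: a timeout. The default timeout in aiohttp is 5 minutes, and you probably want to get an exception sooner when something fails.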
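
The fail-fast idea from the comments can be sketched without aiohttp by using asyncio.wait_for from the standard library. This is a generic asyncio technique swapped in for illustration, not aiohttp's own timeout handling; slow_request and fetch_with_timeout are hypothetical names, and the delays are arbitrary.

```python
import asyncio


async def slow_request():
    # Hypothetical stand-in for a page that the server delivers slowly.
    await asyncio.sleep(0.2)
    return "page body"


async def fetch_with_timeout(timeout):
    # wait_for() cancels the inner coroutine and raises asyncio.TimeoutError
    # if it does not finish within `timeout` seconds.
    try:
        return await asyncio.wait_for(slow_request(), timeout=timeout)
    except asyncio.TimeoutError:
        # Report the failure quickly instead of hanging on a stalled request.
        return None


if __name__ == "__main__":
    print(asyncio.run(fetch_with_timeout(0.05)))
```

A short timeout makes a stalled request show up as an exception, which is easier to diagnose than a crawler that silently stops making progress.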