Web scraping. How to make it faster? Answers

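【Question title】: Web-scraping. How to make it faster?
【Posted】: 2018-10-10 20:24:56
【Question description】:

I need to extract some attributes from web pages (in my example only one: the text description of an app). The problem is time! With the following code, opening a page, extracting part of the HTML and saving it takes about 1.2-1.8 seconds per page. That is a lot of time. Is there a way to make it faster? I have a lot of pages; x can be as large as 200000. I am using Jupyter.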

    Description=[]
    for x in range(len(M)):
        response = http.request('GET',M[x] )
        soup = BeautifulSoup(response.data,"lxml")
        t=str(soup.find("div",attrs={"class":"section__description"}))
        Description.append(t)
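For scale, here is a back-of-the-envelope estimate built only from the figures quoted above (1.2 to 1.8 seconds per page, up to 200000 pages). It assumes the requests run strictly one after another, as in the loop; the names `PAGES`, `SECONDS_PER_PAGE` and `total_hours` are made up for this sketch.

```python
# Rough total running time for a strictly sequential loop.
PAGES = 200_000                 # upper bound on the number of pages from the question
SECONDS_PER_PAGE = (1.2, 1.8)   # measured range per page from the question


def total_hours(pages, seconds_per_page):
    """Convert pages * seconds-per-page into hours."""
    return pages * seconds_per_page / 3600


low, high = (total_hours(PAGES, s) for s in SECONDS_PER_PAGE)
print(f"{low:.1f} to {high:.1f} hours")  # prints "66.7 to 100.0 hours"
```

So a sequential run over the full list would take roughly three to four days.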

Thanks.

【Question discussion】:

You could take a look at multiprocessing.
Is M a list of URLs?
Matt Cremeens, yes, it is.
Use Scrapy for faster scraping!

Tags: python-3.x web-scraping beautifulsoup jupyter webpage

【Solution 1】:

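You should consider inspecting the page. If the pages rely on a REST API, you can scrape them by fetching the content you need directly from the API. This is much more efficient than extracting the content from the HTML. To do that, take a look at the Requests library for Python.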
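A minimal sketch of that idea, assuming the site exposes a JSON endpoint that returns the description under a `description` key. The endpoint URL, the key name and the helper `fetch_description` are all hypothetical; whether such an API exists has to be checked in the browser's network inspector first.

```python
def fetch_description(session, api_url, field="description"):
    """Return one field from a JSON API response.

    `session` is expected to behave like requests.Session: it needs a
    .get(url, timeout=...) method whose result offers .raise_for_status()
    and .json(). The field name "description" is an assumption.
    """
    response = session.get(api_url, timeout=10)
    response.raise_for_status()   # fail loudly on 4xx/5xx instead of parsing an error page
    return response.json()[field]


# Usage with the Requests library (the endpoint below is hypothetical):
#
#   import requests
#   with requests.Session() as session:   # one session reuses the connection
#       text = fetch_description(session, "https://example.com/api/apps/123")
```

Reusing a single session for all calls avoids opening a new connection per URL, and parsing a small JSON payload skips building a full HTML tree for every page.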

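【Discussion】:

【Solution 2】:

As per my comment, I would try splitting this across multiple processes. You can put your code into a function and use multiprocessing like this: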

    from multiprocessing import Pool

    import urllib3
    from bs4 import BeautifulSoup

    # Assumption: `http` in the question is a urllib3 PoolManager (it is not defined there).
    http = urllib3.PoolManager()

    def web_scrape(url):
        # Download one page and return the description <div> as a string.
        response = http.request('GET', url)
        soup = BeautifulSoup(response.data, "lxml")
        t = str(soup.find("div", attrs={"class": "section__description"}))
        return t

    if __name__ == '__main__':
        # M is your list of urls
        M = ["https:// ...", "...", "..."]
        p = Pool(5)  # 5 or however many processes you think is appropriate (start with how many cores you have, maybe)
        description = p.map(web_scrape, M)
        p.close()
        p.join()
        description = list(description)  # if you need it to be a list

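What happens is that your list of URLs is distributed across several processes, each running your scraping function. All the results are then merged and end up in description. This should be much faster than processing one URL at a time, as you currently do.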
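The distribute-then-merge behaviour of `map` can be tried without any network access. The sketch below is a stand-in, not the code from this answer: `fake_scrape` and `scrape_all` are hypothetical names, the "URLs" are placeholder strings, and it swaps in `multiprocessing.pool.ThreadPool` (thread-backed, same `map` interface as `Pool`) purely so that it runs as-is in a single notebook cell.

```python
import time
from multiprocessing.pool import ThreadPool


def fake_scrape(url):
    """Stand-in for web_scrape: waits briefly instead of downloading a page."""
    time.sleep(0.05)
    return f"description of {url}"


def scrape_all(urls, workers=5):
    # map() hands the URLs out to the workers and returns the results
    # as a list in the same order as the input.
    with ThreadPool(workers) as pool:
        return pool.map(fake_scrape, urls)


urls = [f"page-{i}" for i in range(10)]   # placeholder strings, not real URLs
descriptions = scrape_all(urls)
print(descriptions[:3])
```

Because results come back in input order, `descriptions[i]` always belongs to `urls[i]`, no matter which worker finished first.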

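【Discussion】:

Matt Creemeens, I am trying your solution and going through the documentation you pointed me to. I have never used multiprocessing before. Thank you very much.
Please let me know how it goes.
The code runs, thank you. I can't see any difference in time, though. Maybe that is because I only have 2 cores.
How many URLs do you have to process? You could also consider running it on a server, such as pythonanywhere, so that it doesn't eat up your own resources and perhaps gets more CPUs.
About 200000. Thanks, I will consider using server resources.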