Answers: Multicore processing on scraper function

【Question title】: Multicore processing on scraper function
【Posted】: 2020-07-19 04:52:28
【Question】:

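I would like to speed up my scraper by using multiple cores, so that several cores work in parallel through the URLs in a list, each calling my predefined function scrape. How can I do this?

Here is my current code: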

for x in URLs['identifier'][1:365]:
    test = scrape(x)
    results = test.get_results
    results['identifier'] = x
    final = final.append(results)

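【Comments】:

You could simply use joblib, or go with threads; this might help: stackoverflow.com/a/62548199/6524169
Would that make each of my 8 cores process an equal share of my 365 URLs? What would the code look like?
I have now added an example below.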

Tags: python pandas web-scraping

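【Solution 1】:

Something like this (or you could also use Scrapy).

Provided the server can handle the load, this makes it easy to issue a large number of requests in parallel: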

# thread_map is just a wrapper around concurrent.futures.ThreadPoolExecutor with a nice tqdm progress bar
import os
from tqdm.contrib.concurrent import thread_map, process_map  # multi-threading and multi-processing, respectively

def chunk_list(lst, size):
    # yield successive slices of at most `size` items
    for i in range(0, len(lst), size):
        yield lst[i:i + size]

def which_func_to_call(item):
    # placeholder: fetch one URL here and wrap the returned response (JSON object, etc.)
    return item

huge_list = []  # placeholder: your list of URLs
your_cpu_cores = os.cpu_count() or 1

for idx, my_chunk in enumerate(chunk_list(huge_list, size=2**12)):
    for response in thread_map(which_func_to_call, my_chunk, max_workers=your_cpu_cores + 6):
        # do something with the response here
        # make sure to cache the chunk results as well (in case you have a lot of them)
        pass
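
To relate this to the loop in the question, here is a minimal sketch using the standard library's concurrent.futures.ThreadPoolExecutor, which is what thread_map wraps. The function fake_scrape is a hypothetical stand-in for the asker's scrape so that the sketch runs without network access; the example identifiers and the worker count of 8 are assumptions as well.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fake_scrape(identifier):
    """Hypothetical stand-in for the question's scrape(); no network access."""
    time.sleep(0.01)  # simulate waiting on a slow server
    return {"identifier": identifier, "length": len(identifier)}


def scrape_all(identifiers, max_workers=8):
    """Run fake_scrape over all identifiers on a pool of threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the same order as the input
        return list(executor.map(fake_scrape, identifiers))


identifiers = [f"id-{n}" for n in range(20)]  # assumed example input
rows = scrape_all(identifiers)
print(len(rows), rows[0])
```

Because each result is a plain dict, the collected list can be passed to pandas.DataFrame(rows) once at the end, rather than appending to a DataFrame inside the loop.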

Or

Use Pool from Python's multiprocessing module:

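from multiprocessing import Pool
import requests
from bs4 import BeautifulSoup  # for parsing res.text; not used in this minimal example

base_url = 'http://quotes.toscrape.com/page/'

all_urls = list()

def generate_urls():
    # better to yield them instead if you already have the list of URLs
    for i in range(1, 11):
        all_urls.append(base_url + str(i))

def scrape(url):
    # a timeout keeps one stalled request from blocking a worker forever
    res = requests.get(url, timeout=10)
    print(res.status_code, res.url)

if __name__ == '__main__':
    # the guard is needed on platforms that start workers by spawning (Windows, macOS)
    generate_urls()

    # map blocks until all URLs are done; leaving the with-block shuts the pool down
    with Pool(10) as p:
        p.map(scrape, all_urls)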
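
On the question from the comments about every core getting an equal share: Pool.map chops the input into chunks and hands them to whichever worker is free, so the split is not fixed in advance, but the optional chunksize argument sets the approximate chunk size. A minimal sketch, assuming a hypothetical fake_scrape in place of the real scrape function, and 8 workers for 365 items as in the question:

```python
import math
import os
from multiprocessing import Pool


def fake_scrape(identifier):
    """Hypothetical stand-in for scrape(); reports which process did the work."""
    return {"identifier": identifier, "pid": os.getpid()}


def even_chunksize(n_items, n_workers):
    """Smallest chunk size that covers n_items with at most n_workers chunks."""
    return math.ceil(n_items / n_workers)


def scrape_all(identifiers, n_workers=8):
    """Map fake_scrape over identifiers, split into at most n_workers chunks."""
    chunksize = even_chunksize(len(identifiers), n_workers)
    with Pool(n_workers) as pool:
        # map blocks until every result is back, in input order
        return pool.map(fake_scrape, identifiers, chunksize=chunksize)


if __name__ == "__main__":
    identifiers = list(range(1, 366))  # assumed: 365 items, as in the question
    rows = scrape_all(identifiers)
    print(len(rows), "results from", len({r["pid"] for r in rows}), "processes")
```

Scraping is usually dominated by waiting on the network, so threads are often enough; separate processes tend to pay off mainly when parsing the pages is CPU-heavy.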

【Comments】: