【Question title】: Multicore processing on scraper function
【Posted】: 2020-07-19 04:52:28
【Question】:

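I want to speed up my scraper by using multiple cores, so that several cores can scrape the URLs in a list using my predefined function scrape. How can I do this?

This is my current code: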

for x in URLs['identifier'][1:365]:
    test = scrape(x)
    results = test.get_results
    results['identifier'] = x
    final = final.append(results)

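【Comments】:

  • You can simply use joblib/go with threads; this might help: stackoverflow.com/a/62548199/6524169
  • Would this have all 8 of my cores each process an equal share of my 365 URLs? What would the code look like?
  • I've now added an example below.

Tags: python pandas web-scraping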


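【Answer 1】:

Something like this (or you could also use Scrapy).

It easily lets you issue a large number of requests in parallel, provided the server can handle them as well.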

# thread_map is just a wrapper around concurrent.futures.ThreadPoolExecutor with a nice tqdm progress bar;
# process_map is the multi-processing counterpart.
import os
from tqdm.contrib.concurrent import thread_map, process_map

def chunk_list(lst, size):
    # yield consecutive slices of at most `size` items
    for i in range(0, len(lst), size):
        yield lst[i:i + size]

# Placeholders to define first: huge_list (your list of URLs) and
# which_func_to_call (a function that takes one item, e.g. scrape).
max_workers = (os.cpu_count() or 1) + 6  # your CPU cores + 6

for idx, my_chunk in enumerate(chunk_list(huge_list, size=2**12)):
    for response in thread_map(which_func_to_call, my_chunk, max_workers=max_workers):
        # which_func_to_call -> wrap the returned response json obj in this, etc
        # do something with the response now..
        # make sure to cache the chunk results as well (in case you have a lot of them)
        pass

Using Pool from the multiprocessing module in Python:

from multiprocessing import Pool
import requests
from bs4 import BeautifulSoup

base_url = 'http://quotes.toscrape.com/page/'

all_urls = list()

def generate_urls():
    # better to yield them as well if you already have the URL's list etc..
    for i in range(1,11):
        all_urls.append(base_url + str(i))
    
def scrape(url):
    res = requests.get(url)
    print(res.status_code, res.url)

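if __name__ == '__main__':
    # the guard is needed where worker processes are spawned (e.g. Windows)
    generate_urls()

    # the with-block terminates the pool once map() has returned
    with Pool(10) as p:
        p.map(scrape, all_urls)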
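
The scrape function above only prints, so nothing comes back to the parent process. A minimal sketch of collecting return values from Pool.map might look like the following; fetch_summary, build_urls and run are hypothetical names, and fetch_summary deliberately avoids a real request (a real version would call requests.get(url) and return whatever fields are needed).

```python
from multiprocessing import Pool

BASE_URL = "http://quotes.toscrape.com/page/"


def build_urls(first=1, last=10):
    # Build the page URLs up front instead of appending to a global list.
    return [f"{BASE_URL}{page}" for page in range(first, last + 1)]


def fetch_summary(url):
    # Hypothetical stand-in for a real request: it only parses the page
    # number out of the URL so the sketch runs offline.
    page = int(url.rsplit("/", 1)[-1])
    return {"url": url, "page": page}


def run(urls, processes=4):
    # Pool.map blocks until every URL is done and returns results in input order.
    with Pool(processes) as pool:
        return pool.map(fetch_summary, urls)


if __name__ == "__main__":
    summaries = run(build_urls())
    print(len(summaries), summaries[0])
```

The worker function has to be defined at module level so that it can be pickled and sent to the worker processes.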
