Answers: How to implement multiprocessing in my BeautifulSoup WebScraper

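【Question title】: How to implement multiprocessing in my BeautifulSoup WebScraper
【Posted】: 2020-01-23 14:13:19
【Question description】:

I built a web scraper with Python and the BeautifulSoup library. It works well; the only problem is that it is very slow. I now want to add multiprocessing to speed it up, but I don't know how.

My code has two parts. The first part scrapes the website to generate the URLs I want to scrape further and appends them to a list. The first part looks like this: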

from bs4 import BeautifulSoup
import requests
from datetime import date, timedelta
from multiprocessing import Pool

headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

links = [["Cross-Country", "https://www.fis-ski.com/DB/cross-country/cup-standings.html", "?sectorcode=CC&seasoncode={}&cupcode={}&disciplinecode=ALL&gendercode={}&nationcode="],
         ["Ski Jumping", "https://www.fis-ski.com/DB/ski-jumping/cup-standings.html", ""],
         ["Nordic Combined", "https://www.fis-ski.com/DB/nordic-combined/cup-standings.html", ""],
         ["Alpine", "https://www.fis-ski.com/DB/alpine-skiing/cup-standings.html", ""]]

# FOR LOOP FOR GENERATING URLS FOR SCRAPING

all_urls = []
for link in links[:1]:
    
    response = requests.get(link[1], headers = headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    discipline = link[0]
    print(discipline)

    season_list = []
    competition_list = []
    gender_list = ["M", "L"]

    
    all_seasons = soup.find_all("div", class_ = "select select_size_medium")[0].find_all("option")
    for season in all_seasons:
        season_list.append(season.text)

    all_competitions = soup.find_all("div", class_ = "select select_size_medium")[1].find_all("option")
    for competition in all_competitions:
        competition_list.append([competition["value"], competition.text])


    for gender in gender_list:
        for competition in competition_list[:1]:
            for season in season_list[:2]:

                url = link[1] + link[2].format(season, competition[0], gender)
                all_urls.append([discipline, season, competition[1], gender, url])
                
                print(discipline, season, competition[1], gender, url)
                print()

print(len(all_urls))
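The three nested loops above can also be written as a single loop over `itertools.product`. The following is only a sketch of that idea, not the asker's code: `build_urls` is a hypothetical helper, and the season and competition values are sample data standing in for what the page's `<option>` tags would supply. It makes no network requests.

```python
from itertools import product


def build_urls(discipline, base_url, query_template, seasons, competitions, genders):
    """Return one [discipline, season, competition_name, gender, url] row per combination."""
    rows = []
    # product() varies genders slowest and seasons fastest,
    # which matches the order of the nested loops in the question.
    for gender, (cup_code, cup_name), season in product(genders, competitions, seasons):
        url = base_url + query_template.format(season, cup_code, gender)
        rows.append([discipline, season, cup_name, gender, url])
    return rows


# Hypothetical sample values; the real ones are read from the page.
rows = build_urls(
    "Cross-Country",
    "https://www.fis-ski.com/DB/cross-country/cup-standings.html",
    "?sectorcode=CC&seasoncode={}&cupcode={}&disciplinecode=ALL&gendercode={}&nationcode=",
    seasons=["2020", "2019"],
    competitions=[["WC", "World Cup"]],
    genders=["M", "L"],
)
print(len(rows))  # 2 genders * 1 competition * 2 seasons
```

Keeping URL generation in a function that returns plain lists also makes it easy to hand the result to a worker pool later.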

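This first part generates more than 4,500 links, but I added some index limits so that it generates only 8. The second part of the code is a function that is essentially a for loop: it goes through the URLs one by one and scrapes specific data. The second part: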

# FUNCTION FOR SCRAPING
def parse():
    for url in all_urls:

        response = requests.get(url[4], headers = headers)
        soup = BeautifulSoup(response.text, 'html.parser')

        all_skier_names = soup.find_all("div", class_ = "g-xs-10 g-sm-9 g-md-4 g-lg-4 justify-left bold align-xs-top")
        all_countries = soup.find_all("span", class_ = "country__name-short")

        
        discipline = url[0]
        season = url[1]
        competition = url[2]
        gender = url[3]

        
        for name, country in zip(all_skier_names , all_countries):

            skier_name = name.text.strip().title()
            country = country.text.strip()
            
            print(discipline, "|", season, "|", competition, "|", gender, "|", country, "|", skier_name)

        print()

parse()

I have read some documentation, and I think my multiprocessing part should look like this:

p = Pool(10)  # Pool tells how many at a time
records = p.map(parse, all_urls)
p.terminate()
p.join()

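But I ran this, waited 30 minutes, and nothing happened. What am I doing wrong, and how can I use a pool to implement multiprocessing so that I can scrape 10 or more URLs at the same time?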

【Question comments】:

This might help you: stackoverflow.com/questions/59086617/…
Sorry, but that didn't help me.
FYI, it's scraper and scraping, not scrapper or scrapping.

Tags: python web-scraping beautifulsoup

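【Solution 1】:

Here is a simple implementation using multiprocessing.Pool. Note that I use the tqdm module to show a nice progress bar (useful for seeing the current progress of a long-running program):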

from bs4 import BeautifulSoup
import requests
from datetime import date, timedelta
from multiprocessing import Pool
import tqdm

headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

def parse(url):
    response = requests.get(url[4], headers = headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    all_skier_names = soup.find_all("div", class_ = "g-xs-10 g-sm-9 g-md-4 g-lg-4 justify-left bold align-xs-top")
    all_countries = soup.find_all("span", class_ = "country__name-short")

    discipline = url[0]
    season = url[1]
    competition = url[2]
    gender = url[3]

    out = []
    for name, country in zip(all_skier_names , all_countries):
        skier_name = name.text.strip().title()
        country = country.text.strip()
        out.append([discipline, season,  competition,  gender,  country,  skier_name])

    return out

# here I hard-coded all_urls:
all_urls = [['Cross-Country', '2020', 'World Cup', 'M', 'https://www.fis-ski.com/DB/cross-country/cup-standings.html?sectorcode=CC&seasoncode=2020&cupcode=WC&disciplinecode=ALL&gendercode=M&nationcode='], ['Cross-Country', '2020', 'World Cup', 'L', 'https://www.fis-ski.com/DB/cross-country/cup-standings.html?sectorcode=CC&seasoncode=2020&cupcode=WC&disciplinecode=ALL&gendercode=L&nationcode='], ['Ski Jumping', '2020', 'World Cup', 'M', 'https://www.fis-ski.com/DB/ski-jumping/cup-standings.html'], ['Ski Jumping', '2020', 'World Cup', 'L', 'https://www.fis-ski.com/DB/ski-jumping/cup-standings.html'], ['Nordic Combined', '2020', 'World Cup', 'M', 'https://www.fis-ski.com/DB/nordic-combined/cup-standings.html'], ['Nordic Combined', '2020', 'World Cup', 'L', 'https://www.fis-ski.com/DB/nordic-combined/cup-standings.html'], ['Alpine', '2020', 'World Cup', 'M', 'https://www.fis-ski.com/DB/alpine-skiing/cup-standings.html'], ['Alpine', '2020', 'World Cup', 'L', 'https://www.fis-ski.com/DB/alpine-skiing/cup-standings.html']]

with Pool(processes=2) as pool, tqdm.tqdm(total=len(all_urls)) as pbar: # create a Pool of processes (only 2 in this example) and a tqdm progress bar
    all_data = []                                                       # this list collects the rows returned from the parse() function
    for data in pool.imap_unordered(parse, all_urls):                   # send each item of all_urls to parse() (run concurrently in the process pool); results arrive unordered, as soon as each one is available
        all_data.extend(data)                                           # add the returned rows to all_data
        pbar.update()                                                   # advance the progress bar by one

# Note:
# this for-loop will have 8 iterations (because all_urls has 8 links)

# print(all_data) # <-- this is your data
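One difference from the code in the question is worth spelling out: `Pool.map` and `Pool.imap_unordered` call the worker once per element of the iterable, passing that element as the only argument, which is the same calling convention as the built-in `map`. That is why `parse(url)` here takes one row and returns a list, instead of looping over `all_urls` itself. The sketch below illustrates the convention with the built-in `map` so that it runs without processes or network access; `parse_all` and `parse_one` are hypothetical stand-ins, and the skier names are made-up data.

```python
def parse_all():
    """Zero-argument worker, shaped like the question's parse() that loops by itself."""
    return []


def parse_one(row):
    """One-argument worker: receives a single row and returns a list of output rows."""
    discipline, season = row
    return [[discipline, season, "skier A"], [discipline, season, "skier B"]]


rows = [["Cross-Country", "2020"], ["Alpine", "2020"]]

# map() calls parse_all(rows[0]), which a zero-argument function cannot accept.
try:
    list(map(parse_all, rows))
except TypeError as error:
    print("zero-argument worker:", error)

# With a one-argument worker, each call returns a list that is merged with extend().
all_data = []
for data in map(parse_one, rows):
    all_data.extend(data)
print(len(all_data))
```

Swapping the built-in `map` for `pool.imap_unordered` keeps the same worker contract; only the order in which results arrive may change.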

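【Comments】:

I ran your code with 4 links. It has been more than 10 minutes and the progress bar is still at 0%. Running a single link without multiprocessing takes about 20-30 seconds. I also ran print(all_data) at the end of the code, but it printed nothing.
Do you think the problem could be related to the indexing (url[0], url[1], url[2], url[3], url[4])? It has been about 30 minutes now, my code is still running, it is still at 0%, and nothing has been printed on screen.
@taga Did you copy the code exactly as it is in my answer? On my machine it takes about 5 seconds, and printing len(all_data) gives me 640.
I tried hard-coding it, and it's still the same... I'm waiting... but it's still at 0%.
@taga I added some comments at the end.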

【Solution 2】:

The code posted by @andrej-kesely works fine in IDLE. Make sure the code is indented properly where it should be.

【Comments】:

I don't know what the problem is. I restarted my computer and ran the code Andrej posted, and all I see is: 0%| | 0/8 [00:00, ?it/s]
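The thread does not establish why the progress bar stays at 0% on the asker's machine. One thing that may be worth checking, offered here only as a possibility: on platforms where multiprocessing starts workers with the spawn method (the default on Windows), each worker re-imports the main module, so the Python documentation recommends placing pool-creating code under an `if __name__ == "__main__":` guard and defining the worker at module level. The sketch below shows that layout. `parse` is a stand-in that does no network access, and `scrape_all` and the sample rows are hypothetical names and data introduced for illustration.

```python
from multiprocessing import Pool


def parse(row):
    """Stand-in worker (no network): returns one output row per input row."""
    discipline, season, competition, gender, url = row
    return [[discipline, season, competition, gender, len(url)]]


def scrape_all(rows, processes=2):
    """Run parse() over rows in a process pool and merge the returned lists."""
    all_data = []
    with Pool(processes=processes) as pool:
        # Results arrive as soon as each worker finishes, in no fixed order.
        for data in pool.imap_unordered(parse, rows):
            all_data.extend(data)
    return all_data


if __name__ == "__main__":
    # Only the parent process runs this block; workers that re-import
    # the module skip it and therefore do not create pools of their own.
    sample = [
        ["Alpine", "2020", "World Cup", "M", "https://example.invalid/a"],
        ["Alpine", "2020", "World Cup", "L", "https://example.invalid/b"],
    ]
    print(len(scrape_all(sample)))
```

Running the file as a script from a terminal, instead of from an interactive session, is another variable that could be ruled out, since the worker function has to be importable by the child processes.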