【Question title】: Multiprocessing for web scraping won't start on Windows and Mac
【Posted】: 2020-01-24 08:00:04
【Question】:

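A few days ago I asked a question here about multiprocessing, and a user posted the answer you can see below. The only problem is that the answer works on his machine but not on mine.

I have tried it on Windows (Python 3.6) and on Mac (Python 3.8). I ran the code in the basic Python IDLE that ships with the installer, in PyCharm on Windows, and in Jupyter Notebook, but nothing happens. I have 32-bit Python. Here is the code: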

from bs4 import BeautifulSoup
import requests
from datetime import date, timedelta
from multiprocessing import Pool
import tqdm

headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

def parse(url):
    print("im in function")

    response = requests.get(url[4], headers = headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    all_skier_names = soup.find_all("div", class_ = "g-xs-10 g-sm-9 g-md-4 g-lg-4 justify-left bold align-xs-top")
    all_countries = soup.find_all("span", class_ = "country__name-short")

    discipline = url[0]
    season = url[1]
    competition = url[2]
    gender = url[3]

    out = []
    for name, country in zip(all_skier_names , all_countries):
        skier_name = name.text.strip().title()
        country = country.text.strip()
        out.append([discipline, season,  competition,  gender,  country,  skier_name])

    return out

all_urls = [['Cross-Country', '2020', 'World Cup', 'M', 'https://www.fis-ski.com/DB/cross-country/cup-standings.html?sectorcode=CC&seasoncode=2020&cupcode=WC&disciplinecode=ALL&gendercode=M&nationcode='],
            ['Cross-Country', '2020', 'World Cup', 'L', 'https://www.fis-ski.com/DB/cross-country/cup-standings.html?sectorcode=CC&seasoncode=2020&cupcode=WC&disciplinecode=ALL&gendercode=L&nationcode='],
            ['Cross-Country', '2020', 'World Cup', 'M', 'https://www.fis-ski.com/DB/cross-country/cup-standings.html?sectorcode=CC&seasoncode=2020&cupcode=WC&disciplinecode=ALL&gendercode=M&nationcode='],
            ['Cross-Country', '2020', 'World Cup', 'L', 'https://www.fis-ski.com/DB/cross-country/cup-standings.html?sectorcode=CC&seasoncode=2020&cupcode=WC&disciplinecode=ALL&gendercode=L&nationcode=']]

with Pool(processes=2) as pool, tqdm.tqdm(total=len(all_urls)) as pbar:
    all_data = []
    print("im in pool")

    for data in pool.imap_unordered(parse, all_urls):
        print("im in data")

        all_data.extend(data)
        pbar.update()

print(all_data) 

The only thing I see when I run the code is the progress bar, and it always stays at 0%:

  0%|          | 0/8 [00:00<?, ?it/s]

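I put several print statements in the parse(url) function and in the for loop at the end of the code, but only "im in pool" is printed. It looks as if the code never enters the function, and never enters the for loop at the end of the code either.

The code should finish in 5-8 seconds, but I waited 10 minutes and nothing happened. I also tried it without the progress bar, but the result was the same.

Do you know where the problem is? Is it the Python version I am using (Python 3.6 32-bit) or the version of some library? I don't know what to do...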
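One possible cause, not confirmed anywhere in this thread: on Windows, and on macOS since Python 3.8, multiprocessing starts workers with the "spawn" method, which re-imports the main module in every worker. The Pool therefore has to be created under an `if __name__ == "__main__":` guard, and the worker function has to live in an importable module, which is not the case for code typed into IDLE or a Jupyter cell. The sketch below shows only that structure. The body of `parse` is a stand-in without network access, and `main` and the example.com URLs are hypothetical names introduced for illustration.

```python
from multiprocessing import Pool


def parse(row):
    # Stand-in for the real scraper (assumption): the original would call
    # requests.get(row[4]) and parse the HTML with BeautifulSoup here.
    discipline, season, competition, gender, _url = row
    return [[discipline, season, competition, gender]]


def main():
    # Hypothetical input rows in the same shape as all_urls in the question.
    all_urls = [
        ["Cross-Country", "2020", "World Cup", "M", "https://example.com/m"],
        ["Cross-Country", "2020", "World Cup", "L", "https://example.com/l"],
    ]
    all_data = []
    with Pool(processes=2) as pool:
        # Results arrive in completion order, not input order.
        for data in pool.imap_unordered(parse, all_urls):
            all_data.extend(data)
    return all_data


# Under "spawn", workers re-import this file; the guard keeps them
# from creating a Pool of their own.
if __name__ == "__main__":
    print(main())
```

Saving this as a .py file and running it from a terminal, rather than from IDLE or a notebook, would be the simplest way to check whether the start method is the issue.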

【Question discussion】:

    Tags: python web-scraping multiprocessing pool


    【Solution 1】:

    A better option for you is multithreading, which Python provides through the threading module:

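    import logging
    import threading

    # scraper_list, scraper_checker, error_file and success_file are assumed
    # to be defined elsewhere in the answerer's script; they are not shown here.

    if __name__ == "__main__":
        logging.basicConfig(level=logging.INFO)
        threads = []

        # Start one thread per scraper.
        for scraper in scraper_list:
            logging.info("Main    : create and start thread %s.", scraper)
            x = threading.Thread(target=scraper_checker, args=(scraper,))
            threads.append(x)
            x.start()

        # Wait for every thread to finish before cleaning up.
        for index, thread in enumerate(threads):
            thread.join()
            logging.info("Main    : thread %d done", index)

        error_file.close()
        success_file.close()

        print("Done!")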
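The same thread-based idea can also be written with the standard-library `concurrent.futures.ThreadPoolExecutor`, which collects return values directly. This is a minimal sketch and not the answerer's code: `scrape_all` and `fake_parse` are hypothetical names, and `fake_parse` stands in for the question's `parse` so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_all(rows, parse, max_workers=4):
    """Run parse(row) for every row in a thread pool and flatten the results."""
    all_data = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as the input rows.
        for data in executor.map(parse, rows):
            all_data.extend(data)
    return all_data


def fake_parse(row):
    # Hypothetical stand-in for the question's parse(); the real function
    # would download row[4] and extract skier names and countries.
    discipline, season, competition, gender, _url = row
    return [[discipline, season, competition, gender]]


if __name__ == "__main__":
    rows = [
        ["Cross-Country", "2020", "World Cup", "M", "https://example.com/m"],
        ["Cross-Country", "2020", "World Cup", "L", "https://example.com/l"],
    ]
    print(scrape_all(rows, fake_parse))
```

Threads share the interpreter process, so nothing has to be pickled or re-imported, and for I/O-bound work such as HTTP requests they usually give most of the speed-up a process pool would.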
    

    【Discussion】:
