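Web Scraper multithreading Python 3: answers

【Question title】: Webscrape multithread python 3
【Posted】: 2023-03-26 18:25:01
【Question】:

I have been writing a simple web scraper to learn how to code. It works, but I would like to make it faster. How can I add multithreading to this program? All it does is open a file of stock symbols and look up each stock's price online.

Here is my code: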

import urllib.request
import urllib
from threading import Thread

symbolsfile = open("Stocklist.txt")

symbolslist = symbolsfile.read()

thesymbolslist = symbolslist.split("\n")

i=0


while i<len (thesymbolslist):
    theurl = "http://www.google.com/finance/getprices?q=" + thesymbolslist[i] + "&i=10&p=25m&f=c"
    thepage = urllib.request.urlopen(theurl)
    # read the character encoding from the `Content-Type` response header
    charset_encoding = thepage.info().get_content_charset()
    # apply encoding
    thepage = thepage.read().decode(charset_encoding)
    print(thesymbolslist[i] + " price is " + thepage.split()[len(thepage.split())-1])
    i= i+1

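【Question comments】:

Can you show what Stocklist.txt looks like?
It is just a text document containing all the stock names. It looks like this: ABY ABEO ABEOW ABIL ABMD AXAS ACIA ACTG and so on, each followed by a newline.
Also try using requests. It is better than urllib.
Isn't urllib.request the same thing?

Tags: python multithreading python-3.x web-scraping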

【Answer 1】:

If you are just applying a function to every item in a list, I recommend multiprocessing.Pool.map(function, list).

https://docs.python.org/3/library/multiprocessing.html?highlight=multiprocessing%20map#multiprocessing.pool.Pool.map
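As a rough sketch of how that could look for the question's symbol list: the example below swaps in multiprocessing.pool.ThreadPool, which offers the same map interface as Pool but uses threads, since the work here is mostly waiting on the network. The helper names (download, price_for, prices_for), the `fetch` and `workers` parameters, and the "utf-8" fallback are assumptions made for illustration; the URL is copied from the question and may no longer respond.

```python
from functools import partial
from multiprocessing.pool import ThreadPool
import urllib.request

# URL pattern taken from the question; it may no longer be served.
URL_TEMPLATE = "http://www.google.com/finance/getprices?q={}&i=10&p=25m&f=c"


def download(symbol):
    """Fetch the raw page text for one symbol (performs a network call)."""
    with urllib.request.urlopen(URL_TEMPLATE.format(symbol)) as page:
        # Fall back to UTF-8 (an assumption) when the server sends no charset.
        charset = page.info().get_content_charset() or "utf-8"
        return page.read().decode(charset)


def price_for(symbol, fetch=download):
    """Return (symbol, last token of the page), mirroring the question's parsing."""
    tokens = fetch(symbol).split()
    return symbol, (tokens[-1] if tokens else None)


def prices_for(symbols, fetch=download, workers=8):
    """Map price_for over the symbols with a pool of worker threads."""
    with ThreadPool(workers) as pool:
        # map() blocks until every symbol is done and keeps the input order.
        return pool.map(partial(price_for, fetch=fetch), symbols)
```

Calling `prices_for(thesymbolslist)` would replace the question's `while` loop. Passing a different `fetch` function makes it possible to try the pool without touching the network.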

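【Comments】:

【Answer 2】:

You should use asyncio. It is a neat package that can also help you with scraping. I wrote a small snippet showing how to integrate with LinkedIn using asyncio, but you can easily adapt it to your needs.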

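import asyncio
import requests

def scrape_first_site():
    url = 'http://example.com/'
    # Blocking HTTP call; return the response so the caller can use it
    return requests.get(url)


def scrape_another_site():
    url = 'http://example.com/other/'
    return requests.get(url)


async def main():
    loop = asyncio.get_running_loop()

    # Passing None as the executor selects the loop's default executor
    tasks = [
        loop.run_in_executor(None, scrape_first_site),
        loop.run_in_executor(None, scrape_another_site)
    ]

    # Wait for both blocking calls to finish and collect their responses
    return await asyncio.gather(*tasks)

# asyncio.run creates the event loop, runs main() and closes the loop
responses = asyncio.run(main())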

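Because the default executor is a ThreadPoolExecutor, each task runs in a separate thread. If you would rather run the tasks in processes instead of threads (for example, because of GIL-related issues), you can use a ProcessPoolExecutor.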
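To make the executor choice explicit, here is a minimal sketch that passes an executor to run_in_executor instead of None and applies one function to a whole list. The names run_all and scrape_all and the `max_workers` parameter are hypothetical and exist only for this illustration. It uses ThreadPoolExecutor; switching to ProcessPoolExecutor should mostly be a matter of changing the class, provided the function and its arguments can be pickled.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def run_all(func, items, executor=None):
    """Run func(item) for every item in the executor; results keep input order."""
    loop = asyncio.get_running_loop()
    futures = [loop.run_in_executor(executor, func, item) for item in items]
    return await asyncio.gather(*futures)


def scrape_all(func, items, max_workers=4):
    """Blocking entry point that owns both the executor and the event loop."""
    # Swap in concurrent.futures.ProcessPoolExecutor here to use processes;
    # func must then be a picklable, module-level function.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return asyncio.run(run_all(func, items, executor))
```

For example, `scrape_all(fetch_price, thesymbolslist)` would call a (hypothetical) single-symbol function `fetch_price` once per symbol, with at most `max_workers` calls in flight at a time.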

【Comments】: