How to speed up web crawling in Python? Answers

【Question title】: How to speed up web crawling in Python?
【Posted】: 2014-05-12 08:12:00
【Question description】:

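I am crawling with the urllib.urlopen() method and BeautifulSoup. I am not satisfied with the crawl speed, and I am wondering what urllib actually fetches; my guess is that it has to load more than just the HTML. I could not find in the docs whether it reads or checks larger data (images, Flash, etc.) by default.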

So, if urllib does have to load things such as images, Flash, JS..., how can I avoid GET requests for those data types?

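【Comments】:

Are you trying to load multiple websites at the same time?
OK, thanks for the question.
Take a look at this question here; maybe you can use those techniques to handle more requests at the same time. It can make a big difference (given enough bandwidth, most of the delay is just "waiting").
You could look at Scrapy (scrapy.org) for web scraping in Python. By default it processes pages in parallel.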

Tags: python web-services web-crawler urllib

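【Solution 1】:

Try requests. When you go through a requests.Session, it reuses HTTP connections via connection pooling, which speeds up crawling.

It also handles other things such as cookies and auth better than urllib, and it works well with BeautifulSoup.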
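
To make the connection-pooling point concrete, here is a minimal sketch, assuming the third-party requests package is installed. The function name fetch_pages and its parameters are hypothetical names chosen for illustration; only requests.Session, Session.get, Response.raise_for_status and Response.text are part of the library.

```python
import requests


def fetch_pages(urls, session=None, timeout=10):
    """Fetch every URL through one shared session; return {url: html_text}.

    fetch_pages is a hypothetical helper, not part of requests.
    """
    # A Session keeps connections alive and reuses them for later requests
    # to the same host, so the TCP/TLS setup is not repeated per request.
    if session is None:
        session = requests.Session()
    pages = {}
    for url in urls:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()  # raise on HTTP 4xx/5xx responses
        pages[url] = response.text
    return pages
```

The returned text for each page can then be handed to BeautifulSoup for parsing. The gain from pooling is largest when many URLs live on the same host.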

【Discussion】:

【Solution 2】:

Use threads! It's super simple. Here is an example. You can change the number of connections as needed.

# Python 3 (the Python 2 equivalents are Queue and urllib.urlretrieve)
import queue
import threading
import urllib.request

urls = [
    'http://www.google.com',
    'http://www.amazon.com',
    'http://www.ebay.com',
    'http://www.google.com',
    'http://www.amazon.com',
    'http://www.ebay.com',
    'http://www.google.com',
    'http://www.amazon.com',
    'http://www.ebay.com',
    ]

# Fill a thread-safe queue with (url, filename) work items
work_queue = queue.Queue()
for x, url in enumerate(urls):
    # Drop the scheme so the file name contains no slashes
    filename = "datafile%s-%s" % (x, url.replace('http://', ''))
    work_queue.put((url, filename))


num_connections = 10  # number of worker threads, i.e. concurrent downloads

class WorkerThread(threading.Thread):
    def __init__(self, work_queue):
        threading.Thread.__init__(self)
        self.work_queue = work_queue

    def run(self):
        while True:
            try:
                url, filename = self.work_queue.get_nowait()
            except queue.Empty:
                return  # no work left, so let this thread finish

            # Download the page and save it to the local file
            urllib.request.urlretrieve(url, filename)

# Start the threads
threads = []
for _ in range(num_connections):
    t = WorkerThread(work_queue)
    t.start()
    threads.append(t)


# Wait for all threads to finish
for thread in threads:
    thread.join()
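
On Python 3, the same worker-pool idea can also be sketched with the standard library's concurrent.futures module instead of hand-written thread classes. This is an alternative to the answer's code, not a restatement of it; the names fetch, fetch_all, fetch_one and max_workers are hypothetical and chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request


def fetch(url, timeout=10):
    """Download one URL and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch_one=fetch, max_workers=10):
    """Fetch URLs concurrently; return a list of (url, body_or_exception)."""
    def safe_fetch(url):
        try:
            return url, fetch_one(url)
        except Exception as exc:  # keep one bad URL from aborting the batch
            return url, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as the input URLs
        return list(pool.map(safe_fetch, urls))
```

Calling fetch_all(urls) with the list from the example above would download the pages with up to max_workers requests in flight. Passing a different fetch_one makes it possible to swap in another HTTP client.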

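【Discussion】:

As I see it, this is a multithreading solution. What I want to know is how to exclude non-HTML content.
You can use a "blacklist" to skip URLs that point to content you don't need. For example... blacklist = ['.jpeg','.jpg','.gif']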
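
The blacklist idea from the last comment could look roughly like the sketch below. Note that urlopen only downloads the single URL it is given and does not fetch the images or scripts a page references, so a filter like this matters for the links the crawler itself decides to follow. The helper names is_blacklisted and filter_urls are hypothetical; the extension list is the one from the comment.

```python
from urllib.parse import urlparse

# Extensions to skip, taken from the comment above
BLACKLIST = ('.jpeg', '.jpg', '.gif')


def is_blacklisted(url, blacklist=BLACKLIST):
    """Return True if the URL's path ends with a blacklisted extension."""
    # Look only at the path so query strings and fragments are ignored,
    # and compare in lower case so '.JPG' is matched as well.
    path = urlparse(url).path.lower()
    return path.endswith(tuple(blacklist))


def filter_urls(urls, blacklist=BLACKLIST):
    """Keep only the URLs that are not blacklisted."""
    return [url for url in urls if not is_blacklisted(url, blacklist)]
```

Extension matching is only a heuristic, because a URL without a file extension can still serve an image.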