Email harvester in Python: How to improve performance? (Answers)

【Question title】: Email harvester in Python: How to improve performance?
【Posted】: 2013-01-24 00:42:19
【Question description】:

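I found a program called "Best Email Extractor" (http://www.emailextractor.net/). The website says it is written in Python. I tried to write a similar program. That program extracts about 300 to 1000 email addresses per minute. My program extracts about 30 to 100 per hour. Could someone give me some tips on how to improve my program's performance? I wrote the following: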

import sqlite3 as sql
import urllib2
import re
import lxml.html as lxml
import time
import threading


def getUrls(start):

    urls = []
    try:
        dom = lxml.parse(start).getroot()
        dom.make_links_absolute()

        for url in dom.iterlinks():
            if not '.jpg' in url[2]:
                if not '.JPG' in url[2]:
                    if not '.ico' in url[2]:
                        if not '.png' in url[2]:
                            if not '.jpeg' in url[2]:
                                if not '.gif' in url[2]:
                                    if not 'youtube.com' in url[2]:
                                        urls.append(url[2])
    except:
        pass

    return urls

def getURLContent(urlAdresse):

    try:
      url = urllib2.urlopen(urlAdresse)
      text = url.read()
      url.close()
      return text
    except:
        return '<html></html>'

def harvestEmail(url):
    text = getURLContent(url)

    emails = re.findall('[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}', text)

    if emails:
        if saveEmail(emails[0]) == 1:
            print emails[0]

def saveUrl(url):

    connection = sql.connect('url.db')

    url = (url, )

    with connection:
        cursor = connection.cursor()
        cursor.execute('SELECT COUNT(*) FROM urladressen WHERE adresse = ?', url)
        data = cursor.fetchone()
        if(data[0] == 0):
            cursor.execute('INSERT INTO urladressen VALUES(NULL, ?)', url)
            return 1
        return 0

def saveEmail(email):
    connection = sql.connect('emails.db')
    email = (email, )

    with connection:
        cursor = connection.cursor()
        cursor.execute('SELECT COUNT(*) FROM addresse WHERE email = ?', email)
        data = cursor.fetchone()
        if(data[0] == 0):
            cursor.execute('INSERT INTO addresse VALUES(NULL, ?)', email)
            return 1
    return 0

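def searchrun(urls):
    # Work through a queue of pending URLs. Removing items from the list
    # while iterating over it skips entries, and rebinding `urls` inside the
    # loop means newly discovered links would never be visited.
    pending = list(urls)
    while pending:
        url = pending.pop(0)
        if saveUrl(url) == 1:
            #time.sleep(0.6)
            harvestEmail(url)
            print url
            pending.extend(getUrls(url))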

urls1 = getUrls('http://www.google.de/#hl=de&tbo=d&output=search&sclient=psy-ab&q=DVD')
urls2 = getUrls('http://www.google.de/#hl=de&tbo=d&output=search&sclient=psy-ab&q=Jolie')
urls3 = getUrls('http://www.finanzen.net')
urls4 = getUrls('http://www.google.de/#hl=de&tbo=d&output=search&sclient=psy-ab&q=Party')
urls5 = getUrls('http://www.google.de/#hl=de&tbo=d&output=search&sclient=psy-ab&q=Games')
urls6 = getUrls('http://www.spiegel.de')
urls7 = getUrls('http://www.kicker.de/')
urls8 = getUrls('http://www.chessbase.com')
urls9 = getUrls('http://www.nba.com')
urls10 = getUrls('http://www.nfl.com')


try:
    threads = []
    urls = (urls1, urls2, urls3, urls4, urls5, urls6, urls7, urls8, urls9, urls10)

    for urlList in urls:
        # Thread.start() returns None, so keep the Thread object itself for join()
        thread = threading.Thread(target=searchrun, args=(urlList, ))
        thread.start()
        threads.append(thread)
    print threading.activeCount()
    for thread in threads:
        thread.join()
except RuntimeError:
    print RuntimeError

【Question comments】:

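-1. What are you going to do with these emails, my friend? Invite people to your party???
This isn't about the emails. I'm interested in how to fetch websites faster.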

Tags: python performance email python-multithreading

【Solution 1】:

I don't think many people will help you harvest emails. It is a generally detested activity.

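As for the performance bottlenecks in your code, you need to find out where the time is going by profiling. At the lowest level, replace each function with a dummy that does no processing but returns valid output; the email collector could, for instance, return a list of the same address 100 times (or however many there are in these URL results). That will tell you which function is costing you the time.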
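A minimal sketch of that profiling step, assuming Python 3 and the standard-library cProfile and pstats modules (no particular tool is named above). The functions fake_get_url_content, fake_save_email, crawl and profile_crawl are hypothetical stand-ins written for this illustration, not the code from the question:

```python
import cProfile
import io
import pstats
import time


def fake_get_url_content(url):
    # Hypothetical dummy for getURLContent: pretends the network is slow
    # and always returns the same valid page.
    time.sleep(0.01)
    return "<html>someone@example.com</html>"


def fake_save_email(email):
    # Hypothetical dummy for saveEmail: does no database work at all.
    return 1


def crawl(urls):
    # Simplified crawl loop that only calls the dummies above.
    saved = 0
    for url in urls:
        text = fake_get_url_content(url)
        if "@" in text:
            saved += fake_save_email(text)
    return saved


def profile_crawl(urls):
    # Run crawl() under the profiler and return (result, report text).
    profiler = cProfile.Profile()
    result = profiler.runcall(crawl, urls)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


if __name__ == "__main__":
    saved, report = profile_crawl(["http://example.com/%d" % i for i in range(20)])
    print(report)
```

Swapping one dummy at a time back to the real function and comparing the cumulative times should show which part dominates.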

Things that stand out:

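1. Fetch the files behind the URLs from the servers beforehand; if you spam Google with requests every time you run the script, they are likely to block you. Reading from disk is faster than requesting a file over the Internet, and the two steps can be done separately and concurrently.
2. The database code creates a new connection on every call to saveEmail and so on, which means it will spend most of its time on handshaking and authentication. It is better to have an object that keeps the connection open between calls, or better still to insert multiple records at once.
3. Once the network and database issues are sorted out, the regular expression could probably be wrapped in \b so that matching backtracks less.
4. A chain of if not 'foo' in str: followed by if not 'blah' in str ... is poor coding. Extract the final segment once and check it against multiple values by building a set, or even a frozenset, of all the disallowed values, such as ignoredExtensions = set(['jpg', 'png', 'gif']), and then comparing with if not extension in ignoredExtensions. Note also that lowercasing the extension first means fewer checks, and it works whether the file is jpg or JPG.
5. Finally, consider running the same script from several command lines instead of using threads. Apart from coordinating the different URL lists, there is no real need for threading inside the script. Frankly, it would be much simpler to keep each URL list in a file and launch a separate script to process each one. Let the OS do the multithreading; it is better at it.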
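The database, regular-expression and extension-check suggestions above could be sketched roughly as follows. This is written for Python 3 rather than the Python 2 of the question, and the names IGNORED_EXTENSIONS, EMAIL_RE, is_wanted and EmailStore are hypothetical. The table layout, in particular the UNIQUE constraint on the email column, is an assumption; the question does not show its schema.

```python
import re
import sqlite3
from urllib.parse import urlparse

# Disallowed file extensions, stored lowercase so one lookup covers jpg and JPG.
IGNORED_EXTENSIONS = frozenset(["jpg", "jpeg", "png", "gif", "ico"])

# The pattern from the question, compiled once and wrapped in \b.
EMAIL_RE = re.compile(r"\b[\w\-][\w\-.]+@[\w\-][\w\-.]+[a-zA-Z]{1,4}\b")


def is_wanted(url):
    # Take whatever follows the last dot of the URL path, lowercased.
    path = urlparse(url).path
    extension = path.rsplit(".", 1)[-1].lower() if "." in path else ""
    return extension not in IGNORED_EXTENSIONS


class EmailStore:
    """Keeps one SQLite connection open between calls."""

    def __init__(self, path):
        self.connection = sqlite3.connect(path)
        with self.connection:
            # Assumed schema: the UNIQUE constraint lets SQLite reject
            # duplicates, replacing the SELECT COUNT(*) check.
            self.connection.execute(
                "CREATE TABLE IF NOT EXISTS addresse "
                "(id INTEGER PRIMARY KEY, email TEXT UNIQUE)")

    def save_many(self, emails):
        # Insert a whole batch in one transaction; return how many were new.
        before = self.connection.total_changes
        with self.connection:
            self.connection.executemany(
                "INSERT OR IGNORE INTO addresse VALUES (NULL, ?)",
                [(email,) for email in emails])
        return self.connection.total_changes - before

    def close(self):
        self.connection.close()
```

Note that a single sqlite3 connection should not be shared across threads by default, so with this layout each worker (thread or process) would need its own EmailStore.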

【Comments】:

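...and even most of the people who don't detest email harvesting will resent you, because they do it for money themselves and don't want competition for the handful of small companies that can be duped into wasting money on spam sent in their name...
Thank you very much for your answer. I now use multiprocessing instead of threading, and it improved performance quite a bit. I read that Python's threads should be avoided. I will try the technique Phil H describes under point 5.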