Export information from tons of web pages

【Question title】: Export information from tons of web pages
【Posted】: 2015-12-16 16:56:48
【Question description】:

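I have a Python 3 script that uses the libraries urllib.request and BeautifulSoup to load the content of websites and export information from it to CSV files or a MySQL database. Here are the main lines of code from the script: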

# ... 

url = urllib.request.urlopen("<urls here>")
html = url.read()
url.close()
soup = BeautifulSoup(html, "html.parser")
# Create lists for html elements
nadpis = soup.find_all("span", class_="nadpis")     
# Some more soups here...

onpage = len(no) # No. of elements on page
for i in range(onpage):
    nadpis[i] = one_column(nadpis[i].string)
    # Some more soups here

if csv_export:
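    # Python 3's csv module needs a text-mode file; newline="" leaves line endings to the writer
    with open("export/" + category[c][0] + ".csv", "a", newline="") as csv_file: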
        wr = csv.writer(csv_file, delimiter=';', quotechar='|', quoting=csv.QUOTE_MINIMAL, lineterminator='\n') 
        wr.writerow("<informations from soup>")

# Insert to database
if db_insert:
    try:        
        cursor.execute("<informations from soup>")
        conn.commit()
    except Exception:
        print("Some MySQL error...")
        break

# ...

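The full script is 200 lines of code, so I won't spam it here. Everything works fine. The problem is that I need to scan and export information from tons of web pages (everything is in a while loop, but that is not important right now), and it becomes really slow (running times of several hours).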

Is there a faster way to do this?

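I implemented multiprocessing so that I can use every CPU core, but exporting everything would still take about 24 hours. I even tested it on an Amazon EC2 server and it was no faster, so the problem is not a slow PC or a slow internet connection.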

【Question comments】:

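There are many ways to improve performance, but there is not enough information in your question. You should identify the likely bottlenecks (remote server, bandwidth, latency, CPU, disk, etc.) and see whether you can meet your performance goals.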

Tags: python python-3.x beautifulsoup urllib

【Solution 1】:

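If you are running into performance problems, I suggest you start by profiling your code. That will give you a very detailed picture of where your code spends most of its time. You can also measure how long the script takes to scrape each web page; you may find that some pages take longer to load than others, which would suggest that you are limited not by bandwidth but by the server you are trying to reach.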
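
As a rough illustration of both suggestions, here is a minimal sketch using only the standard library. The helper names `time_pages` and `profile_call` are made up for this example, and `fetch` stands for whatever function in your script downloads and parses a single page.

```python
import cProfile
import io
import pstats
import time


def time_pages(urls, fetch):
    """Call fetch(url) for each URL and return (url, seconds) pairs, slowest first."""
    timings = []
    for url in urls:
        start = time.perf_counter()
        fetch(url)
        timings.append((url, time.perf_counter() - start))
    return sorted(timings, key=lambda pair: pair[1], reverse=True)


def profile_call(func, *args, top=10):
    """Run func(*args) under cProfile and return the `top` costliest entries as text."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()
```

Running `time_pages` on a small sample of URLs shows whether a few slow pages dominate the run, and `print(profile_call(time_pages, sample_urls, fetch))` shows whether the time goes into network calls or into parsing.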

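However, what do you mean by "tons of web pages"? If your script is reasonably optimized and you are using all CPU cores, it may simply be that you have too many web pages to scrape for it to run at the speed you want (by the way, how fast do you want it to be?).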

【Comments】:

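Well, as fast as possible. I read about this tool called ZMap: pastebin.com/Ah2v2fiW, which works perfectly, so I was wondering whether there is some library for this by now. By "tons of web pages" I mean 200k individual pages. I will look into profiling, thanks for the tip.
ZMap looks like a tool for scanning addresses, and it does not seem suited to scraping (but I may be wrong). Also, I suggest you separate the scraping process from the writing process (to CSV or the database). Since you are already parallelizing, you can have one dedicated worker output your results, which frees up time for the other workers to scrape your content.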
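
One possible layout for that dedicated writer is sketched below. It uses threads and `queue.Queue` to keep the example short; with multiprocessing the same shape should work with `multiprocessing.Queue`. The names `scrape_all`, `writer_loop` and `scrape_page` are hypothetical, and `scrape_page(url)` is assumed to return one row (a list of values) per page.

```python
import csv
import queue
import threading

STOP = object()  # sentinel telling the writer that no more rows are coming


def writer_loop(rows, path):
    """Drain rows from the queue into one CSV file until STOP arrives."""
    with open(path, "a", newline="") as csv_file:
        out = csv.writer(csv_file, delimiter=";", quotechar="|",
                         quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
        while True:
            row = rows.get()
            if row is STOP:
                break
            out.writerow(row)


def scrape_all(urls, scrape_page, path, workers=4):
    """Scrape urls with several threads; only the writer thread touches the file."""
    rows = queue.Queue()
    pending = queue.Queue()
    for url in urls:
        pending.put(url)

    def worker():
        # Each worker pulls URLs until none are left and hands rows to the writer.
        while True:
            try:
                url = pending.get_nowait()
            except queue.Empty:
                return
            rows.put(scrape_page(url))

    writer = threading.Thread(target=writer_loop, args=(rows, path))
    writer.start()
    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    rows.put(STOP)  # all scrapers are done, so let the writer finish
    writer.join()
```

Because only one thread writes, the scrapers never wait on file or database I/O and rows from different pages cannot interleave. Rows arrive in completion order, not in URL order.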

【Solution 2】:

I would recommend simple-requests, for example:

from bs4 import BeautifulSoup
from simple_requests import Requests

# Creates a session and thread pool
requests = Requests(concurrent=2, minSecondsBetweenRequests=0.15)

# Cookies are maintained in this instance of Requests, so subsequent requests
# will still be logged-in.
urls = [
    'http://www.url1.com/',
    'http://www.url2.com/',
    'http://www.url3.com/' ]

# Asynchronously send all the requests for profile pages
for url_response in requests.swarm(urls):
    html = url_response.html
    soup = BeautifulSoup(html, "html.parser")

    # Some more soups here...

    # Write out your file...
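
If you would rather not add a dependency, the same idea (many downloads in flight at once, because the work is mostly waiting on the network) can be sketched with the standard library. This is not the simple-requests API; `download` and `fetch_all` are names invented for this example, and the worker count of 20 is an arbitrary starting point.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def download(url, timeout=10):
    """Fetch one page and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch=download, workers=20):
    """Fetch urls concurrently; returns (url, body or exception) pairs in input order."""
    def safe_fetch(url):
        # Keep a failure as a value so one bad page does not abort the whole run.
        try:
            return url, fetch(url)
        except Exception as error:
            return url, error

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(safe_fetch, urls))
```

Each body returned by `fetch_all(urls)` can then be passed to `BeautifulSoup(body, "html.parser")` as in the snippet above. Unlike simple-requests, this sketch does no rate limiting, so keep `workers` modest to avoid overloading the target server.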

【Comments】: