Big time difference in parsing two similar HTML files: answers

【Question title】: Big time difference in parsing two similar html files
【Posted】: 2018-01-20 21:50:03
【Question】:

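I have two search results from a web service, saved as HTML, which I have to parse with BeautifulSoup to extract some data. I noticed that one of them takes approximately 35 times longer to parse than the other.

Does anyone have an explanation for this, or know what I can do to improve performance on the slower HTML file?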

Setup:

Python 2.7.13
Jupyter Notebook 4.3.1
beautifulsoup4 (4.5.3)
lxml (3.8.0)

Code:

from bs4 import BeautifulSoup

path = "path to the files"
file_1 = "slow.html"
file_2 = "fast.html"

with open(path+file_1) as rfile_1:
    html_1 = rfile_1.read()
with open(path+file_2) as rfile_2:
    html_2 = rfile_2.read()

%timeit soup = BeautifulSoup(html_1, 'lxml')
>> 1 loop, best of 3: 4.67 s per loop
%timeit soup = BeautifulSoup(html_2, 'lxml')
>> 10 loops, best of 3: 136 ms per loop

【Comments】:

Tags: python python-2.7 beautifulsoup

【Answer 1】:

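When I time BeautifulSoup on your two HTML files, I get the opposite of your result: the "fast" file takes roughly twice as long as the "slow" one. I don't know why that would be.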

>>> timeit.timeit("import bs4;HTML = open('slow.html').read();bs4.BeautifulSoup(HTML, 'lxml')", number=1000)
83.10731378142236
>>> timeit.timeit("import bs4;HTML = open('fast.html').read();bs4.BeautifulSoup(HTML, 'lxml')", number=1000)
147.65896100030727
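Note that the statements timed above also include the import and the file read on every iteration. A minimal sketch of one way to time only the parse step is shown below; it assumes Python 3, and `best_parse_time` is a hypothetical helper name, not part of bs4 or timeit.

```python
import timeit


def best_parse_time(parse, html, repeat=3, number=5):
    """Return the best average time in seconds for one parse(html) call.

    Only the parse call is timed; importing the parser and reading the
    file are expected to happen before this function is called.
    """
    totals = timeit.repeat(lambda: parse(html), repeat=repeat, number=number)
    return min(totals) / number


# Hypothetical usage with the files from the question (requires bs4 and lxml):
#
#   from bs4 import BeautifulSoup
#   with open("slow.html") as f:
#       html = f.read()
#   print(best_parse_time(lambda h: BeautifulSoup(h, "lxml"), html))
```

Taking the minimum over several repeats is the same idea as the "best of 3" figure reported by `%timeit`, so numbers obtained this way should be roughly comparable to those in the question.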

If parsing time matters, I suggest using scrapy. For each of your files it delivers a result in roughly a quarter of the time.

>>> timeit.timeit("from scrapy.selector import Selector;HTML = open('slow.html').read();Selector(text=HTML)", number=1000)
21.85675587779292
>>> timeit.timeit("from scrapy.selector import Selector;HTML = open('fast.html').read();Selector(text=HTML)", number=1000)
39.938533099930055

【Discussion】:

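slow.html is about half the size of fast.html, so your results make sense. Since you don't see the same 35x difference, can I assume there is a problem with my python/packages installation? With scrapy I get the same results as you. Thanks for the tip.
The short answer to that is: I don't know. With Python, Jupyter and everything else in the mix, who can say? Just getting meaningful timings is a headache to begin with.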
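To check the size comparison mentioned here, and to see whether the two files differ in structure as well as in length, something like the following could be used. It is only a sketch under stated assumptions: it uses Python 3's standard-library `html.parser` rather than the Python 2.7 setup from the question, and `TagCounter` and `summarize` are hypothetical names introduced for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags by name while streaming through a document."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1


def summarize(html):
    """Return (length in characters, total start tags, three most common tags)."""
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return len(html), sum(counter.tags.values()), counter.tags.most_common(3)


# Hypothetical usage with the files from the question:
#
#   for name in ("slow.html", "fast.html"):
#       with open(name) as f:
#           print(name, summarize(f.read()))
```

If one file turns out to contain far more elements than its size suggests, that would be one place to start looking, although it would not by itself explain a 35x gap.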