Answers: Parallelizing a for loop that uses BeautifulSoup in Python

[Question title]: Parallelizing a for loop that uses BeautifulSoup in Python
[Posted]: 2020-07-11 05:58:37
[Question description]:

I am trying to optimize the following loop:

all_a = []

for i in range(0, len(final_all)):
    soup = BeautifulSoup(final_all[i], 'html.parser')
    for t in soup.select('table[width="100%"]'):
        t.extract()
        for row in soup.select('tr'):
            name = row.get_text(strip=True, separator=' ').split('—', maxsplit=1)
            if name not in all_a:
                all_a.append(name)

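where final_all is a list of 30,000 HTML documents that look like the .html from this question.

Parsing a single HTML document takes less than a second.

I am wondering whether there is a clever way to combine the two loops that use soup.select() into a single loop. I also tried using sets, without success.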
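A possible reason the set attempt fails: str.split() returns a list, and lists are unhashable, so they cannot be stored in a set. The sketch below is only an illustration under that assumption. The helper name dedupe_rows and the sample rows are made up, and it assumes the order of first appearance should be preserved. It converts each row to a tuple for the membership check, which replaces the linear scan of `name not in all_a` with a constant-time set lookup.

```python
def dedupe_rows(rows):
    """Return rows without duplicates, keeping first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # lists are unhashable; tuples can be set members
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up rows in the shape produced by .split('—', maxsplit=1)
rows = [["Ann ", " CEO"], ["Bob ", " CFO"], ["Ann ", " CEO"]]
print(dedupe_rows(rows))  # [['Ann ', ' CEO'], ['Bob ', ' CFO']]
```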

I also tried multiprocessing with only 30 observations, but I am clearly making a mistake:

%%time
all_a = [] 

def worker(data):
    for i in range(0, len(data)):
        start = time.time()
        soup = BeautifulSoup(data[i], 'html.parser')
        for t in soup.select('table[width="100%"]'):
            t.extract()
            for row in soup.select('tr'):
                name = row.get_text(strip=True, separator=' ').split('—', maxsplit=1)
                if name not in all_a:
                    all_a.append(name)

test = final_all[0:30]

if __name__ == '__main__':   
    pool = mp.Pool(8) # os.cpu_count*2  
    start = time.time()
    final = worker(test)


CPU times: user 1min 50s, sys: 2.91 s, total: 1min 53s
Wall time: 1min 48s

Compared with when I do not use multiprocessing:

CPU times: user 1min 39s, sys: 1.78 s, total: 1min 41s
Wall time: 1min 39s
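In the attempt above, the pool is created but worker(test) is called directly in the parent process, so the documents are still parsed one after another, which is consistent with the nearly identical timings. A minimal sketch of handing the work to the pool follows. The names parse_document and parse_all, the chunksize value, and the sample HTML are assumptions made for illustration. Each worker returns its rows instead of appending to a global list, because a global list is not shared between processes. On platforms that start workers with spawn (and in some notebook setups), the worker function may need to live in an importable module.

```python
import multiprocessing as mp

from bs4 import BeautifulSoup


def parse_document(html):
    """Extract the rows of one document, keeping the original loop nesting."""
    names = []
    soup = BeautifulSoup(html, "html.parser")
    for t in soup.select('table[width="100%"]'):
        t.extract()
        for row in soup.select("tr"):
            name = row.get_text(strip=True, separator=" ").split("—", maxsplit=1)
            names.append(tuple(name))  # tuples are hashable, lists are not
    return names


def parse_all(documents, processes=4):
    """Parse documents in worker processes and merge the unique rows."""
    unique = {}
    with mp.Pool(processes) as pool:
        # imap sends documents to the workers and yields results in input order
        for names in pool.imap(parse_document, documents, chunksize=16):
            for name in names:
                unique.setdefault(name, None)  # dict keeps first-seen order
    return list(unique)


if __name__ == "__main__":
    # Made-up sample document, repeated to give the pool some work
    sample = (
        '<table width="100%"><tr><td>header</td></tr></table>'
        "<table><tr><td>Ann — CEO</td></tr></table>"
    )
    print(parse_all([sample] * 8, processes=2))
```

Whether this is faster in practice depends on the number of cores and on the cost of sending each document to a worker, so it is worth timing on a small slice such as final_all[0:30] first.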

[Question discussion]:

This sounds like a good fit for the multiprocessing module.

Tags: multithreading beautifulsoup multiprocessing html-parsing python-multiprocessing

[Solution 1]:

Try this.

from bs4 import BeautifulSoup

all_a = []
for html in final_all:
    soup = BeautifulSoup(html, 'html.parser')
    # Remove every matching table first
    for t in soup.select('table[width="100%"]'):
        t.extract()
    for row in soup.select('tr'): # Out of the loop above
        name = row.get_text(strip=True, separator=' ').split('—', maxsplit=1)
        all_a.append(tuple(name))  # tuples are hashable; a list would make set() raise TypeError
all_a = list(set(all_a))  # deduplicate (order is not preserved)

[Comments]:

Unfortunately, this does not return all of the observations.
@Antarqui If the data volume is large, I suggest storing it in a database and replicating it there.
Thanks, but the problem is not about storage. The third loop needs to be nested inside the second loop.
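To illustrate that last comment: the nested and flattened loops are not equivalent, which may be why the flattened version returns fewer rows. The sketch below uses hypothetical helpers rows_nested and rows_flat and made-up HTML. In the nested form, rows are collected after each single table is removed, so rows inside a table that has not been removed yet are still picked up. In the flattened form, all matching tables are gone before any row is read.

```python
from bs4 import BeautifulSoup


def rows_nested(html):
    """Collect rows after each table removal, as in the question."""
    rows = []
    soup = BeautifulSoup(html, "html.parser")
    for t in soup.select('table[width="100%"]'):
        t.extract()
        for row in soup.select("tr"):
            rows.append(row.get_text(strip=True, separator=" "))
    return rows


def rows_flat(html):
    """Collect rows once, after all matching tables are removed."""
    rows = []
    soup = BeautifulSoup(html, "html.parser")
    for t in soup.select('table[width="100%"]'):
        t.extract()
    for row in soup.select("tr"):
        rows.append(row.get_text(strip=True, separator=" "))
    return rows


# Made-up document with two matching tables
two_tables = (
    '<table width="100%"><tr><td>A</td></tr></table>'
    '<table width="100%"><tr><td>B</td></tr></table>'
)
print(rows_nested(two_tables))  # ['B'], read before the second table is removed
print(rows_flat(two_tables))    # [], both tables are gone before rows are read
```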