How do I use multiprocessing to extract links from webpages with Beautiful Soup? (answers)

【Question title】: How do I use multiprocessing to extract links from webpages with Beautiful Soup?
【Posted】: 2015-08-26 04:26:49
【Question】:

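I have a list of links. For each one I create a Beautiful Soup object and scrape all the links found inside the page's paragraph tags. Because there are hundreds of links to process, a single process takes longer than I would like, so multiprocessing seems like the ideal solution.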

Here is my code:

import requests
from bs4 import BeautifulSoup
from multiprocessing import Process, Queue

urls = ['https://hbr.org/2011/05/the-case-for-executive-assistants','https://signalvnoise.com/posts/3450-when-culture-turns-into-policy']

def collect_links(urls):

    extracted_urls = []
    bsoup_objects = []
    p_tags = [] #store language between paragraph tags in each beautiful soup object

    workers = 4 
    processes = [] 
    links = Queue() #store links extracted from urls variable
    web_connection = Queue() #store beautiful soup objects that are created for each url in urls variable 

    #dump each url from urls variable into links Queue for all processes to use
    for url in urls:
        links.put(url)

    for w in xrange(workers):
        p = Process(target = create_bsoup_object, args = (links, web_connection)) 
        p.start()
        processes.append(p)
        links.put('STOP')
        for p in processes:
            p.join()
        web_connection.put('STOP')

    for beaut_soup_object in iter(web_connection.get, 'STOP'):
        p_tags.append(beaut_soup_object.find_all('p'))
    for paragraphs in p_tags:
        bsoup_objects.append(BeautifulSoup(str(paragraphs)))
    for beautiful_soup_object in bsoup_objects:
        for link_tag in beautiful_soup_object.find_all('a'):
            extracted_urls.append(link_tag.get('href'))
    return extracted_urls

def create_bsoup_object(links, web_connection):

    for link in iter(links.get, 'STOP'):
        try:
            web_connection.put(BeautifulSoup(requests.get(link, timeout=3.05).content))
        except requests.exceptions.Timeout as e:
            #client couldn't connect to server or return data in time period specified in timeout parameter in requests.get()
            pass  
        except requests.exceptions.ConnectionError as e:
            #in case of faulty url
            pass           
        except Exception, err:
            #catch regular errors
            print(traceback.format_exc())
            pass
        except requests.exceptions.HTTPError as e:
            pass
    return True

When I run collect_links(urls), I get an empty list instead of a list of links, along with the following error:

Traceback (most recent call last):
  File "/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/queues.py", line 266, in _feed
    send(obj)
RuntimeError: maximum recursion depth exceeded while calling a Python object
[]

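I'm not sure what that refers to. I read somewhere that Queues work best with simple objects. Does the size of the Beautiful Soup objects I'm storing in them have something to do with this? I would appreciate any insight.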

【Comments】:

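Remove the except … pass clauses, which throw away useful information, and put that information in this question. Never use except …: pass; it is always wrong.
@msw What would you suggest instead? I use it when retrieving data from HTML, and sometimes the data isn't there because the pages differ from one another by a few percent. I end up with a lot of except and pass.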

Tags: python beautifulsoup python-multiprocessing

【Solution 1】:

Objects that you put on the queue must be picklable. For example:

import pickle
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('http://httpbin.org').text)
print type(soup)
p = pickle.dumps(soup)

This code raises RuntimeError: maximum recursion depth exceeded while calling a Python object.
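To find out whether an object is safe to hand to a Queue before any worker runs, you can try pickling it up front. The sketch below is written for Python 3 (the code above is Python 2). `is_picklable` and `nested_list` are hypothetical helper names, not part of multiprocessing or Beautiful Soup, and the deeply nested list is only a stand-in for a deep parse tree, on the assumption that deep nesting is what exhausts the recursion limit in the soup case too.

```python
import pickle


def is_picklable(obj):
    """Return True if obj survives pickle.dumps(), which a multiprocessing Queue relies on."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, TypeError, AttributeError, RecursionError):
        # RecursionError is the Python 3 name for the
        # "maximum recursion depth exceeded" failure seen in the question.
        return False
    return True


def nested_list(depth):
    """Build [[[...]]] nested `depth` levels deep (illustrative stand-in for a deep tree)."""
    node = []
    for _ in range(depth):
        node = [node]
    return node
```

A plain HTML string passes this check, while `nested_list(100000)` does not. Note that the put() call itself does not raise: the traceback in the question comes from `_feed`, the queue's background feeder, so the object that failed to pickle is simply never delivered and the consumer sees nothing.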

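Instead, you can put the actual HTML text on the queue and pass it through BeautifulSoup in the main process. This should still improve performance, because your application is likely I/O bound due to its network component.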

In create_bsoup_object(), do this:

web_connection.put(requests.get(link, timeout=3.05).text)

This puts the HTML, rather than a BeautifulSoup object, on the queue. Then parse the HTML in the main process.
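Put together as a complete worker, that change might look like the following. This is a minimal Python 3 sketch, not the asker's final code: `fetch_with_requests`, `html_worker` and `collect_html` are made-up names, every worker gets its own 'STOP' sentinel, and failures are reported as strings instead of being swallowed. It also drains the results queue before joining, since the multiprocessing documentation warns that joining a process that still has items buffered on a queue can deadlock.

```python
import multiprocessing as mp

STOP = "STOP"  # same sentinel string the question uses


def fetch_with_requests(url):
    """Download one page and return its HTML as a plain (picklable) string."""
    import requests  # third-party; imported here so only the fetching code needs it
    return requests.get(url, timeout=3.05).text


def html_worker(links, pages, fetch=fetch_with_requests):
    """Read URLs until STOP; put (url, html, error) tuples of plain values on pages."""
    for url in iter(links.get, STOP):
        try:
            pages.put((url, fetch(url), None))
        except Exception as exc:
            # Pass the failure along as text instead of discarding it.
            pages.put((url, None, repr(exc)))
    pages.put(STOP)  # tell the parent that this worker has finished


def collect_html(urls, workers=4, fetch=fetch_with_requests):
    """Fetch all urls in child processes and return the (url, html, error) tuples."""
    links, pages = mp.Queue(), mp.Queue()
    for url in urls:
        links.put(url)
    for _ in range(workers):
        links.put(STOP)  # one sentinel per worker

    procs = [mp.Process(target=html_worker, args=(links, pages, fetch))
             for _ in range(workers)]
    for proc in procs:
        proc.start()

    results, finished = [], 0
    while finished < workers:  # drain the queue BEFORE joining
        item = pages.get()
        if item == STOP:
            finished += 1
        else:
            results.append(item)
    for proc in procs:
        proc.join()
    return results
```

In the main process, each returned html string can then be parsed with `BeautifulSoup(html, "html.parser")` and searched for paragraph links the same way the original collect_links() did with the soup objects.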

Alternatively, parse the pages and extract the URLs in the child processes, and put extracted_urls on the queue.
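This second option moves the parsing into the workers as well, so that only lists of href strings cross the process boundary. Again this is a Python 3 sketch under assumptions: the helper names are invented, "html.parser" is picked as the parser, and the CSS selector `p a[href]` is one reading of "links inside paragraph tags" (it skips anchors without an href, whereas the original code would append None for those).

```python
import multiprocessing as mp

STOP = "STOP"  # same sentinel string the question uses


def fetch_html(url):
    """Download one page and return its HTML text."""
    import requests  # third-party; only the worker processes need it
    return requests.get(url, timeout=3.05).text


def paragraph_links(html):
    """Return the href of every <a> found inside a <p>, as plain strings."""
    from bs4 import BeautifulSoup  # third-party; the soup never leaves this function
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.select("p a[href]")]


def link_worker(links, found, fetch=fetch_html, extract=paragraph_links):
    """Read URLs until STOP; put one list of extracted hrefs per page on found."""
    for url in iter(links.get, STOP):
        try:
            found.put(extract(fetch(url)))
        except Exception as exc:
            # Report the problem rather than silently passing.
            print("skipping %s: %r" % (url, exc))
    found.put(STOP)  # signal that this worker is done


def collect_links(urls, workers=4):
    """Extract paragraph links from all urls using `workers` child processes."""
    links, found = mp.Queue(), mp.Queue()
    for url in urls:
        links.put(url)
    for _ in range(workers):
        links.put(STOP)  # one sentinel per worker

    procs = [mp.Process(target=link_worker, args=(links, found))
             for _ in range(workers)]
    for proc in procs:
        proc.start()

    extracted_urls, remaining = [], workers
    while remaining:  # drain results before joining the workers
        item = found.get()
        if item == STOP:
            remaining -= 1
        else:
            extracted_urls.extend(item)
    for proc in procs:
        proc.join()
    return extracted_urls
```

With this layout the parent never touches a BeautifulSoup object, so the pickling problem does not arise, and the parsing work is spread across the workers too.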
