How can I make BeautifulSoup faster at getting all elements when scraping a web page?

【Question title】: How can I make BeautifulSoup faster at getting all elements when scraping a web page?
【Posted】: 2021-05-17 17:22:53
【Question】:

How can I make my BeautifulSoup scraper faster? This code seems slow. Is there a way to speed it up?

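    import time

    import requests
    from bs4 import BeautifulSoup, SoupStrainer


    def getNews():
        tic = time.perf_counter()
        # Note: this session is created but never used; the request below calls requests.get directly.
        requests_session = requests.Session()
        scrapy = requests.get('https://www.marketwatch.com/markets?mod=top_nav').content
        # Note: this strainer is never passed to BeautifulSoup, so it has no effect on parsing.
        product = SoupStrainer('div', {'id': 'collection__elements j-scrollElement'})
        soup = BeautifulSoup(scrapy, 'lxml')
        # Walk container -> article -> headline -> link, printing each link's text and URL.
        for container in soup.find_all('div', attrs={'class': 'collection__elements j-scrollElement'}):
            for article in container.find_all('div', attrs={'class': 'article__content'}):
                for headline in article.find_all('h3', attrs={'class': 'article__headline'}):
                    for a in headline.find_all('a', href=True):
                        if a.text:
                            print(a.text)
                            print(a['href'])
        toc = time.perf_counter()
        print(toc - tic)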

【Comments】:

"This code seems slow." But is it? Please define "slow".
It takes too long to execute.
Define "too long".

Tags: python python-3.x beautifulsoup request lxml

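【Solution 1】:

Unless there is more to the story, your code is just as fast as the option below, although mine finds more articles and stories. That may or may not matter to you. I ran this on a modern Windows laptop over an ordinary internet connection.

What timings are you seeing that make you think this is not fast? Or what do you think they should be? For me it runs in about a third of a second.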

    %%timeit
    # Assumes requests and BeautifulSoup (from bs4) were imported in an earlier cell.
    requests_session = requests.Session()  # created but unused, as in the question
    scrapy = requests.get('https://www.marketwatch.com/markets?mod=top_nav').content
    soup = BeautifulSoup(scrapy, 'lxml')
    for container in soup.find_all('div', attrs={'class': 'collection__elements j-scrollElement'}):
        for article in container.find_all('div', attrs={'class': 'article__content'}):
            for headline in article.find_all('h3', attrs={'class': 'article__headline'}):
                for a in headline.find_all('a', href=True):
                    if a.text:
                        # print(a.text)
                        print(a['href'])

    # 318 ms ± 55.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

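    %%timeit
    # Assumes re, requests and BeautifulSoup (from bs4) were imported in an earlier cell.
    requests_session = requests.Session()  # created but unused, as in the question
    scrapy = requests.get('https://www.marketwatch.com/markets?mod=top_nav').content
    soup = BeautifulSoup(scrapy, 'lxml')
    # Match every <a class="link"> whose href contains "articles" or "story".
    for link in soup.find_all('a', class_='link', href=re.compile('articles|story')):
        print(link.get('href'))

    # 317 ms ± 58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)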
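
If parsing does turn out to matter, the SoupStrainer that the question defines but never uses could be passed to BeautifulSoup as parse_only, so that only part of the document is turned into a tree. What follows is a minimal sketch rather than a measured improvement. The function name extract_headlines and its parser parameter are hypothetical, the class name is taken from the question and assumed to still match the page, and the function works on HTML that has already been downloaded, so it cannot shorten the network request itself.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Keep only <h3> subtrees while parsing; everything else is skipped.
# Assumption: headlines live in <h3 class="article__headline"> as in the question.
ONLY_H3 = SoupStrainer("h3")


def extract_headlines(html, parser="lxml"):
    """Return (text, href) pairs for each linked headline found in html."""
    soup = BeautifulSoup(html, parser, parse_only=ONLY_H3)
    results = []
    for headline in soup.find_all("h3", class_="article__headline"):
        for a in headline.find_all("a", href=True):
            text = a.get_text(strip=True)
            if text:  # skip links with no visible text
                results.append((text, a["href"]))
    return results
```

It could be called as extract_headlines(requests.get(url).content). Since the two timings above are nearly identical despite different search code, parsing may not be the dominant cost here, so it is worth timing the download and the parse separately before and after a change like this.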

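【Comments】:

OK, thanks. It takes 1.6 seconds for me. By the way, do you know how to insert all the results into an Html.table in Dash by iterating over them?
You're welcome. Please post that as a new question so that someone with experience can answer it. Give it a try first and paste the code you attempted.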