Python: (BeautifulSoup) How to limit text extracted from an HTML news article to only the news article

【Question title】: Python: (Beautifulsoup) How to limit extracted text from a html news article to only the news article.
【Posted】: 2017-01-27 14:00:36
【Question】:

I wrote this test code, which uses BeautifulSoup.

import urllib.request
from bs4 import BeautifulSoup

url = "http://www.dailymail.co.uk/news/article-3795511/Harry-Potter-sale-half-million-pound-house-Iconic-Privet-Drive-market-suburban-Berkshire-complete-cupboard-stairs-one-magical-boy.html"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "lxml")
# Prints the text of every <p> on the page, not only the article's.
for n in soup.find_all('p'):
    print(n.get_text())

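It works, but it also retrieves text that is not part of the news article, such as the publication time, the number of comments, the copyright notice, and so on.

I want it to retrieve only the text of the news article itself. How can I do that?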

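【Question comments】:

You have to look at the website and how it is built. Does the news text sit inside a particular class or a particular tag? You can then use BS4 to filter by tag together with class or id.
For this article, filter on the element that wraps the article body. This will not necessarily work on other websites, and sometimes not even on other articles from the same website, so you need to look at the HTML of each site.

Tags: python html beautifulsoup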

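【Solution 1】:

You may have much better luck with the newspaper library, which is focused on scraping articles.

If we are talking about BeautifulSoup only, one option that gets closer to the desired result and yields more relevant paragraphs is to look for them inside the div element that has the itemprop="articleBody" attribute: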

article_body = soup.find(itemprop="articleBody")
for p in article_body.find_all("p"):
    print(p.get_text())
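
One caveat with the snippet above: soup.find returns None when nothing matches, so the loop raises AttributeError on pages that lack this attribute. Below is a minimal sketch of a guarded version. The helper name extract_article_paragraphs and the inline sample HTML are made up for illustration, and "html.parser" is used only so the example runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; a real article would be fetched over HTTP.
SAMPLE_HTML = """
<html><body>
  <p>Published: 12:00, 1 January 2017</p>
  <div itemprop="articleBody">
    <p>First paragraph of the story.</p>
    <p>Second paragraph of the story.</p>
  </div>
  <p>Copyright notice</p>
</body></html>
"""


def extract_article_paragraphs(html):
    """Return the text of each <p> inside the itemprop="articleBody" element."""
    soup = BeautifulSoup(html, "html.parser")
    article_body = soup.find(itemprop="articleBody")
    if article_body is None:
        # No such element on this page: return nothing instead of crashing.
        return []
    return [p.get_text(strip=True) for p in article_body.find_all("p")]


paragraphs = extract_article_paragraphs(SAMPLE_HTML)
for text in paragraphs:
    print(text)
```

The publication line and the copyright line sit outside the container, so they are skipped.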

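【Comments】:

【Solution 2】:

You need to target something more specific than just the p tags. Try looking for a div class="article" or something similar, then grab only the paragraphs from there.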
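
A sketch of that idea using a CSS selector. The class name "article" is only the guess made in this answer, not something confirmed for any particular site, and the sample markup is invented.

```python
from bs4 import BeautifulSoup

# Invented markup; the class name "article" is a placeholder to adapt per site.
SAMPLE_HTML = """
<div class="header"><p>12 comments</p></div>
<div class="article main">
  <p>Body paragraph one.</p>
  <p>Body paragraph two.</p>
</div>
<div class="footer"><p>Copyright notice</p></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "div.article p" matches <p> elements nested anywhere inside a <div>
# whose class list contains "article".
article_paragraphs = [p.get_text(strip=True) for p in soup.select("div.article p")]
print(article_paragraphs)
```

The selector still matches when the div carries additional classes, as in the sample.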

【Comments】:

【Solution 3】:

To be more specific, you need to catch the div whose itemprop attribute is articleBody, so:

import urllib.request
from bs4 import BeautifulSoup

url = "http://www.dailymail.co.uk/news/article-3795511/Harry-Potter-sale-half-million-pound-house-Iconic-Privet-Drive-market-suburban-Berkshire-complete-cupboard-stairs-one-magical-boy.html"    
html = urllib.request.urlopen(url).read()  
soup = BeautifulSoup(html,"lxml")
for n in soup.find_all('div', attrs={'itemprop':"articleBody"}):
    print(n.get_text())

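Answers on SO are not only for you but also for people arriving from Google searches and the like. As you can see, attrs is a dictionary, so more attribute/value pairs can be passed if needed.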
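
To illustrate the point about attrs, here is a small sketch with two attribute/value pairs. The data-section attribute and the sample markup are hypothetical, chosen only to show that every pair in the dictionary has to match.

```python
from bs4 import BeautifulSoup

# Invented markup: two elements share the same itemprop value.
SAMPLE_HTML = """
<div itemprop="articleBody" data-section="summary"><p>Short teaser.</p></div>
<div itemprop="articleBody" data-section="full"><p>Full story text.</p></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Every key/value pair in attrs must match for an element to be returned.
matches = soup.find_all(
    "div",
    attrs={"itemprop": "articleBody", "data-section": "full"},
)
texts = [div.get_text(strip=True) for div in matches]
print(texts)
```

The dictionary form is also convenient for attribute names such as data-section, which cannot be written as Python keyword arguments.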

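【Comments】:

Your approach does not work for url='newsbeezer.com/brazil/…'. A more general approach is to look for the element that has the attribute "articleBody", then look for the div and p tags inside it.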
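
One possible reading of this comment, as a sketch: match the itemprop="articleBody" container whatever its tag name, prefer the p tags inside it, and fall back to the container's own text when it holds no p tags. The function name extract_article_text and both sample pages are invented, and the fallback rule is an assumption rather than something spelled out in the comment.

```python
from bs4 import BeautifulSoup


def extract_article_text(html):
    """Collect article text from any element carrying itemprop="articleBody"."""
    soup = BeautifulSoup(html, "html.parser")
    # Passing only attrs matches any tag name, not just <div>.
    containers = soup.find_all(attrs={"itemprop": "articleBody"})
    chunks = []
    for container in containers:
        paragraphs = container.find_all("p")
        if paragraphs:
            chunks.extend(p.get_text(strip=True) for p in paragraphs)
        else:
            # No <p> descendants: fall back to the container's own text.
            chunks.append(container.get_text(" ", strip=True))
    return chunks


# Invented pages: one uses a <div> with paragraphs, the other a <section>
# whose text sits in a nested <div> without any <p>.
DIV_PAGE = '<div itemprop="articleBody"><p>Para in div.</p></div>'
SECTION_PAGE = '<section itemprop="articleBody"><div>Text in a nested div.</div></section>'

print(extract_article_text(DIV_PAGE))
print(extract_article_text(SECTION_PAGE))
```

Sites that do not use the articleBody microdata at all would still need their HTML inspected by hand, as the question comments point out.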