Excluding unwanted results of findAll using BeautifulSoup

【Question title】: Excluding unwanted results of findAll using BeautifulSoup
【Posted】: 2013-10-21 12:07:47
【Question】:

Using BeautifulSoup, I am trying to scrape the text associated with this HTML hook:

<p class="review_comment">

So, using simple code like this,

from bs4 import BeautifulSoup

content = page.read()  # page is an already-opened response object
soup = BeautifulSoup(content, "html.parser")
results = soup.find_all("p", "review_comment")

I can happily parse the text that is here:

<p class="review_comment">
    This place is terrible!</p>

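The bad news is that roughly once in every 30 matches, soup.find_all also matches and grabs something I really don't want: a user's old review that they have since updated: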

<p class="review_comment">
    It's 1999, and I will always love this place…  
<a href="#" class="show-archived">Read more &raquo;</a></p>

In my attempts to exclude these old duplicate reviews, I have tried a hodgepodge of ideas.

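I have been trying to alter the arguments in my soup.find_all() call to specifically exclude any text that comes before the <a href="#" class="show-archived">Read more »</a> link.
I got lost in regular-expression-style matching, with no success.
I can't seem to take advantage of the class="show-archived" attribute.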

Any ideas would be greatly appreciated. Thanks in advance.

【Question comments】:

【Solution 1】:

Is this what you are looking for?

for p in soup.find_all("p", "review_comment"):
    if p.find(class_='show-archived'):
        continue
    # p is now a wanted p
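
For anyone who wants to try this end to end, here is a minimal, self-contained sketch of the same filter. The `SAMPLE_HTML` string and the `current_reviews` helper are made up for illustration (they are not from the original page), and the built-in "html.parser" is only one possible parser choice.

```python
from bs4 import BeautifulSoup

# Hypothetical sample built from the two snippets in the question.
SAMPLE_HTML = """
<p class="review_comment">
    This place is terrible!</p>
<p class="review_comment">
    It's 1999, and I will always love this place…
<a href="#" class="show-archived">Read more &raquo;</a></p>
"""


def current_reviews(html):
    """Return the text of each review_comment <p> that has no show-archived link."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        p.get_text(strip=True)
        for p in soup.find_all("p", "review_comment")
        # p.find() returns None when no descendant carries the class,
        # so archived reviews are skipped here.
        if p.find(class_="show-archived") is None
    ]


print(current_reviews(SAMPLE_HTML))
```

The check searches inside each matched paragraph rather than changing the find_all() arguments, which is probably why attempts to exclude the archived reviews through find_all() alone were getting stuck.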

【Comments】: