Parse only text within a div class python

【Question Title】: Parse only text within a div class python
【Posted】: 2017-02-02 12:42:19
【Question】:

What I want to do is read the page source, find the div with class "gsc_prf_il", and extract only the text inside that div, ignoring the href link. For example:

<div class="gsc_prf_il"><a href="/citations?view_op=view_org&hl=en&org=13784427342582529234">McGill University</a></div>

But when I use the following code, it fails with this error: AttributeError: 'NoneType' object has no attribute 'contents'

soup=BeautifulSoup(p.readlines()[0], 'html.parser')
s=soup.find(id='gsc_prf_il')
scholar_info['department']= s.contents

Then I tried this:

scholar_info['department']=[s.find('a')['href'], s.find('a').contents[0]]

That does not work either. What am I doing wrong?

【Comments】:

Tags: python python-2.7 beautifulsoup html-parsing

【Solution 1】:

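Just find the div and pull out its text. Your call soup.find(id='gsc_prf_il') looks for an element whose id is gsc_prf_il, not a div with that class. No such element exists, so find returns None, and accessing .contents on None raises the AttributeError. Given this markup: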

from bs4 import BeautifulSoup

# Parse the sample markup; naming the parser explicitly avoids bs4's "no parser specified" warning
soup = BeautifulSoup("""<div class="gsc_prf_il"><a href="/citations?view_op=view_org&hl=en&org=13784427342582529234">McGill University</a></div>""", "html.parser")

So search by class with class_="gsc_prf_il":

print(soup.find("div", class_="gsc_prf_il").text)  # McGill University

Or use a CSS selector:

print(soup.select_one("div.gsc_prf_il").text)  # McGill University
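
Putting the pieces together, here is a minimal sketch of the lookup with a guard for the case where the div is missing, which is what produced the original AttributeError. The helper name extract_div_text and the HTML constant are made up for illustration and are not part of BeautifulSoup; the sketch also assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

# Sample markup taken from the question.
HTML = (
    '<div class="gsc_prf_il">'
    '<a href="/citations?view_op=view_org&hl=en&org=13784427342582529234">'
    'McGill University</a></div>'
)


def extract_div_text(html, css_class):
    """Return the text of the first <div> with the given class, or None.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", class_=css_class)
    if div is None:
        # find() returns None when nothing matches, so check before using it.
        return None
    return div.get_text(strip=True)


soup = BeautifulSoup(HTML, "html.parser")
by_id = soup.find(id="gsc_prf_il")  # no element has this id, so this is None
department = extract_div_text(HTML, "gsc_prf_il")
missing = extract_div_text(HTML, "no_such_class")

print(by_id)       # None
print(department)  # McGill University
print(missing)     # None
```

Checking for None before touching .text or .contents keeps a page without the expected div from crashing the scraper; the caller can decide what to store in that case.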

【Comments】:

Works like a charm! Thank you very much!