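【Posted】: 2016-02-04 07:11:41
【Question】:
I'm confused about how to use a ResultSet object with BeautifulSoup, i.e. bs4.element.ResultSet.
After calling find_all(), how do I extract the text?
Example:
In the bs4 documentation, the HTML document html_doc looks like: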
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
; and they lived at the bottom of a well.
</p>
First, create the soup and find all the a tags:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
soup.find_all('a')
which outputs
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
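For context on what find_all() hands back: as far as I can tell, ResultSet is a subclass of Python's list whose items are Tag objects, so the usual list operations apply. A minimal sketch, assuming html_doc is trimmed to just the paragraph shown above (the variable names links and first are only for illustration):

```python
from bs4 import BeautifulSoup

# Trimmed copy of the documentation snippet (an assumption for this sketch).
html_doc = """
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")
links = soup.find_all("a")

# The container is a ResultSet, which behaves like a list.
print(type(links).__name__)
print(isinstance(links, list))
print(len(links))

# Each item inside it is an individual Tag.
first = links[0]
print(type(first).__name__)
print(first["id"])
```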
We can also do
for link in soup.find_all('a'):
    print(link.get('href'))
which outputs
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
I want to get only the text from the class_="sister" elements, i.e.
Elsie
Lacie
Tillie
One can try
for link in soup.find_all('a'):
    print(link.get_text())
but this results in an error:
AttributeError: 'ResultSet' object has no attribute 'get_text'
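This particular message appears to be what bs4 raises when get_text() is called on the ResultSet returned by find_all() rather than on an individual Tag. A loop that calls get_text() on each link should not trigger it. A minimal sketch of both cases, assuming a trimmed copy of html_doc (the names sisters and names are illustrative only):

```python
from bs4 import BeautifulSoup

# Trimmed copy of the documentation snippet (an assumption for this sketch).
html_doc = """
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")
sisters = soup.find_all("a", class_="sister")

# Calling get_text() on the whole ResultSet raises AttributeError.
try:
    sisters.get_text()
except AttributeError as exc:
    print("AttributeError:", exc)

# Calling get_text() on each Tag inside the ResultSet works.
names = [link.get_text(strip=True) for link in sisters]
print(names)  # ['Elsie', 'Lacie', 'Tillie']
```

Passing strip=True removes the surrounding whitespace, which matters if the anchors are written across several lines as in the snippet above.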
【Discussion】:
Tags: html beautifulsoup python-requests