一、BeautifulSoup解析库
1、快速开始
html_doc = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" >Elsie</a>, <a href="http://example.com/lacie" class="sister" >Lacie</a> and <a href="http://example.com/tillie" class="sister" >Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """
from bs4 import BeautifulSoup soup = BeautifulSoup(html_doc) print(soup.prettify()) # 结构化输出文档 print(soup.title) # 获取title标签 print(soup.title.name) # 获取title标签名称 print(soup.title.parent.name) print(soup.p['class'])
从文档中找到所有<a>标签的链接:
for link in soup.find_all('a'): print(link.get('href'))
从文档中获取所有文字内容:
print(soup.get_text())
2、标签选择器
#1、标签选择器:即直接通过标签名字选择,选择速度快,如果存在多个相同的标签则只返回第一个 #2、获取标签的名称 #3、获取标签的属性 #4、获取标签的内容 #5、嵌套选择 #6、子节点、子孙节点 #7、父节点、祖先节点 #8、兄弟节点
#1、标签选择器:即直接通过标签名字选择,选择速度快,如果存在多个相同的标签则只返回第一个 html_doc = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" >Elsie</a>, <a href="http://example.com/lacie" class="sister" >Lacie</a> and <a href="http://example.com/tillie" class="sister" >Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """ from bs4 import BeautifulSoup soup=BeautifulSoup(html_doc,'lxml') print(soup.head) print(type(soup.head)) #<class 'bs4.element.Tag'> print(soup.p) #存在多个相同的标签则只返回第一个 print(soup.a) #存在多个相同的标签则只返回第一个 #2、获取标签的名称 print(soup.p.name) #3、获取标签的属性 print(soup.p.attrs) #4、获取表的内容 print(soup.p.string) ''' 对下面的这种结构,soup.p.string 返回为None,因为里面有a <p id='list-1'> 哈哈哈哈 <a class='sss'> <span> <h1>aaaa</h1> </span> </a> <b>bbbbb</b> </p> ''' #5、嵌套选择 print(soup.head.title.string) print(soup.body.a.string) #6、子节点、子孙节点 html_doc = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title"> <b>The Dormouse's story</b> Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" > <span>Elsie</span> </a>, <a href="http://example.com/lacie" class="sister" >Lacie</a> and <a href="http://example.com/tillie" class="sister" >Tillie</a>; and they lived at the bottom of a well. </p> <p class="story">...</p> """ from bs4 import BeautifulSoup soup=BeautifulSoup(html_doc,'lxml') print(soup.p.contents) #p下所有子节点 print(soup.p.children) #得到一个迭代器,包含p下所有子节点 for i,child in enumerate(soup.p.children): print(i,child) print(soup.p.descendants) #获取子孙节点,p下所有的标签都会选择出来 for i,child in enumerate(soup.p.descendants): print(i,child) #7、父节点、祖先节点 print(soup.a.parent) #获取a标签的父节点 print(soup.a.parents) #找到a标签所有的祖先节点,父亲的父亲,父亲的父亲的父亲... #8、兄弟节点 print(soup.a.next_siblings) #得到生成器对象 print(soup.a.previous_siblings) #得到生成器对象