1 Understanding the BeautifulSoup library
1.1 BeautifulSoup is a library for parsing, traversing, and maintaining the "tag tree"
Basic structure of a tag:
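As a minimal sketch of that structure: a tag has a name, optional attributes inside the opening tag, and content between the opening and closing tags. The HTML fragment below is made up for illustration and is not taken from the demo page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, used only to show the parts of a tag.
html = '<p class="title" id="intro">Hello</p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p
print(tag.name)    # the tag's name, as written in <p>...</p>
print(tag.attrs)   # attributes: key/value pairs from the opening tag
print(tag.string)  # the content between the opening and closing tags
```

Note that `class` comes back as a list, because HTML allows several class names on one tag.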
1.2 BeautifulSoup parsers
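The parser is chosen by the second argument to `BeautifulSoup`. A small sketch, assuming only the built-in `html.parser` is available; the other parser names are shown as comments because they depend on packages that may not be installed. The HTML string is a placeholder.

```python
from bs4 import BeautifulSoup

# Placeholder document for illustration.
html = "<html><body><p>demo</p></body></html>"

# "html.parser" ships with Python, so it needs no extra install.
soup = BeautifulSoup(html, "html.parser")
print(soup.p.string)

# Other parser names bs4 accepts, each backed by a separate package:
#   BeautifulSoup(html, "lxml")      # requires: pip install lxml
#   BeautifulSoup(html, "xml")       # requires: pip install lxml
#   BeautifulSoup(html, "html5lib")  # requires: pip install html5lib
```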
1.3 Basic elements of the BeautifulSoup class
Basic usage:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
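# Get the first <a> tag
# Any tag can be reached as soup.<tag name>
print(soup.a)
# Get the tag's name
print(soup.a.name)
# Get the name of the <a> tag's parent
print(soup.a.parent.name)
# Get the name of the tag two levels above <a>
print(soup.a.parent.parent.name)
# Get the tag's attributes
print(soup.a.attrs)
print(soup.a.attrs['class'])
# Get the tag's non-attribute string (the text between the tags)
print(soup.a.string)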
Output:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
a
p
body
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
['py1']
Basic Python
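Besides tags, names, and attributes, `.string` can return two kinds of object: an ordinary `NavigableString` or a `Comment`. The sketch below uses a made-up fragment to show how to tell them apart by type, since printing a comment does not show the `<!-- -->` markers.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical fragment: one tag holds a comment, the other plain text.
html = "<b><!--a comment--></b><p>plain text</p>"
soup = BeautifulSoup(html, "html.parser")

print(type(soup.p))         # a Tag object
print(type(soup.p.string))  # text inside <p>: NavigableString
print(type(soup.b.string))  # comment inside <b>: Comment

# Comment is a subclass of NavigableString, so test for Comment first.
is_comment = isinstance(soup.b.string, Comment)
```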
1.4 Downward traversal of the tag tree:
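Downward traversal can be sketched with three attributes that bs4 provides: `.contents`, `.children`, and `.descendants`. The markup below is a made-up example, written without whitespace between tags so that the child lists contain only tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment for illustration.
html = "<body><p>one <b>bold</b></p><p>two</p></body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# .contents: a list of the direct children.
direct = [child.name for child in body.contents]

# .children: an iterator over the same direct children.
via_iter = [child.name for child in body.children]

# .descendants: every node below, depth-first; text nodes are included,
# so keep only Tag objects here.
all_tags = [node.name for node in body.descendants if isinstance(node, Tag)]

print(direct, via_iter, all_tags)
```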
1.5 Upward traversal of the tag tree:
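Upward traversal uses `.parent` for the immediate parent and `.parents` for the whole chain of ancestors. A minimal sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration.
html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")
a = soup.a

# .parent: the tag that directly contains <a>.
print(a.parent.name)

# .parents: iterate from the nearest ancestor up to the soup object itself,
# whose name is "[document]".
chain = [ancestor.name for ancestor in a.parents]
print(chain)

# The soup object is the root, so it has no parent.
print(soup.parent)
```

Because the root has no parent, code that walks upward by hand should check for `None` before reading `.name`.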
1.6 Sibling (parallel) traversal of the tag tree:
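Sibling traversal moves between nodes that share the same parent, using `.next_sibling`, `.previous_sibling`, and the iterators `.next_siblings` and `.previous_siblings`. The sketch below uses a made-up fragment and shows that a sibling may be a text node rather than a tag.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two tags separated by plain text.
html = "<p><a>first</a> and <b>second</b></p>"
soup = BeautifulSoup(html, "html.parser")
a = soup.a

# The node right after <a> is the text " and ", not the <b> tag.
text_node = a.next_sibling
print(repr(text_node))

# One more step reaches the <b> tag.
next_tag = a.next_sibling.next_sibling
print(next_tag.name)

# <a> is the first child of <p>, so nothing comes before it.
print(a.previous_sibling)

# .next_siblings iterates over everything after <a> under the same parent.
following = list(a.next_siblings)
```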
1.7 The prettify() function adds line breaks to tags and text
import requests
from bs4 import BeautifulS
soup = BeautifulSoup("<a>t
print(soup.prettify())
Output:
<a>
this is a example
</a>
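The listing above is cut off in the source. A self-contained sketch follows, assuming the markup string matches the printed output and that `html.parser` is the parser (neither is confirmed by the original):

```python
from bs4 import BeautifulSoup

# Assumed input, reconstructed from the output shown above.
soup = BeautifulSoup("<a>this is a example</a>", "html.parser")

# prettify() returns a string with each tag and text node on its own line,
# with nested content indented.
pretty = soup.prettify()
print(pretty)

# Collect the non-empty lines without their indentation.
lines = [line.strip() for line in pretty.splitlines() if line.strip()]
```

`prettify()` can also be called on a single tag, for example `soup.a.prettify()`.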