Using the BeautifulSoup Library

'''
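Basic elements of BeautifulSoup
BS4 is a library for parsing, traversing, and maintaining a "tag tree"
The BeautifulSoup class represents a tag tree
A BeautifulSoup object corresponds to the entire content of an HTML or XML document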
'''
from bs4 import BeautifulSoup
import requests
r = requests.get("http://python123.io/ws/demo.html")

soup = BeautifulSoup(r.text,'html.parser')
print(soup.title)
print(soup.a)
tag = soup.a
print(tag.attrs)
print(tag.attrs['class'])
'''
Output
<title>This is a python demo page</title>
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
['py1']
'''
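
The same basic elements (tag name, attributes, and the string inside a tag) can also be explored without a network request. The following is a minimal sketch, assuming an inline HTML string; the `html` variable is made up for illustration and only mirrors the first link of the demo page.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document standing in for the demo page
html = (
    '<p class="course">'
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

tag = soup.a                  # first <a> tag in the document
print(tag.name)               # name of the tag itself
print(tag.parent.name)        # name of the enclosing tag
print(tag.attrs["href"])      # attributes behave like a dict
print(tag.attrs["class"])     # class is multi-valued, so it is a list
print(tag.string)             # text between <a> and </a>
```

Note that `class` comes back as a list while `href` and `id` are plain strings, which matches the `tag.attrs` output shown above.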

Traversing the tag tree with BeautifulSoup

Downward traversal of the tag tree

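Attribute	Description
.contents	List of child nodes; all direct children of <tag> are stored in a list
.children	Iterator over child nodes; like .contents, used to loop over the direct children
.descendants	Iterator over all descendant nodes (children, grandchildren, and so on), used for looping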

Example

from bs4 import BeautifulSoup
import requests
r = requests.get("http://python123.io/ws/demo.html")
demo =  r.text
soup  = BeautifulSoup(demo,'html.parser')
print(soup.prettify())
print(soup.body.contents)
for child in soup.body.children:
    print(child)
for child in soup.body.descendants:
    print(child)
'''
Output
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']


<p class="title"><b>The demo python introduces several python courses.</b></p>


<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>




<p class="title"><b>The demo python introduces several python courses.</b></p>
<b>The demo python introduces several python courses.</b>
The demo python introduces several python courses.


<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:


<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
Basic Python
 and 
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
Advanced Python
.
'''

Upward traversal of the tag tree

Attribute	Description
.parent	The node's parent tag
.parents	Iterator over the node's ancestor tags, used to loop over all ancestors

from bs4 import BeautifulSoup
import requests
r = requests.get("http://python123.io/ws/demo.html")
demo =  r.text
soup  = BeautifulSoup(demo,'html.parser')
print(soup.prettify())
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
'''
Output
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
p
body
html
[document]
'''

Sibling traversal of the tag tree (note: only child nodes of the same parent are siblings of each other)

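Attribute	Description
.next_sibling	The next sibling node in HTML document order (this may be a string node, not only a tag)
.previous_sibling	The previous sibling node in HTML document order (this may be a string node, not only a tag)
.next_siblings	Iterator over all following sibling nodes in HTML document order
.previous_siblings	Iterator over all preceding sibling nodes in HTML document order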
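
The sibling attributes can be sketched as follows. This is an illustrative example on a made-up inline HTML string (the `html` variable is an assumption, not part of the demo page). It shows that the node right after a tag is often a text node, and uses `find_next_sibling` to skip straight to the next tag.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two links separated by text inside one <p>
html = '<p><a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.a
print(repr(first.next_sibling))           # the text node between the two links
print(first.next_sibling.next_sibling)    # the second <a> tag
print(first.previous_sibling)             # None: nothing precedes the first link in <p>
print(first.find_next_sibling("a"))       # next sibling that is an <a> tag

# Loop over every following sibling, tags and strings alike
for sibling in first.next_siblings:
    print(repr(sibling))
```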

Three forms of information markup

HTML can embed hypermedia content (sound, images, video, and so on) into text

XML (eXtensible Markup Language)

JSON (JavaScript Object Notation)

Information is expressed as key: value pairs

YAML (YAML Ain't Markup Language)

Information is expressed as untyped key-value pairs

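Markup language	Characteristics	Typical use
XML	The earliest general-purpose information markup language; highly extensible but verbose	Information exchange and transfer on the Internet
JSON	Typed information, well suited to processing by programs (JavaScript), more concise than XML	Communication between mobile apps, the cloud, and nodes; no comments
YAML	Untyped information, high proportion of textual content, good readability	Configuration files for all kinds of systems; supports comments and is easy to read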
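
To make the comparison concrete, here is a small sketch that writes one made-up record in XML and in JSON and parses both with the Python standard library. The record and the variable names are assumptions for illustration. The YAML form is shown only as text, because parsing it needs a third-party package.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record in three markup forms
xml_text = "<person><name>Ada</name><age>36</age></person>"
json_text = '{"name": "Ada", "age": 36}'
yaml_text = "name: Ada\nage: 36  # YAML allows comments\n"   # not parsed here

# XML: element text always arrives as a string
root = ET.fromstring(xml_text)
xml_record = {child.tag: child.text for child in root}

# JSON: values carry a type, so age stays a number
json_record = json.loads(json_text)

print(xml_record)
print(json_record)
```

The XML version repeats every field name in an opening and a closing tag, while the JSON version is shorter and keeps `age` as an integer.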

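Extracting key information by combining markup parsing with searching

Applies to XML, JSON, and YAML alike: parse the markup, then search it

This requires a markup parser and a text search function

Example: extract all link information from a document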

from bs4 import BeautifulSoup
import requests
r = requests.get("http://python123.io/ws/demo.html")
demo =  r.text
s  = BeautifulSoup(demo,'html.parser')
for link in s.find_all('a'):
    print(link.get('href'))
'''
Output
http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001
'''

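<>.find_all(name, attrs, recursive, string, **kwargs)

name: the tag name to search for; it can be a list of names, and passing True returns all tags

attrs: attribute values to match, e.g. soup.find_all('p', 'course'), soup.find_all(id='link1'), soup.find_all(id=re.compile('link'))

recursive: whether to search all descendants recursively; defaults to True

string: a search pattern matched against the string content between <>...</>, e.g. soup.find_all(string=re.compile('python'))

<tag>(...) is shorthand for <tag>.find_all(...)

soup(...) is shorthand for soup.find_all(...)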
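
The parameters above can be tried on a small inline document. This is a sketch under the assumption of a made-up `html` string that loosely follows the demo page; the variable names are illustrative only.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical document modeled on the demo page
html = (
    '<html><body>'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Courses: '
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all(["a", "b"])                    # name given as a list
by_class = soup.find_all("p", "course")                # second positional argument matches the class
by_id = soup.find_all(id=re.compile("link"))           # keyword filter with a regular expression
direct = soup.find_all("a", recursive=False)           # only direct children of soup are searched
by_text = soup.find_all(string=re.compile("Python"))   # matches string nodes, case-sensitive
shorthand = soup("a")                                  # same as soup.find_all("a")

print([t.name for t in by_name])
print([t["id"] for t in by_id])
print(direct)
print(by_text)
```

With `recursive=False` the result is empty, because the only direct child of `soup` is `<html>`; the `<a>` tags sit deeper in the tree.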

Example 1: a targeted crawler for Chinese university rankings

Version 1

import requests
from bs4 import BeautifulSoup
import bs4
 
def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:  # network errors and bad HTTP status codes
        return ""
 
def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])
 
def printUnivList(ulist, num):
    print("{:^10}\t{:^6}\t{:^10}".format("排名","学校名称","总分"))
    for i in range(num):
        u=ulist[i]
        print("{:^10}\t{:^6}\t{:^10}".format(u[0],u[1],u[2]))
     
def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20) # 20 univs
main()
'''
Output
    排名    	 学校名称 	    总分    
    1     	 清华大学 	   95.9   
    2     	 北京大学 	   82.6   
    3     	 浙江大学 	    80    
    4     	上海交通大学	   78.7   
    5     	 复旦大学 	   70.9   
    6     	 南京大学 	   66.1   
    7     	中国科学技术大学	   65.5   
    8     	哈尔滨工业大学	   63.5   
    9     	华中科技大学	   62.9   
    10    	 中山大学 	   62.1   
    11    	 东南大学 	   61.4   
    12    	 天津大学 	   60.8   
    13    	 同济大学 	   59.8   
    14    	北京航空航天大学	   59.6   
    15    	 四川大学 	   59.4   
    16    	 武汉大学 	   59.1   
    17    	西安交通大学	   58.9   
    18    	 南开大学 	   58.3   
    19    	大连理工大学	   56.9   
    20    	 山东大学 	   56.3   
'''

Version 2

import requests
from bs4 import BeautifulSoup
import bs4
 
def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:  # network errors and bad HTTP status codes
        return ""
 
def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])
 
def printUnivList(ulist, num):
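    # {3} is the fill character: chr(12288) is the full-width space (U+3000),
    # which keeps the Chinese school names aligned in the middle column
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"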
    print(tplt.format("排名","学校名称","总分",chr(12288)))
    for i in range(num):
        u=ulist[i]
        print(tplt.format(u[0],u[1],u[2],chr(12288)))
     
def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20) # 20 univs
main()
'''
Output
    排名    	　　　学校名称　　　	    总分    
    1     	　　　清华大学　　　	   95.9   
    2     	　　　北京大学　　　	   82.6   
    3     	　　　浙江大学　　　	    80    
    4     	　　上海交通大学　　	   78.7   
    5     	　　　复旦大学　　　	   70.9   
    6     	　　　南京大学　　　	   66.1   
    7     	　中国科学技术大学　	   65.5   
    8     	　哈尔滨工业大学　　	   63.5   
    9     	　　华中科技大学　　	   62.9   
    10    	　　　中山大学　　　	   62.1   
    11    	　　　东南大学　　　	   61.4   
    12    	　　　天津大学　　　	   60.8   
    13    	　　　同济大学　　　	   59.8   
    14    	　北京航空航天大学　	   59.6   
    15    	　　　四川大学　　　	   59.4   
    16    	　　　武汉大学　　　	   59.1   
    17    	　　西安交通大学　　	   58.9   
    18    	　　　南开大学　　　	   58.3   
    19    	　　大连理工大学　　	   56.9   
    20    	　　　山东大学　　　	   56.3   
'''
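
The difference between the two versions comes down to the fill character used for centering. The sketch below isolates that step with the standard library only; `center_cjk` is a hypothetical helper introduced here for illustration and is not part of the crawler above.

```python
# Full-width (ideographic) space, U+3000
FULLWIDTH_SPACE = chr(12288)


def center_cjk(text, width):
    """Center text in `width` characters, padding with full-width spaces."""
    return "{0:{1}^{2}}".format(text, FULLWIDTH_SPACE, width)


for name in ["清华大学", "中国科学技术大学"]:
    ascii_padded = "{:^10}".format(name)    # Version 1: pads with ordinary spaces
    cjk_padded = center_cjk(name, 10)       # Version 2: pads with full-width spaces
    print("[" + ascii_padded + "]")
    print("[" + cjk_padded + "]")
```

Both variants produce 10 characters, but an ordinary space is usually rendered narrower than a Chinese character, so only the full-width padding keeps the column edges aligned in a typical monospaced display.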