Python Web Scraping Learning Journey (Part 6)

JSON and Scraping Data with JsonPath

JSON
JsonPath

Operators
Practical Examples of JsonPath and XPath

Scraping Taobao Reviews with JsonPath
Program Source Code

JSON

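JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is based on a subset of ECMAScript (the JavaScript specification maintained by the European Computer Manufacturers Association) and uses a text format that is completely independent of any programming language to store and represent data. Its concise, clear hierarchical structure makes JSON an ideal data-interchange language. It is easy for people to read and write, easy for machines to parse and generate, and it improves network transmission efficiency.
Source: Baidu Baike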
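
To make the definition concrete, here is a minimal sketch that parses a JSON string and serializes it back with Python's standard json module. The sample record is made up for illustration and does not come from any real API.

```python
import json

# A made-up JSON document (illustrative data only)
text = '{"name": "demo", "tags": ["a", "b"], "price": 8.95, "in_stock": true}'

# Parse: a JSON object becomes a dict, an array a list, true becomes True
record = json.loads(text)

# Serialize back; ensure_ascii=False keeps non-ASCII characters readable
round_trip = json.dumps(record, ensure_ascii=False)
```

The same two calls, json.loads and json.dumps, are what the scraper at the end of this post relies on.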

JsonPath

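Much as XPath locates nodes in an XML document, a JsonPath expression is typically used to look up or set a value in a JSON document by its path. Expressions can be written in dot notation or bracket notation, for example $.store.book[0].title and $['store']['book'][0]['title'].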

Put simply, JsonPath is to JSON what XPath is to XML.

Operators

The table below compares the XPath and JsonPath operators.

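XPath	JsonPath	Description
/	$	Root element
.	@	Current element
*	*	Wildcard; matches any name or index
/	. or []	Child operator
..	n/a	Parent element
//	..	Recursive descent; JsonPath borrows this syntax from E4X
@	n/a	Attribute access; JSON has no attributes
[]	[]	Subscript operator: a predicate in XPath, an array index or child name in JsonPath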
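
Recursive descent is the least obvious operator in the table, so here is a minimal sketch of the idea in plain Python. The helper find_all is hypothetical: it imitates what $..key selects and is not how any JsonPath library is implemented. The sample data is a trimmed version of the bookstore document used in the next section.

```python
def find_all(node, key):
    """Collect every value stored under `key` at any depth, like $..key."""
    found = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                found.append(value)
            # Keep descending: nested containers may hold the key too
            found.extend(find_all(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_all(item, key))
    return found


doc = {"store": {"book": [{"price": 8.95}, {"price": 12.99}],
                 "bicycle": {"price": 19.95}}}

# Same selection as $..price, or $.store..price for this document
prices = find_all(doc, "price")
```

A real JsonPath library may return matches in a different order, so treat the ordering here as a property of this sketch only.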

Practical Examples of JsonPath and XPath

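The simple bookstore example below, shown here as JSON, illustrates how XPath and JsonPath differ in practice. The XPath expressions apply to the equivalent XML document.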

{ "store": {
    "book": [ 
      { "category": "reference",
        "author": "Nigel Rees",
        "title": "Sayings of the Century",
        "price": 8.95
      },
      { "category": "fiction",
        "author": "Evelyn Waugh",
        "title": "Sword of Honour",
        "price": 12.99
      },
      { "category": "fiction",
        "author": "Herman Melville",
        "title": "Moby Dick",
        "isbn": "0-553-21311-3",
        "price": 8.99
      },
      { "category": "fiction",
        "author": "J. R. R. Tolkien",
        "title": "The Lord of the Rings",
        "isbn": "0-395-19395-8",
        "price": 22.99
      }
    ],
    "bicycle": {
      "color": "red",
      "price": 19.95
    }
  }
}

XPath	JsonPath	Result
/store/book/author	$.store.book[*].author	The authors of all books in the store
//author	$..author	All authors
/store/*	$.store.*	Everything in the store: all the books and the bicycle
/store//price	$.store..price	The price of everything in the store
//book[3]	$..book[2]	The third book
//book[last()]	$..book[(@.length-1)]	The last book
//book[position()<3]	$..book[0,1] or $..book[:2]	The first two books
//book[isbn]	$..book[?(@.isbn)]	All books that have an isbn
//book[price<10]	$..book[?(@.price<10)]	All books priced below 10
//*	$..*	All elements
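
To check the table against the sample data, the sketch below spells out several rows as plain Python over the parsed document. It uses no JsonPath library; each line is only an equivalent selection written by hand, with the matching expression in the trailing comment.

```python
doc = {"store": {
    "book": [
        {"category": "reference", "author": "Nigel Rees",
         "title": "Sayings of the Century", "price": 8.95},
        {"category": "fiction", "author": "Evelyn Waugh",
         "title": "Sword of Honour", "price": 12.99},
        {"category": "fiction", "author": "Herman Melville",
         "title": "Moby Dick", "isbn": "0-553-21311-3", "price": 8.99},
        {"category": "fiction", "author": "J. R. R. Tolkien",
         "title": "The Lord of the Rings", "isbn": "0-395-19395-8",
         "price": 22.99},
    ],
    "bicycle": {"color": "red", "price": 19.95},
}}

books = doc["store"]["book"]

authors = [b["author"] for b in books]           # $.store.book[*].author
third = books[2]                                 # $..book[2]
last = books[-1]                                 # $..book[(@.length-1)]
first_two = books[:2]                            # $..book[:2]
with_isbn = [b for b in books if "isbn" in b]    # $..book[?(@.isbn)]
cheap = [b for b in books if b["price"] < 10]    # $..book[?(@.price<10)]
```

One difference to keep in mind: a JsonPath query generally returns a list of matches even when only one node matches, whereas books[2] above returns the book itself.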

Scraping Taobao Reviews with JsonPath

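We need Fiddler here to capture the page's network traffic.
Capture the traffic while switching to another page of reviews.

The response to look at is the 7th one in the capture.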

It turns out to be a GET request. Take the address of that request and simplify it; the result is the url below, which is what this scraper will request.

url = https://rate.taobao.com/feedRateList.htm?auctionNumId=579654818595&userNumId=895000657&currentPageNum=2&pageSize=20

Opening this url in a browser returns the page's JSON data.
An online JSON parser can format it for inspection; I used json.cn here.

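Note that the pasted JSON data is wrapped in a pair of parentheses, ( at the start and ) at the end. Delete them by hand, otherwise the parser reports an error and shows nothing. The scraper has to strip the same parentheses from the JSON data, as explained later.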
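
As a rough sketch of that cleanup step, the hypothetical helper below removes exactly one pair of wrapping parentheses and leaves everything else alone. It is an alternative to the strip call used later in this post, not a replacement for it, and the raw string is made-up sample data rather than a real Taobao response.

```python
import json


def strip_wrapper(text):
    """Remove one pair of parentheses wrapped around a JSON payload."""
    text = text.strip()
    if text.startswith('(') and text.endswith(')'):
        return text[1:-1].strip()
    # No wrapper found: return the trimmed text unchanged
    return text


# Made-up sample that imitates the wrapped response format
raw = '\r\n({"total": 2, "comments": [{"content": "ok (really)"}]})\r\n'
obj = json.loads(strip_wrapper(raw))
```

Parentheses inside string values, such as the one in the sample review text, are untouched because only the outermost characters are examined.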

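The parsed JSON shows that each review is a separate node under the comments node. This scraper collects five fields from each review: username, user avatar, review time, review content, and phone information.
As in the earlier scrapers, start by setting the request headers.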

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0',
        'Host': 'rate.taobao.com',
        'Referer': 'https://item.taobao.com/item.htm?spm=a230r.1.14.14.431b5d42Ngh3OU&id=579654818595&ns=1&abbucket=2&on_comment=99',
        'Cookie': '',
    }

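A Cookie header is included because, in earlier attempts, the request could not connect to this address without one, so I added it directly. You can look up your own cookie in Fiddler.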
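
One way to keep the cookie out of the header literal is to pass it in as a parameter. The sketch below only builds the request object and sends nothing over the network; build_request is a hypothetical helper and 'name=value' is a placeholder, not a working cookie.

```python
import urllib.request


def build_request(url, cookie=''):
    """Build (but do not send) a GET request carrying the scraper's headers."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0',
        'Host': 'rate.taobao.com',
        'Cookie': cookie,  # paste the value copied from Fiddler here
    }
    return urllib.request.Request(url=url, headers=headers)


request = build_request(
    'https://rate.taobao.com/feedRateList.htm?auctionNumId=579654818595'
    '&userNumId=895000657&currentPageNum=2&pageSize=20',
    cookie='name=value',  # placeholder, not a real cookie
)
```

Inspecting request.get_method() and request.get_header('Cookie') before calling urlopen is a cheap way to confirm the headers are what you expect.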

    request = urllib.request.Request(url=url, headers=headers)
    json_text = urllib.request.urlopen(request).read().decode()
    # Strip the parentheses wrapped around the JSON
    json_text = json_text.strip('()\n\t\r')

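This sends the request, reads the response, and strips the parentheses from the JSON data it contains.
The next snippet processes the JSON data: it converts it to a Python object and pulls the contents of comments out into a list for easier extraction.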

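    # Convert the JSON string to a Python object
    obj = json.loads(json_text)
    # Extract the review content
    # Take out the comments list
    comments_list = obj['comments']
    # Iterate over the list and extract each review in turn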

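In the processed data we can locate all five fields (username, user avatar, review time, review content, and phone information), and each field sits in the same place in every review.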

Iterate over all the comments and save the fields of each one as a dictionary.
This is where the use of JsonPath becomes visible.

    for comments in comments_list:
        user = jsonpath.jsonpath(comments, '$..user')[0]
        # User avatar
        face = 'http:' + user['avatar']
        # Username
        name = user['nick']
        # Review content
        ping_content = comments['content']
        # Review time
        ping_time = comments['date']
        # Phone information
        info = jsonpath.jsonpath(comments, '$..sku')[0]
        # Save the review fields in a dictionary
        # (keys: avatar, username, review, time, info)
        item = {
            '用户头像': face,
            '用户名': name,
            '评论': ping_content,
            '时间': ping_time,
            '信息': info,
        }
        item_list.append(item)

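Comparing the output with the original site shows that the content matches, and each avatar link opens the corresponding user's avatar.
This scraper is far from complete: it fetches only a single page of reviews. To scrape multiple pages, add a loop. If needed, you can also follow the earlier scrapers in this series and download each image locally.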
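
A possible starting point for that loop is to generate one url per page. The sketch below assumes that currentPageNum is the only parameter that changes between pages and that pages are numbered from 1; the helper build_page_url and the range of three pages are illustrative choices, and nothing is requested from the network.

```python
from urllib.parse import urlencode

BASE_URL = 'https://rate.taobao.com/feedRateList.htm'


def build_page_url(page, page_size=20):
    """Return the review-list url for one page (query values from this post)."""
    params = {
        'auctionNumId': '579654818595',
        'userNumId': '895000657',
        'currentPageNum': page,
        'pageSize': page_size,
    }
    return BASE_URL + '?' + urlencode(params)


# Urls for pages 1 to 3; each one would go through the same request,
# parenthesis stripping, and JsonPath extraction steps as the single page
page_urls = [build_page_url(page) for page in range(1, 4)]
```

When looping over real pages, pausing between requests is a sensible precaution against being rate limited.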

Program Source Code

import urllib.request
import json
import jsonpath

item_list = []

def main():
    url = 'https://rate.taobao.com/feedRateList.htm?auctionNumId=579654818595&userNumId=895000657&currentPageNum=2&pageSize=20'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0',
        'Host': 'rate.taobao.com',
        'Referer': 'https://item.taobao.com/item.htm?spm=a230r.1.14.3.798a59ceUTFlwz&id=578785851542&ns=1&abbucket=5',
        'Cookie': '',
    }
    request = urllib.request.Request(url=url, headers=headers)
    json_text = urllib.request.urlopen(request).read().decode()
    # Strip the parentheses wrapped around the JSON
    json_text = json_text.strip('()\n\t\r')
    # Convert the JSON string to a Python object
    obj = json.loads(json_text)
    # Take out the comments list
    comments_list = obj['comments']
    # Iterate over the list and extract each review in turn
    for comments in comments_list:
        user = jsonpath.jsonpath(comments, '$..user')[0]
        # User avatar
        face = 'http:' + user['avatar']
        # Username
        name = user['nick']
        # Review content
        ping_content = comments['content']
        # Review time
        ping_time = comments['date']
        # Phone information
        info = jsonpath.jsonpath(comments, '$..sku')[0]
        # Save the review fields in a dictionary
        # (keys: avatar, username, review, time, info)
        item = {
            '用户头像': face,
            '用户名': name,
            '评论': ping_content,
            '时间': ping_time,
            '信息': info,
        }
        item_list.append(item)

if __name__ == '__main__':
    main()

    string = json.dumps(item_list, ensure_ascii=False)
    # Save to a file
    with open('pint.txt', 'w', encoding='utf8') as fp:
        fp.write(string)