Parsing JSON Files

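I recently used Baidu's text recognition (OCR) service to extract the text in an image along with its position.

Baidu returns the recognition result as JSON. Printed out, it looks like this: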

{'words_result': [{'words': '勤道天', 'location': {'top': 190, 'left': 135, 'width': 499, 'height': 136}, 'chars': [{'char': '勤', 'location': {'top': 190, 'left': 135, 'width': 81, 'height': 136}}, {'char': '道', 'location': {'top': 190, 'left': 385, 'width': 125, 'height': 136}}, {'char': '天', 'location': {'top': 190, 'left': 509, 'width': 82, 'height': 136}}]}, {'words': '刚欲平川智海', 'location': {'top': 337, 'left': 161, 'width': 471, 'height': 113}, 'chars': [{'char': '刚', 'location': {'top': 337, 'left': 230, 'width': 51, 'height': 63}}, {'char': '欲', 'location': {'top': 337, 'left': 265, 'width': 56, 'height': 62}}, {'char': '平', 'location': {'top': 347, 'left': 335, 'width': 67, 'height': 76}}, {'char': '川', 'location': {'top': 384, 'left': 501, 'width': 41, 'height': 66}}, {'char': '智', 'location': {'top': 381, 'left': 541, 'width': 39, 'height': 68}}, {'char': '海', 'location': {'top': 378, 'left': 579, 'width': 39, 'height': 68}}]}, {'words': '政治家', 'location': {'top': 348, 'left': 186, 'width': 16, 'height': 70}, 'chars': [{'char': '政', 'location': {'top': 374, 'left': 186, 'width': 16, 'height': 10}}, {'char': '治', 'location': {'top': 388, 'left': 186, 'width': 16, 'height': 10}}, {'char': '家', 'location': {'top': 402, 'left': 186, 'width': 16, 'height': 10}}]}, {'words': '任意2套省20%', 'location': {'top': 704, 'left': 287, 'width': 468, 'height': 76}, 'chars': [{'char': '任', 'location': {'top': 704, 'left': 287, 'width': 51, 'height': 76}}, {'char': '意', 'location': {'top': 704, 'left': 363, 'width': 51, 'height': 76}}, {'char': '2', 'location': {'top': 704, 'left': 433, 'width': 42, 'height': 76}}, {'char': '套', 'location': {'top': 704, 'left': 466, 'width': 51, 'height': 76}}, {'char': '省', 'location': {'top': 704, 'left': 545, 'width': 50, 'height': 76}}, {'char': '2', 'location': {'top': 704, 'left': 614, 'width': 42, 'height': 76}}, {'char': '0', 'location': {'top': 704, 'left': 639, 'width': 42, 'height': 76}}, {'char': '%', 'location': {'top': 704, 'left': 690, 'width': 42, 'height': 76}}]}], 'log_id': 1380934582706110464, 'words_result_num': 4}

Looks pretty messy, right? If you don't know how JSON is structured, it really can make your head spin.

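I. What Is a JSON File?

As shown above, the content is really just one long string. It does have structure, though, which is what makes it usable for storing data.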
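To make the "string with structure" idea concrete, here is a minimal sketch using Python's standard json module. The text in raw is a made-up fragment modelled on the response above, not real API output.

```python
import json

# A JSON document is just text. This sample is a hypothetical fragment.
raw = '{"words_result_num": 4, "words_result": [{"words": "abc"}]}'

# json.loads turns the text into Python objects (dict, list, str, int, ...).
data = json.loads(raw)
print(type(raw).__name__)        # str
print(type(data).__name__)       # dict
print(data["words_result_num"])  # 4

# json.dumps goes the other way: Python objects back to JSON text.
text = json.dumps(data)
```

Until it is parsed, the string cannot be indexed by key. After json.loads it behaves like any other dict.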

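II. Analyzing the Structure

Anyone who has learned Python knows that "{}" denotes a dict (JSON calls it an object) and "[]" denotes a list (JSON calls it an array).

Look closely at the sample above: JSON stores all kinds of complex data simply by combining these two forms.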

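1. Dicts

A dict stores data in the form {'key': value}.

The key must be a string wrapped in quotes. (Python accepts single or double quotes. Strict JSON text requires double quotes. The sample above shows single quotes because it is the printed form of the already-parsed Python dict.)

The value can be of any type (string, number, list, dict, ...).

Key and value are joined into a pair by a colon ":", as in {"key": value}.

If a dict has several entries, they are separated by commas ",", as in {"key1": value, "key2": value, "key3": value}.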
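One caveat about quoting is worth checking in code: strict JSON text only accepts double-quoted strings, whereas a Python dict literal accepts either kind. The sketch below uses two made-up strings to show the difference. It uses ast.literal_eval from the standard library as one possible way to read back a printed Python dict.

```python
import ast
import json

json_text = '{"key1": 1, "key2": "two"}'    # valid JSON: double quotes
python_repr = "{'key1': 1, 'key2': 'two'}"  # printed Python dict: single quotes

from_json = json.loads(json_text)

# Strict JSON parsing rejects single-quoted keys.
try:
    json.loads(python_repr)
    single_quotes_ok = True
except json.JSONDecodeError:
    single_quotes_ok = False

# ast.literal_eval safely evaluates a Python literal such as a printed dict.
from_repr = ast.literal_eval(python_repr)

print(single_quotes_ok)        # False
print(from_json == from_repr)  # True
```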

2. Lists

A list has the form [xx, xx, xx], with elements separated by commas ",".

In Python a list can even hold elements of different types, e.g. ["a", "b", "c", 1, 2, 3].
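As a quick check, the mixed-type list parses as expected with the standard json module. The second string is a made-up example showing that list elements can themselves be dicts or lists, and how JSON's null and true map to Python values.

```python
import json

mixed = json.loads('["a", "b", "c", 1, 2, 3]')
types = [type(x).__name__ for x in mixed]
print(types)  # ['str', 'str', 'str', 'int', 'int', 'int']

# Elements may also be dicts, lists, null (None) or true/false (True/False).
nested = json.loads('[{"char": "a"}, [1, 2], null, true]')
print(nested[0]["char"])  # a
```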

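3. Analyzing the example:

Look at the sample at the top. The outermost layer is a dict: {'words_result': [XXX,...] , 'log_id': 1380934582706110464 , 'words_result_num': 4}

This dict has three entries (key-value pairs). Why three? Remember that entries are separated by commas ",". Well... finished counting?

The first entry is 'words_result': [XXX,...]. Its key is 'words_result' and its value is a list [XXX].

The second entry is 'log_id': 1380934582706110464. Its key is 'log_id' and its value is a number.

The third entry is 'words_result_num': 4. Its key is 'words_result_num' and its value is also a number.

The data we need (the text and its position) is all inside the list [] of the first entry.

Let's see what this [XXX] list contains: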
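The same inspection can be done in code rather than by counting commas. This is a sketch on a trimmed copy of the response: the variable name result is chosen here for illustration and each entry is cut down to its 'words' field.

```python
# Trimmed copy of the response above; inner entries are shortened for brevity.
result = {
    'words_result': [{'words': '勤道天'}, {'words': '刚欲平川智海'},
                     {'words': '政治家'}, {'words': '任意2套省20%'}],
    'log_id': 1380934582706110464,
    'words_result_num': 4,
}

# List each top-level key together with the type of its value.
for key, value in result.items():
    print(key, type(value).__name__)
# words_result list
# log_id int
# words_result_num int

# In this sample the count field agrees with the length of the list.
counts_match = len(result['words_result']) == result['words_result_num']
print(counts_match)  # True
```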

[{'words': '勤道天', 'location': {'top': 190, 'left': 135, 'width': 499, 'height': 136}, 'chars': [{'char': '勤', 'location': {'top': 190, 'left': 135, 'width': 81, 'height': 136}}, {'char': '道', 'location': {'top': 190, 'left': 385, 'width': 125, 'height': 136}}, {'char': '天', 'location': {'top': 190, 'left': 509, 'width': 82, 'height': 136}}]}, {'words': '刚欲平川智海', 'location': {'top': 337, 'left': 161, 'width': 471, 'height': 113}, 'chars': [{'char': '刚', 'location': {'top': 337, 'left': 230, 'width': 51, 'height': 63}}, {'char': '欲', 'location': {'top': 337, 'left': 265, 'width': 56, 'height': 62}}, {'char': '平', 'location': {'top': 347, 'left': 335, 'width': 67, 'height': 76}}, {'char': '川', 'location': {'top': 384, 'left': 501, 'width': 41, 'height': 66}}, {'char': '智', 'location': {'top': 381, 'left': 541, 'width': 39, 'height': 68}}, {'char': '海', 'location': {'top': 378, 'left': 579, 'width': 39, 'height': 68}}]}, {'words': '政治家', 'location': {'top': 348, 'left': 186, 'width': 16, 'height': 70}, 'chars': [{'char': '政', 'location': {'top': 374, 'left': 186, 'width': 16, 'height': 10}}, {'char': '治', 'location': {'top': 388, 'left': 186, 'width': 16, 'height': 10}}, {'char': '家', 'location': {'top': 402, 'left': 186, 'width': 16, 'height': 10}}]}, {'words': '任意2套省20%', 'location': {'top': 704, 'left': 287, 'width': 468, 'height': 76}, 'chars': [{'char': '任', 'location': {'top': 704, 'left': 287, 'width': 51, 'height': 76}}, {'char': '意', 'location': {'top': 704, 'left': 363, 'width': 51, 'height': 76}}, {'char': '2', 'location': {'top': 704, 'left': 433, 'width': 42, 'height': 76}}, {'char': '套', 'location': {'top': 704, 'left': 466, 'width': 51, 'height': 76}}, {'char': '省', 'location': {'top': 704, 'left': 545, 'width': 50, 'height': 76}}, {'char': '2', 'location': {'top': 704, 'left': 614, 'width': 42, 'height': 76}}, {'char': '0', 'location': {'top': 704, 'left': 639, 'width': 42, 'height': 76}}, {'char': '%', 'location': {'top': 704, 'left': 690, 'width': 42, 'height': 76}}]}]

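It opens with a curly brace "{", so the list evidently holds dicts.

Here's a tip: open the JSON file in Notepad++ and click on the first curly brace "{". The matching closing brace "}" is then highlighted in red.

A quick look reveals the following structure: [{},{},{},...]

The list holds N dict elements, and every dict has the same structure.

Take out the first dict and look at its structure: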
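If no editor is at hand, pretty-printing achieves much the same thing as brace matching. This is a minimal sketch using json.dumps on a trimmed copy of the first entry. The variable names are illustrative.

```python
import json

# Trimmed copy of the first entry: only one character is kept.
entry = {
    'words': '勤道天',
    'location': {'top': 190, 'left': 135, 'width': 499, 'height': 136},
    'chars': [
        {'char': '勤',
         'location': {'top': 190, 'left': 135, 'width': 81, 'height': 136}},
    ],
}

# indent=2 puts each nesting level on its own indented lines;
# ensure_ascii=False keeps the Chinese characters readable.
pretty = json.dumps(entry, indent=2, ensure_ascii=False)
print(pretty)
```

With indentation in place, the nesting of dicts inside lists inside dicts can be read off directly from how far each line is indented.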

{'words': '勤道天', 'location': {'top': 190, 'left': 135, 'width': 499, 'height': 136}, 'chars': [{'char': '勤', 'location': {'top': 190, 'left': 135, 'width': 81, 'height': 136}}, {'char': '道', 'location': {'top': 190, 'left': 385, 'width': 125, 'height': 136}}, {'char': '天', 'location': {'top': 190, 'left': 509, 'width': 82, 'height': 136}}]}

Its structure is: {'words': string, 'location': dict, 'chars': list}

OK, can you see it? By now you should be able to analyze the structure of a JSON document.
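The manual analysis above can also be automated. The helper below, describe, is a hypothetical function written for this illustration and is not part of any library. It walks nested dicts and lists and prints their shape, assuming that all elements of a list share the structure of the first one, as observed above.

```python
def describe(value):
    """Return a short description of the shape of nested dicts and lists."""
    if isinstance(value, dict):
        inner = ', '.join(f'{k!r}: {describe(v)}' for k, v in value.items())
        return '{' + inner + '}'
    if isinstance(value, list):
        if not value:
            return 'list'
        # Assumption: every element has the same shape as the first one.
        return '[' + describe(value[0]) + ', ...]'
    return type(value).__name__


# Trimmed copy of the first entry of the sample.
entry = {
    'words': '勤道天',
    'location': {'top': 190, 'left': 135, 'width': 499, 'height': 136},
    'chars': [
        {'char': '勤',
         'location': {'top': 190, 'left': 135, 'width': 81, 'height': 136}},
    ],
}

shape = describe(entry)
print(shape)
```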

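III. Extracting the Data

It's very simple: values are always pulled out by indexing.

1. Getting a list element

Take a list items = ["a", "b", "c"].

To get the first element, "a", write items[0].

2. Getting a dict entry

Take a dict dic = {"words": "锅大侠", "age": 100, "sex": "unknown"}.

To get the age, 100, write dic['age'].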
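The two kinds of indexing can be chained to reach nested values. This sketch reuses the small examples above and adds a one-character fragment of the OCR data. The use of dict.get for a missing key is an addition for illustration, not something the walkthrough relies on.

```python
items = ['a', 'b', 'c']
dic = {'words': '锅大侠', 'age': 100, 'sex': 'unknown'}

first = items[0]   # list index: position, starting from 0
age = dic['age']   # dict index: key name
print(first, age)  # a 100

# dic['height'] would raise KeyError; .get returns None (or a default) instead.
height = dic.get('height')
print(height)      # None

# Chained indexing walks down one level per pair of brackets.
entry = {'chars': [{'char': '勤', 'location': {'top': 190, 'left': 135}}]}
left = entry['chars'][0]['location']['left']
print(left)        # 135
```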

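3. Hands-on practice

Requirement: extract every individual character with its position and print them to a txt file.

Note: Baidu's result arrives as an object named response. Calling response.json() returns the JSON content directly, already parsed into a Python dict.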

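① Get the result list (i.e. use the key "words_result" to take the value out of the dict; the value is a list)

    resultList = response.json()['words_result']

② Loop over the list and take each element. We only need its list of characters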

    for item in resultList:
        chars = item['chars']               # list of characters

③ Loop over the character list and take each character and its position

        for item2 in chars:
            char = item2['char']            # the character
            location = item2['location']    # its bounding box
            top = location['top']           # top offset
            left = location['left']         # left offset
            width = location['width']       # width
            height = location['height']     # height

④ Output

No.    Char

1      勤
       Width: 81     Height: 136
       Left: 135   Top: 190

2      道
       Width: 125     Height: 136
       Left: 385   Top: 190

3      天
       Width: 82     Height: 136
       Left: 509   Top: 190

4      刚
       Width: 51     Height: 63
       Left: 230   Top: 337

5      欲
       Width: 56     Height: 62
       Left: 265   Top: 337

......
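Putting steps ① to ③ together, the whole exercise can be sketched as one runnable script. Several things here are assumptions made for illustration: the dict result stands in for response.json() and is trimmed to the first entry of the sample, the helper name format_chars and the output file name chars.txt are invented, and the layout only approximates the listing above.

```python
import os
import tempfile

# Stand-in for response.json(), trimmed to the first entry of the sample.
result = {
    'words_result': [
        {'words': '勤道天',
         'chars': [
             {'char': '勤', 'location': {'top': 190, 'left': 135, 'width': 81, 'height': 136}},
             {'char': '道', 'location': {'top': 190, 'left': 385, 'width': 125, 'height': 136}},
             {'char': '天', 'location': {'top': 190, 'left': 509, 'width': 82, 'height': 136}},
         ]},
    ],
}


def format_chars(result):
    """Build one text block per recognized character."""
    blocks = []
    index = 0
    for item in result['words_result']:
        # Defensive default: skip entries that carry no 'chars' key.
        for entry in item.get('chars', []):
            index += 1
            loc = entry['location']
            blocks.append(
                f"{index}      {entry['char']}\n"
                f"       Width: {loc['width']}     Height: {loc['height']}\n"
                f"       Left: {loc['left']}   Top: {loc['top']}\n"
            )
    return blocks


# Hypothetical output location: a file in the system temp directory.
out_path = os.path.join(tempfile.gettempdir(), 'chars.txt')
with open(out_path, 'w', encoding='utf-8') as f:
    f.write('No.    Char\n\n')
    f.write('\n'.join(format_chars(result)))
```

Keeping the formatting in a function that returns strings, separate from the file writing, makes it easy to check the blocks without touching the disk.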