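Requirements:

Project scope: the client's requirements are summarized below.
Case 1:
Raw data:
xx市白云区新市新街新巷16号
Output (split directly):
xx市,白云区,新市新街新巷16号
Case 2:
Raw data:
大沙地沙边街
Output data:
xx市,黄埔区,大沙地沙边街

Additional requirement: write the raw data to the first column.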
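To make the target layout concrete, here is a minimal sketch of one output row with the raw address in the first column. The client only asked for the raw data to come first; the CSV format, the order of the remaining columns, and the helper name `to_row` are my assumptions.

```python
import csv
import io


def to_row(raw, city, district, street):
    """Return one output row: raw address first, then the three parts."""
    return [raw, city, district, street]


rows = [
    to_row("xx市白云区新市新街新巷16号", "xx市", "白云区", "新市新街新巷16号"),
    to_row("大沙地沙边街", "xx市", "黄埔区", "大沙地沙边街"),
]

# Write to an in-memory buffer here; a real run would open a file instead.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Keeping the raw address next to the parsed parts makes it easy to spot-check a row against its source by eye.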

The data provided is as follows:

xx市白云区新市新街新巷16号
xx市花都区狮岭镇岭南工业园合和东路10号
xx市增城区派潭镇大埔村牛角塘一巷
xx市白云区同德街道同嘉路诚德大厦
荔湾区南岸铁路边7号顺景楼
xx市天河区车陂街道车陂高地大街
大沙地沙边街
寺右一马路96号201房
xx市海珠区龙凤街道革新路80号
xx市增城区新塘镇沙埔镇港口村
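Several of these samples already contain the district, so they can be split without any network request. The following is only a sketch of that idea: the regular expression assumes a district name is one to three characters followed by 区, and that every address belongs to xx市. Both hold for the samples above but are my assumptions, and `split_if_complete` is a hypothetical helper name.

```python
import re

# Optional city prefix, then a short district name ending in 区, then the rest.
PATTERN = re.compile(r"^(?:xx市)?(?P<district>[^市区]{1,3}区)(?P<street>.+)$")
CITY = "xx市"  # assumed to be the same for every input


def split_if_complete(raw):
    """Return 'city,district,street' if the district is present, else None."""
    m = PATTERN.match(raw)
    if m is None:
        return None
    return ",".join([CITY, m.group("district"), m.group("street")])


samples = [
    "xx市白云区新市新街新巷16号",
    "荔湾区南岸铁路边7号顺景楼",
    "大沙地沙边街",
    "寺右一马路96号201房",
]
for raw in samples:
    print(raw, "->", split_if_complete(raw) or "needs an API lookup")
```

Only the inputs that come back as `None` have to go through the map API, which keeps the number of requests down.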

Calling the API to fetch the data:
  http://api.map.baidu.com/place/v2/search?q=%s&region=xx市&output=json&ak=vCx0pfB4y3UNeno7INcCi5wCSv4Gqaij
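Because the query and region contain Chinese characters, it can help to build the query string with proper percent-encoding rather than pasting the values into the URL by hand. A small sketch using only the standard library follows; the parameter names are taken from the URL above, while `build_url` and the `YOUR_AK` placeholder are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "http://api.map.baidu.com/place/v2/search"


def build_url(query, ak, region="xx市"):
    """Return the search URL with every parameter percent-encoded."""
    params = {"q": query, "region": region, "output": "json", "ak": ak}
    return BASE_URL + "?" + urlencode(params)


url = build_url("大沙地沙边街", ak="YOUR_AK")  # "YOUR_AK" is a placeholder key
print(url)

# Round-trip check: decoding the query string gives back the original text.
print(parse_qs(urlsplit(url).query)["q"])
```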

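I wrote this script for a friend to scrape district information.

Some entries give only a street, so the district has to be looked up from the street name. The city needs no lookup, since all the addresses are in the same city in Guangdong.

I started by writing a single function to try a lookup: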

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# by i3ekr

import requests

success_list = []

def shell(values):
    # Query the Baidu Place API for the address, restricted to xx市.
    json_data = requests.get("http://api.map.baidu.com/place/v2/search?q=%s&region=xx市&output=json&ak=vCx0pfB4y3UNeno7INcCi5wCSv4Gqaij" % values).json()
    print(json_data)
    try:
        # Use the first result that actually carries an 'area' (district) field.
        for n in range(len(json_data['results'])):
            c2 = json_data['results'][n].get('area')
            if not c2:
                continue
            c1 = 'xx市'
            c3 = values
            # Strip city and district from the raw address to leave the street part.
            if c1 in c3:
                c3 = c3.replace(c1, "")
            if c2 in c3:
                c3 = c3.replace(c2, "")
            success_list.append(c1 + "," + c2 + "," + c3)
            print(c2)
            break
    except Exception as e:
        print("error")

At first I assumed the JSON returned by the API had a fixed layout, so I read the district from a fixed index:

c2 = json_data['results'][1]['area']

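Later I found that area is not always present in the first entry. So I now take the length of results and loop over the entries inside a try block; as soon as one entry yields area, that value is used and break exits the loop.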
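The scanning step on its own can be sketched as below. The response shape (a `results` list of dicts, only some of which have an `area` key) is inferred from the code in this post; the sample dict is hand-written for illustration and is not real API output, and `first_area` is a hypothetical helper name.

```python
def first_area(json_data):
    """Return the first non-empty 'area' found in the results, or None."""
    for result in json_data.get("results", []):
        area = result.get("area")
        if area:
            return area
    return None


# Hand-written stand-in for an API response; real responses carry more fields.
sample = {
    "status": 0,
    "results": [
        {"name": "沙边街"},
        {"name": "大沙地", "area": "黄埔区"},
    ],
}
print(first_area(sample))
```

Using `dict.get` instead of indexing means a missing key simply skips that entry rather than raising, so no try block is needed for this part.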

The last step is to wrap this function up and put it to use.
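One possible way to do the wrapping, sketched here under my own assumptions, is to pass the district lookup in as a function so the formatting logic can be tried out without touching the network. The names `format_address` and `process`, the fake lookup table, and the address 未知路1号 are all made up for this example and are not part of the script in this post.

```python
def format_address(raw, lookup_district, city="xx市"):
    """Return 'city,district,street', or None if no district was found.

    lookup_district is any callable mapping an address to a district name
    (or None); in the real script that role is played by the API request.
    """
    district = lookup_district(raw)
    if not district:
        return None
    street = raw.replace(city, "").replace(district, "")
    return ",".join([city, district, street])


def process(lines, lookup_district):
    """Split the inputs into formatted successes and raw failures."""
    success, fail = [], []
    for raw in lines:
        row = format_address(raw, lookup_district)
        if row is None:
            fail.append(raw)
        else:
            success.append(row)
    return success, fail


# A fake lookup table stands in for the network call in this example.
fake = {"大沙地沙边街": "黄埔区", "xx市白云区新市新街新巷16号": "白云区"}
inputs = ["大沙地沙边街", "xx市白云区新市新街新巷16号", "未知路1号"]
success, fail = process(inputs, fake.get)
print(success)
print(fail)
```

Returning the two lists instead of appending to module-level globals makes each run independent of the previous one.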

The final code is as follows:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# by i3ekr
#api_1 = vCx0pfB4y3UNeno7INcCi5wCSv4Gqaij
#api_2 = i1tGx6jjU3qFkeylf3S7ejBAoiQ6o91B
import time

import requests

API_URL = "http://api.map.baidu.com/place/v2/search?q=%s&region=广州市&output=json&ak=vCx0pfB4y3UNeno7INcCi5wCSv4Gqaij"

fail_list = []
success_list = []


def guolv(values):
    """Look up the district via the API and record 'city,district,street'."""
    json_data = requests.get(API_URL % values).json()
    try:
        # Use the first result that actually carries an 'area' (district) field.
        for n in range(len(json_data['results'])):
            c2 = json_data['results'][n].get('area')
            if not c2:
                continue
            c1 = '广州市'
            c3 = values
            if c1 in c3:
                c3 = c3.replace(c1, "")
            if c2 in c3:
                c3 = c3.replace(c2, "")
            success_list.append(c1 + "," + c2 + "," + c3)
            return
    except Exception:
        pass
    # No usable result: record the input as a failure.
    fail_list.append(values)


def address(values):
    try:
        guolv(values)
    except Exception:
        fail_list.append(values)


def shell(values):
    # NOTE: the string literals tested in this function were empty in the
    # published listing. '区' is inferred from the replacement text '区,';
    # '街道' is an assumption based on the variable name jiedao_left.
    if "广州市" in values and "区" in values:
        # City and district are both present: just insert the separators.
        data = values.replace('广州市', '广州市,')
        success_list.append(data.replace('区', '区,', 1))
    elif "街道" in values:
        # Query with the part up to and including the subdistrict name.
        jiedao_left = values.split('街道')[0] + "街道"
        try:
            guolv(jiedao_left)
        except Exception:
            address(values)
    else:
        guolv(values)


if __name__ == "__main__":
    now_time = time.time()
    with open("data.txt", "r", encoding="utf-8") as f:
        for i in f:
            data = i.strip()
            if not data:
                continue
            print("[+] Testing: %s" % (data))
            shell(data)

    print("success %s" % (len(success_list)))
    print("fail    %s" % (len(fail_list)))
    print("----------")
    print('Total time: %s' % (time.time() - now_time))

    # Append the successes first, then the failures under a separator line.
    with open('success.txt', 'a+', encoding="utf-8") as f:
        for i in success_list:
            f.write(i + "\n")
        f.write("[-] Failed entries below ----------------------------------------------------\n")
        for x in fail_list:
            f.write("[-]" + x + "\n")

 
