Tools used: Chrome, Eclipse, Python 3 (Anaconda3)
Modules: requests, lxml, csv, time
I. Data Collection
1. Identify the target: scrape second-hand housing listings in the Chongqing area (unit price, total price, layout, floor area, etc.)
1) Open the target site in Chrome and locate the data fields to be scraped.
2) Press F12 on that page, find the target data, and copy its XPath; the result is shown in Figure 1-2-2.
Inspecting the XPath of several listings shows that only the number inside li[?] differs between them; each page holds 60 listings in total, so the maximum is li[60].
Figure 1-2-1  Figure 1-2-2
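Since XPath indices are 1-based and only the li[?] number changes between listings, the same extraction can also be sketched with a relative XPath over all li nodes, which avoids hard-coding the 1..60 range. The markup below is a minimal hypothetical stand-in for the listing page structure, not the real Anjuke HTML:

```python
from lxml import etree

# Hypothetical markup, loosely modeled on the ul#houselist-mod-new / li
# structure described above (not the actual page source).
html = '''
<ul id="houselist-mod-new">
  <li><div class="details"><span>3室2厅</span></div></li>
  <li><div class="details"><span>2室1厅</span></div></li>
</ul>
'''

sel = etree.HTML(html)

# Indexed access, as in the tutorial: li[1], li[2], ... (XPath is 1-based)
first = sel.xpath('//*[@id="houselist-mod-new"]/li[1]//span/text()')[0]

# Alternative: iterate over every li node with a relative XPath,
# so the number of listings per page no longer has to be hard-coded.
models = [li.xpath('.//span/text()')[0]
          for li in sel.xpath('//*[@id="houselist-mod-new"]/li')]

print(first)    # 3室2厅
print(models)   # ['3室2厅', '2室1厅']
```

The node-iteration form also degrades gracefully when a page has fewer than 60 listings, where a fixed li[60] lookup would raise an IndexError.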
2. Analyze the page URLs
As in step 2), the request URL can be seen under the Network tab; comparing several pages shows that the URLs differ only in the number after p.
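That pattern can be checked offline with a quick sketch: one URL per results page, where the trailing number after p is the only part that varies.

```python
def page_urls(page_number):
    # Yield one listing-page URL per page; pages differ only in the
    # number after 'p' (e.g. .../sale/p1, .../sale/p2, ...).
    for page in range(1, page_number + 1):
        yield 'https://chongqing.anjuke.com/sale/p' + str(page)

urls = list(page_urls(3))
print(urls[0])   # https://chongqing.anjuke.com/sale/p1
print(urls[-1])  # https://chongqing.anjuke.com/sale/p3
```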
II. Python Implementation
1. Straight to the code:
#!/usr/bin/env python
# -*- coding: utf8 -*-

'''
Created on 2018-11-24
@author: perilong
'''
import requests
from lxml import etree
import time
import csv


'''
Function: spider
Purpose:  fetch the target page and return its HTML source text
Param:    url - target URL
'''
def spider(url):
    try:
        header = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                                'AppleWebKit/537.36 (KHTML, like Gecko) '
                                'Chrome/70.0.3538.102 Safari/537.36'}
        response = requests.get(url=url, headers=header)
        return response.text
    except requests.RequestException:
        print('failed to spider the target site, please check if the url is correct or the connection is available!')


'''
Function: spider_detail
Purpose:  parse the HTML source and extract the fields of each listing
Param:    url - target URL
'''
def spider_detail(url):
    response_text = spider(url)
    sel = etree.HTML(response_text)
    for house_num in range(1, 61):  # 60 listings per page: li[1] .. li[60]
        house_model = sel.xpath('//*[@id="houselist-mod-new"]/li[%d]/div[2]/div[2]/span[1]/text()'
                                % house_num)[0].strip()
        house_area = sel.xpath('//*[@id="houselist-mod-new"]/li[%d]/div[2]/div[2]/span[2]/text()'
                               % house_num)[0].strip()
        house_floor = sel.xpath('//*[@id="houselist-mod-new"]/li[%d]/div[2]/div[2]/span[3]/text()'
                                % house_num)[0].strip()
        house_year = sel.xpath('//*[@id="houselist-mod-new"]/li[%d]/div[2]/div[2]/span[4]/text()'
                               % house_num)[0].strip()
        house_location = sel.xpath('//*[@id="houselist-mod-new"]/li[%d]/div[2]/div[3]/span/text()'
                                   % house_num)[0].strip()
        house_price = sel.xpath('//*[@id="houselist-mod-new"]/li[%d]/div[3]/span[2]/text()'
                                % house_num)[0].strip()
        house_total = sel.xpath('//*[@id="houselist-mod-new"]/li[%d]/div[3]/span[1]/strong/text()'
                                % house_num)[0].strip()
        house_connection = sel.xpath('//*[@id="houselist-mod-new"]/li[%d]/div[2]/div[2]/span[5]/text()'
                                     % house_num)[0].strip()

        # split "district - garden" out of the location text, then strip units
        house_district = house_location.split('\n')[1].split('-')[0].strip()
        house_garden = house_location.split('\n')[1].split('-')[1].strip()
        house_price = house_price.strip('元/m²')
        house_year = house_year.strip('年建造')
        house_area = house_area.strip('m²')

        house_data = [house_model, house_area, house_floor, house_year,
                      house_price, house_total, house_district, house_garden, house_connection]
        save_csv(house_data)

        print('house_model: ', house_model)
        print('house_area: ', house_area)
        print('house_floor: ', house_floor)
        print('house_year: ', house_year)
        print('house_garden: ', house_garden)
        print('house_price: ', house_price)
        print('house_total: ', house_total)
        print('house_district: ', house_district)
        print('house_connection: ', house_connection)
        print('========================================')


'''
Function: save_csv
Purpose:  append one row of data to the CSV file
Param:    house_data - list of fields for one listing
'''
def save_csv(house_data):
    try:
        with open('D:/spider_data/QFange/chongqing.csv', 'a', encoding='utf-8-sig', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(house_data)
    except OSError:
        print('write csv error!')


'''
Function: get_all_urls
Purpose:  generate the URL of every page to crawl
Param:    page_number - total number of pages to crawl
Returns:  a generator yielding one URL per page
'''
def get_all_urls(page_number):
    if isinstance(page_number, int) and page_number > 0:  # guard against bad input
        for page in range(1, page_number + 1):
            url = 'https://chongqing.anjuke.com/sale/p' + str(page)
            yield url
    else:
        print('page_number is incorrect!')


# write the CSV header row
save_csv(['house_model', 'house_area', 'house_floor', 'house_year',
          'house_price', 'house_total', 'house_district', 'house_garden', 'house_connection'])

for url in get_all_urls(50):
    try:
        time.sleep(20)  # throttle requests to avoid being blocked
        spider_detail(url)
    except Exception:
        print('an error occurred while spidering Chongqing housing prices!')
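One caveat on the unit-stripping in spider_detail: str.strip(chars) removes any of the listed characters from both ends, not a literal suffix. It happens to work here because those characters never appear inside the values themselves; a suffix-safe helper (strip_suffix below is a hypothetical addition, not part of the original script) is more robust:

```python
# str.strip(chars) treats its argument as a *set* of characters to remove
# from both ends. These work only because '元', '/', 'm', '²', '年', '建',
# '造' never occur inside the numeric part of the field.
price = '7543元/m²'.strip('元/m²')
year = '2008年建造'.strip('年建造')
area = '89.5m²'.strip('m²')
print(price, year, area)  # 7543 2008 89.5

# Suffix-safe alternative: remove a known trailing unit only if present.
def strip_suffix(s, suffix):
    return s[:-len(suffix)] if s.endswith(suffix) else s

print(strip_suffix('7543元/m²', '元/m²'))  # 7543
```

On Python 3.9+, the built-in str.removesuffix does the same thing.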
2. The results:
Thinking back to "hello world" makes this feel a little bittersweet...