perilong16

Tools used: Chrome, Eclipse, Python 3 (Anaconda3)

    Modules: requests, lxml, csv, time

一. Data Collection

  1. Define the target: scrape second-hand housing listings in the Chongqing area (unit price, total price, layout, floor area, etc.)

    1) Open the target site in Chrome and locate the data fields to be scraped

    2) Press F12 on that page, find the target data, and copy its XPath; the result is shown in Figure 1-2-2

      Grabbing the data for a few more listings shows that the number inside li[?] in the XPath differs from listing to listing; each page holds 60 listings in total, so the largest index is li[60].
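The li[?] pattern means a single parameterized XPath covers every listing. A minimal sketch against a stand-in snippet of HTML (the id and layout mimic the real page but are assumptions here):

```python
# Sketch of the li[?] pattern, run against stand-in HTML;
# the id and nesting are assumed to mirror the real listing page.
from lxml import etree

html = '''
<ul id="houselist-mod-new">
  <li><span>3室2厅1卫</span></li>
  <li><span>2室1厅1卫</span></li>
</ul>
'''
sel = etree.HTML(html)

# Counting the <li> nodes avoids hard-coding the 60-per-page limit.
count = len(sel.xpath('//*[@id="houselist-mod-new"]/li'))

for i in range(1, count + 1):
    # The same XPath works for every listing once the index is a parameter.
    model = sel.xpath('//*[@id="houselist-mod-new"]/li[%d]/span/text()' % i)[0]
    print(model)
```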

        Figure 1-2-1                            Figure 1-2-2

 

    2. Analyze the page URL

     As in step 2), the request URL is visible under the Network tab; comparing a few pages shows that different pages are distinguished only by the number after p.
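That pagination scheme can be sketched as a small generator (the same idea the full script below uses); the base URL is the one observed in the Network tab:

```python
# Sketch of the pagination scheme: only the digit after "p" changes.
def page_urls(base, pages):
    for p in range(1, pages + 1):
        yield base + 'p' + str(p)

for url in page_urls('https://chongqing.anjuke.com/sale/', 3):
    print(url)   # .../sale/p1, .../sale/p2, .../sale/p3
```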


二. Python Implementation

  1. Straight to the code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

'''
Created on 2018-11-24
@author: perilong
'''
import csv
import time

import requests
from lxml import etree


def spider(url):
    '''Fetch the target url and return the page source as text.'''
    try:
        header = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                                'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'}
        response = requests.get(url=url, headers=header)
        return response.text
    except requests.RequestException:
        print('failed to spider the target site, please check that the url is correct and the connection is available!')


def spider_detail(url):
    '''Parse the HTML source and extract the parameters of each listing.'''
    response_text = spider(url)
    sel = etree.HTML(response_text)
    for house_num in range(1, 61):      # 60 listings per page
        house_model = sel.xpath('//*[@id="houselist-mod-new"]/li[%d]/div[2]/div[2]/span[1]/text()'
                                % house_num)[0].strip()
        house_area = sel.xpath('//*[@id="houselist-mod-new"]/li[%d]/div[2]/div[2]/span[2]/text()'
                               % house_num)[0].strip()
        house_floor = sel.xpath('//*[@id="houselist-mod-new"]/li[%d]/div[2]/div[2]/span[3]/text()'
                                % house_num)[0].strip()
        house_year = sel.xpath('//*[@id="houselist-mod-new"]/li[%d]/div[2]/div[2]/span[4]/text()'
                               % house_num)[0].strip()
        house_location = sel.xpath('//*[@id="houselist-mod-new"]/li[%d]/div[2]/div[3]/span/text()'
                                   % house_num)[0].strip()
        house_price = sel.xpath('//*[@id="houselist-mod-new"]/li[%d]/div[3]/span[2]/text()'
                                % house_num)[0].strip()
        house_total = sel.xpath('//*[@id="houselist-mod-new"]/li[%d]/div[3]/span[1]/strong/text()'
                                % house_num)[0].strip()
        house_connection = sel.xpath('//*[@id="houselist-mod-new"]/li[%d]/div[2]/div[2]/span[5]/text()'
                                     % house_num)[0].strip()

        # split "district - garden" out of the location string
        house_district = house_location.split('\n')[1].split('-')[0].strip()
        house_garden = house_location.split('\n')[1].split('-')[1].strip()
        house_price = house_price.strip('元/m²')     # keep only the number
        house_year = house_year.strip('年建造')

        house_data = [house_model, house_area, house_floor, house_year,
                      house_price, house_total, house_district, house_garden, house_connection]
        save_csv(house_data)

        print('house_model: ', house_model)
        print('house_area: ', house_area)
        print('house_floor: ', house_floor)
        print('house_year: ', house_year)
        print('house_garden: ', house_garden)
        print('house_price: ', house_price)
        print('house_total: ', house_total)
        print('house_district: ', house_district)
        print('house_connection: ', house_connection)
        print('========================================')


def save_csv(house_data):
    '''Append one row of listing data to the csv file.'''
    try:
        with open('D:/spider_data/QFange/chongqing.csv', 'a', encoding='utf-8-sig', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(house_data)
    except OSError:
        print('write csv error!')


def get_all_urls(page_number):
    '''Yield the url of every page to crawl (page_number pages in total).'''
    if isinstance(page_number, int) and page_number > 0:    # guard against bad input
        for page in range(1, page_number + 1):
            url = 'https://chongqing.anjuke.com/sale/p' + str(page)
            yield url
    else:
        print('page_number is incorrect!')


# write the csv header row
save_csv(['house_model', 'house_area', 'house_floor', 'house_year',
          'house_price', 'house_total', 'house_district', 'house_garden', 'house_connection'])

for url in get_all_urls(50):
    try:
        time.sleep(20)
        spider_detail(url)
    except Exception:
        print('An error occurred while spidering Chongqing house prices!')
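A quick sanity check on the output file is to round-trip one row through the same csv settings the script uses. A throwaway sketch (path and sample values below are made up, not real scraped data):

```python
import csv
import os
import tempfile

# Round-trip sketch: write one row the way save_csv does, then read it back.
# The path and sample values are placeholders, not real scraped data.
path = os.path.join(tempfile.gettempdir(), 'chongqing_sample.csv')
row = ['3室2厅1卫', '89平方米', '低层', '2015', '12000', '110',
       'sample-district', 'sample-garden', 'sample-line']

with open(path, 'w', encoding='utf-8-sig', newline='') as f:
    csv.writer(f).writerow(row)

with open(path, encoding='utf-8-sig', newline='') as f:
    assert list(csv.reader(f))[-1] == row   # the row survives the round trip
print('csv round trip ok')
```

Writing with `utf-8-sig` adds a BOM so Excel opens the Chinese text correctly; reading back with the same encoding strips it again.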

  2. Scraping results:


Thinking back to "hello world", it's a little bittersweet...

 
