[Data Acquisition and Fusion Technology] Assignment 4

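Assignment ①:

Requirements: become proficient with serializing and exporting data through Item and Pipeline in scrapy; scrape book data from the Dangdang website using the Scrapy + XPath + MySQL storage approach.
Candidate site: http://search.dangdang.com/?key=python&act=input
Keyword: students may choose freely
Output: the scraped records as stored in MySQL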

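1. Approach and code

Gitee link: 4/hw_1/demo · 数据采集与融合 - Gitee - OSChina (gitee.com)

1.1 Page analysis

First, open the site. I chose the keyword “机器学习” (machine learning). Inspecting the page shows that each product entry sits under an li tag.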
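
The keyword ends up in the key parameter of the search URL. The start URL in the spider class carries it percent-encoded as GBK bytes rather than UTF-8. A minimal sketch of building such a URL with the standard library; build_search_url is a hypothetical helper name and is not part of the project:

```python
from urllib.parse import quote


def build_search_url(keyword: str) -> str:
    """Return a Dangdang search URL with the keyword percent-encoded as GBK.

    Hypothetical helper for illustration; the URL layout is copied from
    the start URL used in the spider.
    """
    encoded = quote(keyword, encoding="gbk")
    return f"http://search.dangdang.com/?key={encoded}&act=input"


url = build_search_url("机器学习")
print(url)
```

Swapping in another keyword only changes the argument, which makes it easier to try the spider on a different search.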

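This gives the XPath expression that locates each li tag: //li[@ddt-pit][starts-with(@class,'line')]

Next, look at the details of the product under each li tag.

Title: the title is in the title attribute of the first a tag directly under the li tag.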

Price: the price is also easy to locate.

The remaining fields are located in much the same way, which gives the XPath expression for each field:

title = li.xpath("./a[position()=1]/@title").extract_first()
price = li.xpath("./p[@class='price']/span[@class='search_now_price']/text()").extract_first()
author = li.xpath("./p[@class='search_book_author']/span[position()=1]/a/@title").extract_first()
date = li.xpath("./p[@class='search_book_author']/span[position()=last()-1]/text()").extract_first()
publisher = li.xpath("./p[@class='search_book_author']/span[position()=last()]/a/@title").extract_first()
detail = li.xpath("./p[@class='detail']/text()").extract_first()

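1.2 Overall approach

First, write the item class and declare every field to scrape: title, author, price, publisher, publication date, and description.

Then write the spider class to crawl the page: locate each li tag with XPath, then parse each li tag to get the title, author, price, publisher, publication date, and description fields.

Finally, write the pipeline class, which stores the data in the database.

1.3 Code

Item class: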

import scrapy


class GoodsItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    date = scrapy.Field()
    publisher = scrapy.Field()
    detail = scrapy.Field()
    price = scrapy.Field()

Spider class:

import scrapy
from bs4 import UnicodeDammit

from demo.items import GoodsItem


class DangDangSpider(scrapy.Spider):
    # Scrape book data from Dangdang
    name = "DangDangSpider"
    # The key parameter is "机器学习" percent-encoded as GBK
    url = 'http://search.dangdang.com/?key=%BB%FA%C6%F7%D1%A7%CF%B0&act=input'
    start_urls = [url]

    def parse(self, response, **kwargs):
        try:
            dammit = UnicodeDammit(response.body, ["utf-8", "gbk"])
            data = dammit.unicode_markup
            selector = scrapy.Selector(text=data)
            # [@ddt-pit] tests for the attribute; the quoted form '@ddt-pit' is a string and always true
            lis = selector.xpath("//li[@ddt-pit][starts-with(@class,'line')]")
            for li in lis:
                title = li.xpath("./a[position()=1]/@title").extract_first()
                price = li.xpath("./p[@class='price']/span[@class='search_now_price']/text()").extract_first()
                author = li.xpath("./p[@class='search_book_author']/span[position()=1]/a/@title").extract_first()
                date = li.xpath("./p[@class='search_book_author']/span[position()=last()-1]/text()").extract_first()
                publisher = li.xpath("./p[@class='search_book_author']/span[position()=last()]/a/@title").extract_first()
                detail = li.xpath("./p[@class='detail']/text()").extract_first()
                item = GoodsItem()
                item["title"] = title.strip() if title else ""
                item["author"] = author.strip() if author else ""
                item["date"] = date.strip()[1:] if date else ""
                item["publisher"] = publisher.strip() if publisher else ""
                item["price"] = price.strip() if price else ""
                item["detail"] = detail.strip() if detail else ""
                yield item
        except Exception as err:
            print(err)
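
The cleanup at the end of the loop can be factored into a small helper. This is a sketch only: clean_field is a hypothetical name, the sample strings are made up, and the date example assumes the raw text starts with a "/" separator, which is what the [1:] slice in the spider suggests.

```python
from typing import Optional


def clean_field(value: Optional[str], drop_prefix: int = 0) -> str:
    """Strip whitespace and return "" when the XPath matched nothing.

    drop_prefix removes that many leading characters after stripping,
    mirroring the date.strip()[1:] step in the spider.
    """
    if not value:
        return ""
    return value.strip()[drop_prefix:]


title = clean_field("  Python Basics ")            # made-up sample title
missing = clean_field(None)                         # XPath found no node
date = clean_field(" /2021-03-01", drop_prefix=1)  # made-up sample date
print(repr(title), repr(missing), repr(date))
```

Keeping the None check in one place avoids repeating the same conditional for every field.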

Pipeline class:

import pymysql


class GoodsPipeline:
    def __init__(self):
        self.con = pymysql.connect(host='localhost', user='root', password='123456', charset="utf8")
        self.cursor = self.con.cursor()
        self.cursor.execute("CREATE DATABASE IF NOT EXISTS DATA_acquisition")
        self.cursor.execute("USE DATA_acquisition")
        self.cursor.execute("create table IF NOT EXISTS books("
                            "bTitle varchar(512) primary key,"
                            "bAuthor varchar(256),"
                            "bPublisher varchar(256),"
                            "bDate varchar(32),"
                            "bPrice varchar(16),"
                            "bDetail text)")

    def open_spider(self, spider):
        print("opened")
        try:
            self.cursor = self.con.cursor(pymysql.cursors.DictCursor)
            self.cursor.execute("delete from books")
            self.opened = True
            self.count = 0
        except Exception as err:
            print(err)
            self.opened = False

    def close_spider(self, spider):
        if self.opened:
            self.con.commit()
            self.con.close()
            self.opened = False
            print("closed")
            print("Scraped", self.count, "books in total")

    def process_item(self, item, spider):
        try:
            if self.opened:
                self.cursor.execute("insert into books (bTitle,bAuthor,bPublisher,bDate,bPrice,bDetail) "
                                    "values (%s,%s,%s,%s,%s,%s)",
                                    (item["title"], item["author"], item["publisher"], item["date"],
                                     item["price"], item["detail"]))
                self.count += 1
        except Exception as err:
            print(err)
            print("insert failed")
        return item
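
Because bTitle is the primary key, a second book with the same title is rejected and lands in the except branch. The sketch below reproduces that behaviour with Python's built-in sqlite3 as a stand-in for MySQL, so it runs without a server. It is not the pipeline itself: sqlite3 uses ? placeholders where pymysql uses %s, the table is cut down to two columns, and the sample rows are made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # in-memory stand-in for the MySQL database
con.execute("create table books (bTitle text primary key, bPrice text)")

rows = [
    ("Book A", "50.00"),
    ("Book B", "32.50"),
    ("Book A", "45.00"),  # same title again: violates the primary key
]

inserted, failed = 0, 0
for row in rows:
    try:
        # Parameterized insert: values are passed separately from the SQL text.
        con.execute("insert into books (bTitle, bPrice) values (?, ?)", row)
        inserted += 1
    except sqlite3.IntegrityError:
        failed += 1

con.commit()
print(inserted, failed)
```

If distinct editions with identical titles matter, a different key (for example an auto-increment id) would be needed; that is a design choice outside what the original pipeline does.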

Settings:

BOT_NAME = 'demo'

SPIDER_MODULES = ['demo.spiders']
NEWSPIDER_MODULE = 'demo.spiders'

ROBOTSTXT_OBEY = False

LOG_LEVEL = 'WARNING'
ITEM_PIPELINES = {
    'demo.pipelines.GoodsPipeline': 300,
}

DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:92.0) Gecko/20100101 Firefox/92.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

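1.4 Results

[Screenshot of the run results]

2. Takeaways

Through this exercise I became proficient with serializing and exporting data through Item and Pipeline in scrapy. I also gained a basic understanding of using pymysql to interact with a MySQL database from Python, and my grasp of XPath improved considerably.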

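Assignment ②:

Requirements: become proficient with serializing and exporting data through Item and Pipeline in scrapy; scrape foreign-exchange data using the scrapy framework + XPath + MySQL storage approach.
Candidate site: China Merchants Bank: http://fx.cmbchina.com/hq/
Output: stored in MySQL and printed in the following format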

Id	Currency	TSP	CSP	TBP	CBP	Time
1	港币	86.60	86.60	86.26	85.65	15：36：30
2......

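1. Approach and code

Gitee link: 4/hw_2/hw_2 · 数据采集与融合 - Gitee - OSChina (gitee.com)

1.1 Page analysis

After opening the page, we can see that the data to scrape is already a structured table, all of it stored under a table tag.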

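The record for each currency sits in one <tr>, and each field of a row is stored in a td tag, which makes this structured data easy to process. Since the data is structured, the natural approach is to extract each row and then index by row and column to get the values we need.

1.2 Overall approach

First, write the item class and declare every field to scrape.

Then write the spider class to crawl the page, starting with every row of the table, that is, every tr.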

a = selector.xpath("//table[@class='data']//tr")  # all rows of the table
a = a[1:]  # drop the first row (the header)

Then, for each tr, extract the td at the corresponding index.

# Get each field's value by index
Currency = s.xpath("./td[1]/text()").extract_first().strip()
TSP = s.xpath("./td[4]/text()").extract_first().strip()
CSP = s.xpath("./td[5]/text()").extract_first().strip()
TBP = s.xpath("./td[6]/text()").extract_first().strip()
CBP = s.xpath("./td[7]/text()").extract_first().strip()
Time = s.xpath("./td[8]/text()").extract_first().strip()
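
The same row-then-column indexing can be tried offline. The sketch below uses the standard library's xml.etree.ElementTree on a made-up, well-formed table fragment instead of scrapy's Selector on the live page. ElementTree supports only a subset of XPath, so cell text is read through .text rather than /text(). The eight-column layout follows the indexes above; the contents of columns 2 and 3 are invented, and cell is a hypothetical helper that also covers a missing td, which the chained .extract_first().strip() calls would not.

```python
import xml.etree.ElementTree as ET

# Made-up fragment for illustration; the real page is fetched by the spider.
HTML = """
<div>
  <table class="data">
    <tr><td>Currency</td><td>Unit</td><td>Base</td><td>TSP</td><td>CSP</td><td>TBP</td><td>CBP</td><td>Time</td></tr>
    <tr><td> 港币 </td><td>100</td><td>人民币</td><td>86.60</td><td>86.60</td><td>86.26</td><td>85.65</td><td>15:36:30</td></tr>
  </table>
</div>
"""


def cell(row, index):
    """Return the stripped text of the 1-based td at index, or "" if missing."""
    td = row.find(f"./td[{index}]")
    return (td.text or "").strip() if td is not None else ""


root = ET.fromstring(HTML)
rows = root.findall(".//table[@class='data']//tr")[1:]  # skip the header row
records = [
    {"Currency": cell(r, 1), "TSP": cell(r, 4), "CSP": cell(r, 5),
     "TBP": cell(r, 6), "CBP": cell(r, 7), "Time": cell(r, 8)}
    for r in rows
]
print(records)
```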

Finally, write the pipeline class, which stores the data in the database.

1.3 Code

Item class:

import scrapy


class Hw2Item(scrapy.Item):
    ID = scrapy.Field()
    Currency = scrapy.Field()
    TSP = scrapy.Field()
    CSP = scrapy.Field()
    TBP = scrapy.Field()
    CBP = scrapy.Field()
    Time = scrapy.Field()

Pipeline class:

import pymysql


class Hw2Pipeline:
    def open_spider(self, spider):
        try:
            self.con = pymysql.connect(host='localhost', user='root', password='123456', charset="utf8")
            self.cursor = self.con.cursor()
            self.cursor.execute("CREATE DATABASE IF NOT EXISTS DATA_acquisition")
            self.cursor.execute("USE DATA_acquisition")
            self.cursor.execute("create table IF NOT EXISTS cmb("
                                "id varchar(4) primary key,"
                                "currency varchar(8),"
                                "TSP varchar(8),"
                                "CSP varchar(8),"
                                "TBP varchar(8),"
                                "CBP varchar(8),"
                                "Time varchar(16))")
            # Use the same lowercase name as in CREATE TABLE; table names can be case-sensitive
            self.cursor.execute("DELETE FROM cmb")
            self.DBOpen = True
            print("DB open")
            self.cnt = 0
        except Exception as e:
            print(e)
            self.DBOpen = False

    def close_spider(self, spider):
        if self.DBOpen:
            self.con.commit()
            self.con.close()
            self.DBOpen = False
            print("DB closed")

    def process_item(self, item, spider):
        if not self.DBOpen:  # the connection failed in open_spider, so skip the insert
            return item
        self.cursor.execute("insert into cmb values (%s,%s,%s,%s,%s,%s,%s)",
                            (str(self.cnt + 1),
                             item['Currency'],
                             item['TSP'],
                             item['CSP'],
                             item['TBP'],
                             item['CBP'],
                             item['Time'])
                            )
        self.cnt += 1
        return item

Settings:

BOT_NAME = 'hw_2'

SPIDER_MODULES = ['hw_2.spiders']
NEWSPIDER_MODULE = 'hw_2.spiders'

ROBOTSTXT_OBEY = False


LOG_LEVEL = 'WARNING'
ITEM_PIPELINES = {
    'hw_2.pipelines.Hw2Pipeline': 300,
}

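1.4 Results

[Screenshot of the run results]

2. Takeaways

When scraping structured data such as a table, take advantage of the table's structure: positional indexes make it easy to extract exactly the content we want.

Also, seeing is not always believing. What the browser shows is the rendered result, so an XPath expression that works in the browser may fail when the page is fetched by a program.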

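Assignment ③:

Requirements: become proficient with Selenium for finding HTML elements, scraping Ajax-loaded page data, and waiting for HTML elements; scrape stock data for the three boards “沪深A股” (Shanghai and Shenzhen A-shares), “上证A股” (Shanghai A-shares), and “深证A股” (Shenzhen A-shares) using the Selenium + MySQL storage approach.
Candidate site: East Money: http://quote.eastmoney.com/center/gridlist.html#hs_a_board

Output: stored in MySQL and printed in the format below. Column names should be in English, for example id for the sequence number and bStockNo for the stock code; students design the column names themselves: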

No.	Stock code	Stock name	Latest price	Change %	Change	Volume	Turnover	Amplitude	High	Low	Open	Previous close
1	688093	N世华	28.47	62.22%	10.92	26.13万	7.6亿	22.34	32.0	28.08	30.2	17.55
2......

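1. Approach and code

Gitee link: 4/hw_3.py · 数据采集与融合 - Gitee - OSChina (gitee.com)

1.1 Page analysis

As in the previous assignment, the data to scrape is stored in a table: each record is in a tr tag, and each field of a record is in a td.

Also, although three boards have to be scraped, the data sits in the same place on each board's page, and switching boards only requires changing the board abbreviation in the URL.

For example, for the 上证A股 (Shanghai A-shares) board, changing hs in the URL to sh jumps to that board; sz does the same for 深证A股 (Shenzhen A-shares).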
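
A small sketch of building the three board URLs from their abbreviations. BASE_URL is the candidate-site URL without its fragment. Only the hs_a_board fragment appears in the assignment text; the sh and sz fragments are assumed to follow the same pattern, and board_url is a hypothetical helper name.

```python
BASE_URL = "http://quote.eastmoney.com/center/gridlist.html"

# Board abbreviation -> board name. Only "hs" appears in the assignment text;
# "sh" and "sz" are assumed to follow the same pattern.
BOARDS = {
    "hs": "沪深A股",
    "sh": "上证A股",
    "sz": "深证A股",
}


def board_url(abbr: str) -> str:
    """Return the list-page URL for one board abbreviation."""
    if abbr not in BOARDS:
        raise ValueError(f"unknown board: {abbr}")
    return f"{BASE_URL}#{abbr}_a_board"


for abbr, name in BOARDS.items():
    print(name, board_url(abbr))
```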

1.2 Overall approach and code

For each board:

Extract the rows, that is, the list of tr elements.

tr_list = driver.find_elements(By.XPATH, "//tbody/tr")

For each row, extract the td cells, which finally gives a two-dimensional list:

table = []
for tr in tr_list:
    td_list = tr.find_elements(By.XPATH, "./td")
    td_text = [x.text for x in td_list]
    table.append(td_text)

Based on the relative position of each field, pick out by index the data to insert into the database, forming a list of tuples:

table_to_insert = []
for t in table:
    table_to_insert.append((t[0], t[1], t[2], t[4], t[5], t[6], t[7], t[8], t[9], t[10], t[11], t[12], t[13]))
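
The index list skips position 3. A sketch of the same selection on one sample row, using the example values from the assignment's output table. The cell at index 3 and the trailing cell stand for columns that are not stored, and their contents here are made up; select_columns is a hypothetical helper that also guards against rows that are too short.

```python
# Indexes of the cells to keep, as in the tuple above (index 3 is skipped).
KEEP = (0, 1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)


def select_columns(row, keep=KEEP):
    """Return a tuple of the wanted cells, or None if the row is too short."""
    if len(row) <= max(keep):
        return None  # incomplete row, e.g. the table was still loading
    return tuple(row[i] for i in keep)


sample = ["1", "688093", "N世华", "links", "28.47", "62.22%", "10.92",
          "26.13万", "7.6亿", "22.34", "32.0", "28.08", "30.2", "17.55", ""]

selected = select_columns(sample)
print(selected)
print(select_columns(["1", "688093"]))  # too short, so None
```

Returning None for short rows lets the caller filter them out before the batch insert instead of failing with an IndexError.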

Core code for inserting into the database: executemany inserts multiple rows in one call

def insert(self, ls):
    self.cursor.executemany("insert into " + self.tableName + " values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)" , ls)
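
Since the table name is concatenated into the SQL text, it is worth restricting it to known values; placeholders only protect the row values. The sketch below shows the executemany pattern with one table per board, using Python's built-in sqlite3 as a stand-in for MySQL (? placeholders instead of %s). The table names, the shortened three-column schema, and the second sample row are assumptions for illustration, not the project's actual schema.

```python
import sqlite3

ALLOWED_TABLES = {"hs_a", "sh_a", "sz_a"}  # hypothetical per-board table names


def insert_rows(con, table_name, rows):
    """Insert many rows into one board table with a single executemany call."""
    if table_name not in ALLOWED_TABLES:
        raise ValueError(f"unexpected table name: {table_name}")
    con.execute(f"create table if not exists {table_name} "
                "(id text primary key, bStockNo text, bName text)")
    con.executemany(f"insert into {table_name} values (?, ?, ?)", rows)
    con.commit()


con = sqlite3.connect(":memory:")
rows = [("1", "688093", "N世华"), ("2", "000001", "Sample")]
insert_rows(con, "hs_a", rows)
insert_rows(con, "sh_a", rows[:1])  # the same stock can be stored in another board's table

hs_count = con.execute("select count(*) from hs_a").fetchone()[0]
sh_count = con.execute("select count(*) from sh_a").fetchone()[0]
print(hs_count, sh_count)
```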

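1.3 Results

沪深 (Shanghai and Shenzhen) board

[Screenshot of the run results]

上证 (Shanghai) board

[Screenshot of the run results]

深证 (Shenzhen) board

[Screenshot of the run results]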

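2. Takeaways

Because the same stock can appear in more than one board, the data for each board has to go into a separate database table; otherwise the rows would conflict.

Also, subjectively, scraping with Selenium feels slower than with the other frameworks.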