Data Collection: Fifth Practice Session

Assignment 5

Task ①:

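Requirements:
- Become proficient with Selenium: locating HTML elements, crawling Ajax-loaded page data, and waiting for HTML elements.
- Use the Selenium framework to crawl information and images for one category of products on JD.com.
Candidate site: http://www.jd.com/
Keyword: chosen freely by the student
Output: the MySQL table should contain data in the following form: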

mNo	mMark	mPrice	mNote	mFile
000001	三星Galaxy	9199.00	三星Galaxy Note20 Ultra 5G...	000001.jpg
000002......

Implementation:

Define the request headers, set the image storage path, and initialize the variable count that limits the number of items crawled:

headers = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre"}
imagePath = "E:\\PycharmProjects\\DataCollecction\\test\\5\\result_1"
count = 0

Create the browser driver and run it in headless mode:

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')

self.driver = webdriver.Chrome(chrome_options=chrome_options)

# Initializing variables

Connect to the database:

self.con = pymysql.connect(host="127.0.0.1", port=3306, user="root",
                           passwd="qwe1346790", db="datacollection", charset="utf8")
self.cursor = self.con.cursor()

Function that inserts a row into the database:

def insertDB(self, mNo, mMark, mPrice, mNote, mFile):
    try:
        sql = "insert into phones (mNo,mMark,mPrice,mNote,mFile) values (%s,%s,%s,%s,%s)"
        self.cursor.execute(sql, (mNo, mMark, mPrice, mNote, mFile))
    except Exception as err:
        print(err)
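
A parameterized insert like the one above can be tried without a MySQL server. The following is a minimal sketch, assuming an in-memory SQLite database as a stand-in: sqlite3 uses "?" placeholders where pymysql uses "%s", and the helper name insert_phone and the sample row are hypothetical.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the MySQL server,
# so the sketch runs without credentials. sqlite3 binds values with "?"
# placeholders, whereas pymysql uses "%s".
con = sqlite3.connect(":memory:")
cursor = con.cursor()
cursor.execute(
    "create table phones (mNo text primary key, mMark text, "
    "mPrice text, mNote text, mFile text)"
)


def insert_phone(cursor, mNo, mMark, mPrice, mNote, mFile):
    """Insert one row; the driver escapes every value bound to a placeholder."""
    sql = "insert into phones (mNo,mMark,mPrice,mNote,mFile) values (?,?,?,?,?)"
    cursor.execute(sql, (mNo, mMark, mPrice, mNote, mFile))


# The quote characters in the description need no manual escaping.
insert_phone(cursor, "000001", "三星Galaxy", "9199.00", "Note20 'Ultra' 5G", "000001.jpg")
con.commit()  # the row is only persisted once the transaction is committed

rows = cursor.execute("select mNo, mNote from phones").fetchall()
print(rows)
```

Passing the values as a tuple, rather than formatting them into the SQL string, keeps the number of placeholders and values easy to check against the column list.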

Open the URL, locate the search box, and type in the keyword:

self.driver.get(url)
keyInput = self.driver.find_element_by_id("key")
keyInput.send_keys(key)
keyInput.send_keys(Keys.ENTER)

Inspect the page to locate the information to extract:

The image URL comes in two forms:

Price and description:

lis = self.driver.find_elements_by_xpath("//div[@id='J_goodsList']//li[@class='gl-item']")      # nodes holding each phone's information
for li in lis:
    try:
        src1 = li.find_element_by_xpath(".//div[@class='p-img']//a//img").get_attribute("src")      # first image URL form
    except Exception:
        src1 = ""
    try:
        src2 = li.find_element_by_xpath(".//div[@class='p-img']//a//img").get_attribute("data-lazy-img")    # second image URL form
    except Exception:
        src2 = ""
    try:
        price = li.find_element_by_xpath(".//div[@class='p-price']//i").text        # phone price
    except Exception:
        price = "0"
    try:
        note = li.find_element_by_xpath(".//div[@class='p-name p-name-type-2']//em").text       # phone description
        mark = note.split(" ")[0]       # phone brand
        mark = mark.replace("爱心东东\n", "")           # strip the "爱心东东" tag text
        mark = mark.replace(",", "")                    # remove commas
        note = note.replace("爱心东东\n", "")           # strip the "爱心东东" tag text
        note = note.replace(",", "")                    # remove commas
    except Exception:
        note = ""       # fall back to empty strings if the description is missing
        mark = ""
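
The brand and description cleanup can be pulled into a small pure function, which makes it testable without a browser. This is an illustrative sketch, not the code used in the assignment: split_brand_and_note and its noise parameter are hypothetical names, and the sample title is made up.

```python
def split_brand_and_note(raw_text, noise="爱心东东\n"):
    """Return (brand, note) from the raw title text of one product.

    The noise tag and ASCII commas are removed first; the brand is then
    taken as everything before the first space of the cleaned text.
    """
    note = raw_text.replace(noise, "").replace(",", "").strip()
    brand = note.split(" ")[0] if note else ""
    return brand, note


# Hypothetical sample title, shaped like the text of the <em> element.
brand, note = split_brand_and_note("爱心东东\n三星Galaxy Note20 Ultra 5G,12GB")
print(brand)
print(note)
```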

Saving images:

Download function:

def download(self, src1, src2, mFile):
    data = None
    if src1:
        try:
            req = urllib.request.Request(src1, headers=MySpider.headers)
            resp = urllib.request.urlopen(req, timeout=10)
            data = resp.read()
        except:
            pass
    if not data and src2:
        try:
            req = urllib.request.Request(src2, headers=MySpider.headers)
            resp = urllib.request.urlopen(req, timeout=10)
            data = resp.read()
        except:
            pass
    if data:
        print("download begin", mFile)
        with open(MySpider.imagePath + "\\" + mFile, "wb") as fobj:
            fobj.write(data)
        print("download finish", mFile)

if self.No <= 125:              # student ID ends in 25, so crawl 125 records
    no = str(self.No).zfill(6)  # zero-pad the serial number to six digits
    print(no, mark, price)      # print the serial number, brand, and price
    if src1:                    # the image URL is in the src attribute
        src1 = urllib.parse.urljoin(self.driver.current_url, src1)        # build the absolute URL
        p = src1.rfind(".")         # position of the file extension
        mFile = no + src1[p:]       # file name used when saving the image
    elif src2:                  # the image URL is in the data-lazy-img attribute
        src2 = urllib.parse.urljoin(self.driver.current_url, src2)
        p = src2.rfind(".")
        mFile = no + src2[p:]
    if src1 or src2:
        T = threading.Thread(target=self.download, args=(src1, src2, mFile))    # save the image in a separate thread
        T.daemon = False
        T.start()
        self.threads.append(T)
    else:
        mFile = ""
    self.insertDB(no, mark, price, note, mFile)     # insert into the database
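
Deriving the file name with rfind(".") works for plain image URLs but picks up any query string that follows the extension. A possible alternative is sketched below, assuming the extension should come from the URL path only. The helper name image_file_name, the fallback extension ".jpg", and the sample URLs are all hypothetical.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def image_file_name(serial, page_url, src, lazy_src=""):
    """Return (absolute_url, file_name) for the first non-empty image URL."""
    raw = src or lazy_src
    if not raw:
        return "", ""
    absolute = urljoin(page_url, raw)       # resolves "//host/path" and relative paths
    path = urlsplit(absolute).path          # ignore any "?query" when reading the suffix
    ext = posixpath.splitext(path)[1] or ".jpg"     # assumption: default to .jpg
    return absolute, str(serial).zfill(6) + ext


# Hypothetical page and image URLs for illustration only.
page = "https://search.jd.com/Search?keyword=phone"
first = image_file_name(1, page, "//img10.360buyimg.com/n7/jfs/abc.jpg")
second = image_file_name(42, page, "", "/img/x.png?v=2")
print(first)
print(second)
```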

Pagination:

nextPage = self.driver.find_element_by_xpath("//span[@class='p-num']//a[@class='pn-next']")     # locate the "next page" link
time.sleep(10)
nextPage.click()            # click to go to the next page
self.processSpider()        # recursive call to process the next page
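
Because processSpider() calls itself for every new page, each page adds a stack frame. The same control flow can be written as a loop. The sketch below shows the idea with a fake pager instead of a browser; FakePager, crawl, and the limit parameter are hypothetical names, not part of the assignment code.

```python
class FakePager:
    """Hypothetical stand-in for the browser: a fixed list of result pages."""

    def __init__(self, pages):
        self.pages = pages
        self.index = 0

    def items(self):
        """Return the items on the current page."""
        return self.pages[self.index]

    def go_next(self):
        """Move to the next page; return False when there is none."""
        if self.index + 1 >= len(self.pages):
            return False
        self.index += 1
        return True


def crawl(pager, limit):
    """Collect items page by page until the limit or the last page is reached."""
    collected = []
    while True:
        for item in pager.items():
            if len(collected) >= limit:
                return collected
            collected.append(item)
        if not pager.go_next():
            return collected


limited = crawl(FakePager([[1, 2, 3], [4, 5, 6], [7]]), limit=5)
everything = crawl(FakePager([[1, 2, 3], [4, 5, 6], [7]]), limit=100)
print(limited, everything)
```

In the real spider, items() would correspond to the XPath lookup of the product nodes and go_next() to clicking the next-page link.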

Main program:

url = "http://www.jd.com"
spider = MySpider()
while True:
    print("1. Crawl")
    print("2. Show")
    print("3. Exit")
    s = input("Choose (1,2,3):")
    if s == "1":
        spider.executeSpider(url, "手机")      # search keyword "手机" (mobile phone)
        continue
    elif s == "2":
        spider.showDB()
        continue
    elif s == "3":
        break

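Results:

Reflections:
- This task mainly exercises Selenium: locating HTML elements, crawling Ajax-loaded page data, and waiting for HTML elements. Two points need care. First, the image URL comes in two forms (src and data-lazy-img), and both must be handled. Second, when inserting into the database, my statement initially failed because the values() clause was malformed; I will pay closer attention to this in the future.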

Task ②:

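Requirements:
- Become proficient with Selenium: locating HTML elements, simulating user login, crawling Ajax-loaded page data, and waiting for HTML elements.
- Use Selenium with MySQL to simulate logging in to the MOOC site, collect the information for the courses already taken in the student's own account, and save it to MySQL (course ID, course name, institution, progress, course status, course image URL). Also save the images to the imgs folder in the local project root, naming each image after its course.
Candidate site: China University MOOC: https://www.icourse163.org

Output: storage and output format in the MySQL database

Column names should be in English, for example Id for the course ID and cCourse for the course name; students design the column names themselves: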

Id	cCourse	cCollege	cSchedule	cCourseStatus	cImgUrl
1	Python网络爬虫与信息提取	北京理工大学	已学3/18课时	2021年5月18日已结束	http://edu-image.nosdn.127.net/C0AB6FA791150F0DFC0946B9A01C8CB2.jpg
2......

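Implementation:

Setting the request headers, connecting to the database, and so on are the same as in Task ①, so they are not repeated here.

First, handle the simulated login with Selenium:

① Locate the login button and click it: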

self.driver.get(url)
self.driver.find_element_by_xpath("//div[@id='app']/div/div/div[1]/div[3]/div[3]/div").click()      # click "Log in"
time.sleep(1)

② Then click "Other login methods":

self.driver.find_element_by_xpath("//div[starts-with(@id,'auto-id')]/div/div/div/div[2]/span").click()      # click "Other login methods"
time.sleep(1)

③ Click "Log in with phone number":

self.driver.find_element_by_xpath("//div[starts-with(@id,'auto-id')]/div/div/div/div/div[1]/div/div[1]/div[1]/ul/li[2]").click()    # click "Log in with phone number"
time.sleep(1)

④ Switch into the iframe, enter the phone number and password, and click the login button:

self.driver.switch_to.frame(self.driver.find_elements_by_tag_name("iframe")[1])     # switch into the login iframe
time.sleep(1)
self.driver.find_element_by_xpath("//input[@id='phoneipt']").send_keys("182xxxx5186")   # enter the phone number
time.sleep(1)
self.driver.find_element_by_xpath("//input[@placeholder='请输入密码']").send_keys("xxxxxxx")  # enter the password
time.sleep(1)
self.driver.find_element_by_xpath("//a[@id='submitBtn']").click()       # click the login button
time.sleep(3)
print(self.driver.current_url)      # print the URL of the page after login

⑤ Click "Personal Center":

self.driver.find_element_by_xpath("//div[@class='m-navTop-func-i']//div[@class='web-nav-right-part']/div[@class='ga-click u-navLogin-myCourse']//div[@class='ga-click u-navLogin-myCourse u-navLogin-center-container']//a").click()        # click "Personal Center"
time.sleep(3)

Inspect the page to locate the information we need:

for div in divs:
    try:
        imgUrl = div.find_element_by_xpath(".//a/div[@class='img']/img").get_attribute("src")
        imgUrl = imgUrl.split("?")[0]       # drop the query string
    except Exception:
        imgUrl = ""
    course = ""         # defaults so a failed lookup cannot leave these names unset
    college = ""
    try:
        id = self.count
        course = div.find_element_by_xpath(".//a/div[@class='body']//div[@class='title']//span[@class='text']").text
        college = div.find_element_by_xpath(".//a/div[@class='body']//div[@class='school']/a").text
        schedule = div.find_element_by_xpath(".//a/div[@class='body']//div[@class='personal-info']//div[@class='text']/a/span").text
        course_statues = div.find_element_by_xpath(".//a/div[@class='body']/div[@class='personal-info']/div[@class='course-status']").text
    except Exception:
        schedule = ""
        course_statues = ""
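
Splitting on "?" removes the query string from the image URL. An equivalent using the standard URL parser is sketched below; it also drops a fragment if one is present. The helper name strip_query and the query parameters in the sample URL are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Return the URL with its query string and fragment removed."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# The query parameters here are made up for illustration.
clean = strip_query(
    "http://edu-image.nosdn.127.net/C0AB6FA791150F0DFC0946B9A01C8CB2.jpg?imageView&thumbnail=223y124"
)
print(clean)
```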

Save the images and insert the data into the database:

Download function:

def download(self, src, mFile):
    req = urllib.request.Request(src, headers=MySpider.headers)
    resp = urllib.request.urlopen(req, timeout=10)
    data = resp.read()
    print("download begin", mFile)
    with open(MySpider.imagePath + "\\" + mFile, "wb") as fobj:
        fobj.write(data)
    print("download finish", mFile)

Saving each record:

mFile = str(self.count) + ".jpg"        # image file name
T = threading.Thread(target=self.download, args=(imgUrl, mFile))        # save the image in a separate thread
T.daemon = False
T.start()
self.threads.append(T)
self.count += 1     # increment the counter
self.insertDB(id, course, college, schedule, course_statues, imgUrl)        # insert the record
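
The snippet above names each image after the counter, while the requirements ask for the course name. One way to build a file name from a course title is sketched here, assuming that characters which Windows forbids in file names should be replaced with underscores. The helper course_image_name and its fallback parameter are hypothetical.

```python
import posixpath
import re
from urllib.parse import urlsplit


def course_image_name(course, img_url, fallback="course"):
    """Build a file name from the course title plus the image URL's extension."""
    # Replace characters that are not allowed in Windows file names, and whitespace.
    safe = re.sub(r'[\\/:*?"<>|\s]+', "_", course).strip("_") or fallback
    ext = posixpath.splitext(urlsplit(img_url).path)[1] or ".jpg"   # assumption: default to .jpg
    return safe + ext


name_a = course_image_name(
    "Python网络爬虫与信息提取",
    "http://edu-image.nosdn.127.net/C0AB6FA791150F0DFC0946B9A01C8CB2.jpg",
)
name_b = course_image_name("C/C++: basics?", "http://example.com/y.png?z=1")
print(name_a, name_b)
```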

The personal center has two tabs, MOOC and SPOC, so a flag variable controls when to move on and crawl the second tab:

flag = 1
if self.flag == 1:
    self.flag = 0
    self.driver.find_element_by_xpath("//div[@id='j-module-tab']/div/div[2]/a").click()  # click the SPOC tab
    time.sleep(3)
    self.processSpider()  # recursive call to crawl the SPOC tab

Running the main function:

def executeSpider(self, url):
        starttime = datetime.datetime.now()
        print("Spider starting......")
        self.startUp(url)
        print("Spider processing......")
        self.processSpider()
        print("Spider closing......")
        self.closeUp()
        for t in self.threads:
            t.join()
        print("Spider completed......")
        endtime = datetime.datetime.now()
        elapsed = (endtime - starttime).seconds
        print("Total ", elapsed, " seconds elapsed")

url = "https://www.icourse163.org"
spider = MySpider()
spider.executeSpider(url)
# spider.showDB()
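
executeSpider() keeps a list of threads and joins each one at the end. A thread pool from the standard library offers the same guarantee with less bookkeeping. The sketch below is only an illustration of that alternative: fake_download and the job list are hypothetical and no network access takes place.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_download(src, file_name):
    """Hypothetical stand-in for download(): returns a result instead of writing a file."""
    return file_name, len(src)


# Hypothetical jobs: (image URL, target file name).
jobs = [
    ("http://example.com/a.jpg", "0.jpg"),
    ("http://example.com/b.jpg", "1.jpg"),
    ("http://example.com/c.jpg", "2.jpg"),
]

# Leaving the "with" block waits for every submitted job, like joining each thread.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(fake_download, src, name) for src, name in jobs]
    results = [future.result() for future in futures]   # re-raises any worker exception

print(results)
```

Calling result() on each future also surfaces exceptions raised inside a worker, which a plain Thread would only print to the console.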

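Results:

Reflections:
- The main difficulty in this task is simulating user login with Selenium. The login flow passes through several page transitions, so sleep() calls must be placed carefully to give each page time to load.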
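
Fixed sleep() calls either wait too long or not long enough. Selenium provides explicit waits for this purpose (WebDriverWait together with expected_conditions). The idea behind them can be shown in plain Python, as in the sketch below: wait_until is a hypothetical helper, not a Selenium API, and the fake condition stands in for an element lookup.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.2):
    """Poll condition() until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(interval)


# Fake condition: the "element" only becomes available on the third poll.
attempts = {"n": 0}


def element_ready():
    attempts["n"] += 1
    return "login-form" if attempts["n"] >= 3 else None


found = wait_until(element_ready, timeout=2.0, interval=0.01)
print(found, attempts["n"])
```

The call returns as soon as the condition holds, so the script waits only as long as the page actually needs.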

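Task ③:

Requirements: understand big-data services and become familiar with Xshell.
- Complete the tasks in the document 华为云_大数据实时分析处理实验手册-Flume日志采集实验（部分）v2.docx (Huawei Cloud big-data real-time analytics lab manual, Flume log collection experiment, partial, v2), namely the five tasks below; see the document for the detailed steps.
- Environment setup
  - Task 1: Enable the MapReduce Service
- Hands-on real-time analytics development:
  - Task 1: Generate test data with a Python script
  - Task 2: Configure Kafka
  - Task 3: Install the Flume client
  - Task 4: Configure Flume to collect data
Implementation:
- Enable the MapReduce Service
- Generate test data with a Python script
- Configure Kafka
- Install the Flume client
- Configure Flume to collect data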

Full code repository