Web Scraping (Part 2): Scraping Images from Toutiao (今日头条)

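Note: this post is mainly a set of notes I took after watching the Jingmi (静觅) tutorial videos. The original tutorial is at https://cuiqingcai.com/

I'm still a beginner and learning slowly. I've only been at this for 2 days, so corrections and advice are very welcome.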

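I. Workflow overview

1. Analyze the Toutiao website

2. Fetch the index page content

3. Fetch the detail page content

4. Download the images and save the results to a database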

II. Implementation

2.1 Analyzing the Toutiao website

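1. First, open Toutiao and enter a keyword to reach the index (search results) page. We need to analyze the site to obtain the URLs that lead to the detail pages.

2. Looking inside the data field of the response, we can see the URL of each detail page, so this is the information we will need to extract shortly.

3. As we scroll down and more gallery entries load, many new Ajax requests appear. This tells us that we can fetch different index pages by changing the offset parameter, and from them collect the URLs of different gallery detail pages.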

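4. Next, analyze the source of a gallery detail page to find the image URLs. I ran into a pitfall here while learning: when I used Chrome's "Inspect" tool to analyze the page, the original URL changed from

    https://m.toutiao.com/a6511830952644182542/

to

    https://m.toutiao.com/a6511830952644182542/

When that happens, the image information is no longer visible under the Doc tab. Being a beginner, I searched for a long time without finding it. I then tried another browser and found that Firefox does not have this problem, so I used Firefox for the rest of the analysis.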


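Analyzing the page source further shows where the image URLs are: in the gallery field. With that, the page analysis is basically done, and what remains is to implement it in code.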
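
The index-page part of this analysis can be sketched without touching the network. The snippet below is only an illustration under stated assumptions: `build_offsets` and `extract_article_urls` are hypothetical helper names, the page size of 20 mirrors the count parameter used in the request code of this post, and `SAMPLE_RESPONSE` is a made-up payload that merely imitates the data / article_url layout described above. It is not a real Toutiao response.

```python
import json

PAGE_SIZE = 20  # assumption: each Ajax request returns 20 entries


def build_offsets(start_group, end_group, page_size=PAGE_SIZE):
    """Return the offset value for every group in the inclusive range."""
    return [group * page_size for group in range(start_group, end_group + 1)]


def extract_article_urls(response_text):
    """Collect article_url from every entry under the top-level 'data' key."""
    try:
        payload = json.loads(response_text)
    except (TypeError, ValueError):
        # None or text that is not JSON: nothing to extract.
        return []
    if not isinstance(payload, dict):
        return []
    urls = []
    for item in payload.get('data') or []:
        url = item.get('article_url') if isinstance(item, dict) else None
        if url:
            urls.append(url)
    return urls


# Made-up payload, for illustration only.
SAMPLE_RESPONSE = json.dumps({
    'data': [
        {'title': 'gallery A', 'article_url': 'https://example.com/a1/'},
        {'title': 'an entry without a link'},
        {'title': 'gallery B', 'article_url': 'https://example.com/a2/'},
    ]
})

print(build_offsets(0, 2))
print(extract_article_urls(SAMPLE_RESPONSE))
```

Entries without an article_url are skipped here, which plays the same role as the `if url:` check in the full script.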

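2.2 Code implementation

I'll only go over the code briefly. After two days of learning, I've found that the hard part is analyzing the website. The rest is just using the tools to fetch the data.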


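import json
import os
import re
from hashlib import md5  # public API; _md5 is a private implementation module
from json import JSONDecodeError
from multiprocessing import Pool

import pymongo
import requests
from bs4 import BeautifulSoup
from requests import RequestException

from config import *

# connect=False delays connecting until the first operation,
# which matters because worker processes are created later
client = pymongo.MongoClient(MONGO_URL, connect=False)
db = client[MONGO_DB]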


def get_page_index(offset, keyword):
    # Query parameters of the Ajax search request; offset selects the page of results
    data = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': '20',
        'cur_tab': 1
    }
    headers = {'User-Agent': 'Mozilla/5.0'}
    url = 'https://www.toutiao.com/search_content/?'
    try:
        response = requests.get(url, params=data, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Error requesting the index page')
        return None


def get_page_detail(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Error requesting the detail page', url)
        return None


def parse_page_detail(html, url):
    soup = BeautifulSoup(html, 'lxml')
    title = soup.select('title')[0].get_text()
    # The image data sits inside a JavaScript call: gallery: JSON.parse("..."),
    pattern = re.compile(r'gallery: JSON\.parse\("(.*?)"\),', re.S)
    gallery = re.search(pattern, html)
    if gallery:
        gallery = gallery.group(1)
        # Strip the backslashes that escape the embedded JSON string
        gallery = re.sub(r'\\', '', gallery)
        data = json.loads(gallery)
        if data and 'sub_images' in data:
            sub_images = data.get('sub_images')
            images = [item.get('url') for item in sub_images]
            for image in images:
                download_image(image)
            return {
                'title': title,
                'url': url,
                'images': images
            }


def parse_page_index(html):
    if not html:
        # get_page_index returns None on failure; json.loads(None) would raise TypeError
        return
    try:
        data = json.loads(html)
        if data and 'data' in data.keys():
            for item in data.get('data'):
                yield item.get('article_url')
    except JSONDecodeError:
        pass


def save_to_mongo(result):
    # insert_one replaces Collection.insert, which was removed in PyMongo 4
    if db[MONGO_TABLE].insert_one(result):
        print('Saved to MongoDB successfully', result)
        return True
    return False

def download_image(url):
    print('downloading', url)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            save_image(response.content)
    except RequestException:
        print('Error downloading image', url)
    return None


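def save_image(content):
    # Name the file after the MD5 of its bytes so identical images are saved only once
    file_path = '{0}/{1}.{2}'.format(os.getcwd(), md5(content).hexdigest(), 'jpg')
    if not os.path.exists(file_path):
        # The with block closes the file, so no explicit close() is needed
        with open(file_path, 'wb') as f:
            f.write(content)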


def main(offset):
    index_html = get_page_index(offset, KEYWORD)
    for url in parse_page_index(index_html):
        if url:
            detail_html = get_page_detail(url)
            if detail_html:
                result = parse_page_detail(detail_html, url)
                if result:
                    save_to_mongo(result)


if __name__ == '__main__':
    groups = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]
    pool = Pool()
    pool.map(main, groups)
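
One detail of save_image is worth illustrating: because each file is named after the MD5 digest of its bytes, identical images map to the same path and are written only once. The sketch below is a minimal illustration, assuming a hypothetical `save_bytes` helper and a temporary directory in place of the current working directory.

```python
import os
import tempfile
from hashlib import md5


def save_bytes(content, directory, extension='jpg'):
    """Write content to <directory>/<md5>.<extension>; return (path, written)."""
    name = '{0}.{1}'.format(md5(content).hexdigest(), extension)
    path = os.path.join(directory, name)
    if os.path.exists(path):
        return path, False  # the same bytes were saved before
    with open(path, 'wb') as f:
        f.write(content)
    return path, True


with tempfile.TemporaryDirectory() as tmp:
    first = save_bytes(b'fake image bytes', tmp)
    second = save_bytes(b'fake image bytes', tmp)  # duplicate content
    third = save_bytes(b'other image bytes', tmp)
    # Whether each call actually wrote a file
    print(first[1], second[1], third[1])
    # Number of files that ended up on disk
    print(len(os.listdir(tmp)))
```

The extension is fixed to jpg here, as in the script, regardless of the real image format.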

config.py

MONGO_URL = 'localhost'
MONGO_DB = 'toutiao'
MONGO_TABLE = 'toutiao'
GROUP_START = 1
GROUP_END = 20
KEYWORD = '街拍'  # search keyword ("street snaps")

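Problems encountered:

1. When matching with a regular expression, if the text you want to match contains characters such as '(', ')' or '.', you must put a backslash (\) in front of each of them when writing the pattern:

    pattern = re.compile(r'gallery: JSON\.parse\("(.*?)"\),', re.S)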
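
To see the escaping rule in action, here is a small sketch run against a made-up string. `SAMPLE_HTML` is invented for illustration and only imitates the `gallery: JSON.parse("...")` shape described earlier, and `extract_gallery` is a hypothetical helper. The sketch also uses `re.escape`, a standard-library function that adds the backslashes for you.

```python
import json
import re

# Invented snippet: a JSON string embedded in JavaScript with escaped quotes.
SAMPLE_HTML = r'gallery: JSON.parse("{\"sub_images\":[{\"url\":\"http://example.com/1.jpg\"}]}"),'

# '.', '(' and ')' are regex metacharacters, so they are escaped by hand here.
PATTERN = re.compile(r'gallery: JSON\.parse\("(.*?)"\),', re.S)


def extract_gallery(html):
    """Return the decoded gallery dict, or None when the pattern is absent."""
    match = PATTERN.search(html)
    if not match:
        return None
    raw = re.sub(r'\\', '', match.group(1))  # drop the JavaScript escaping
    return json.loads(raw)


gallery = extract_gallery(SAMPLE_HTML)
print([image['url'] for image in gallery['sub_images']])

# re.escape produces the same escaping automatically for a literal prefix.
print(re.escape('JSON.parse('))

# Without escaping, the unmatched '(' is not even a valid pattern.
try:
    re.compile('JSON.parse(')
except re.error as exc:
    print('invalid pattern:', exc)
```

Removing every backslash is a blunt approach: it works for this sample, but it could corrupt data that contains legitimate escape sequences.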

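2. db = client[MONGO_DB] must use square brackets rather than parentheses; otherwise the database cannot be accessed properly.

3. I couldn't find the image URLs in Chrome, but after switching to Firefox I found them right away (lol).

After running the script, the images get downloaded, and then you can sit back and look at... emmmm, I'm here to learn the tech, not to look at pictures.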