Google图片和NASA 网站图片的爬虫

1.根据关键字爬取NASA网站上的图片

首先针对需要爬取的网站进行分析，输入关键字查找需要的内容

通过关键字请求，网页每次会加载20张的缩略图，分析网页源码能够很容易的找到缩略图的url:

然后再点开缩略图，会链接的另一个网页，从这里可以分析出更高分辨率大图的url：

最后根据取得的url地址下载原图就可以了，下面附上源代码


# -*- coding: utf-8 -*-
import urllib
import requests
from bs4 import BeautifulSoup
import re
import json

def getUrl(keyword):
    user_agent = \'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:53.0) Gecko/20100101 Firefox/53.0\'
    results = requests.get("https://nasasearch.nasa.gov/search/images",
                           params={\'affiliate\': \'nasa\', \'query\': keyword},
                           headers={\'User-Agent\': user_agent})

    results.encoding = \'utf-8\'
    s = requests.session()
    s.keep_alive = False
    soup = BeautifulSoup(results.text, \'lxml\')
    # 获取网页中的所有div ,class=url的文本
    for link in soup.find_all(\'div\', class_=\'url\'):
        # 拼接url
        html = requests.get(\'https://\'+link.text)
        soup1 = BeautifulSoup(html.text, \'lxml\')
        # 获取字段
        data = soup1.find(\'script\', attrs={"type": "application/ld+json"})
        # json字符串转换为字典
        jsonobj = json.loads(data.text)
        # 从json块中获取图片地址
        imageUrl = jsonobj[\'@graph\'][0][\'image\'][\'url\']
        namelist = imageUrl.split(\'/\')
        # 获取图片名称
        name = namelist[-1].split(\'.\')[0]
        downloadImage(imageUrl, name)

def downloadImage(imageUrl, name):
    path = \'D:/space/\'
    print(name)
    if imageUrl is not None:
        try:
            image_file = requests.get(imageUrl, stream=True, timeout=9)
        except requests.exceptions.RequestException:
            print(\'网络异常\')
        # else:
            # if image_file.status_code is not requests.codes.ok:
            #print(\'{}\'.format(imageUrl) + \'链接为空！\')
        else:
            image_file_path = \'{}{}.jpg\'.format(path, name)
            print(\'正在下载:\' + \'{}.jpg\'.format(name))
            with open(image_file_path, \'wb\') as f:
                f.write(image_file.content)
            print(\'下载完成！\')


if __name__ == "__main__":
    keyword = input()
    getUrl(keyword)

2.爬取谷歌图片

这里主要使用了一个开源代码，爬虫作者github地址：https://github.com/YoongiKim/AutoCrawler
爬虫的效果还是很不错的，具体的使用作者在主页也详细的说明了