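1. Crawling images from the NASA website by keyword
First, analyze the site to be crawled: enter a keyword in its search box to find the content you need.
Each keyword request makes the page load 20 thumbnails, and the thumbnail URLs are easy to find in the page source.
Clicking a thumbnail opens another page, and the URL of the higher-resolution image can be extracted from that page.
Finally, download the original image from that URL. The source code is attached below.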
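As a minimal sketch of this step, assume the detail page carries its metadata in a <script type="application/ld+json"> block with the image listed under '@graph', which is the layout the script in this post reads. The names JsonLdCollector and extract_image_url, the sample HTML, and the example.com address are all hypothetical. Only the standard library is used, so the snippet runs without network access.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = None  # list of text chunks while inside a JSON-LD script

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append("".join(self._buffer))
            self._buffer = None


def extract_image_url(page_html):
    """Return the first image URL found under '@graph', or None if absent."""
    collector = JsonLdCollector()
    collector.feed(page_html)
    for block in collector.blocks:
        try:
            doc = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed metadata instead of crashing
        if not isinstance(doc, dict):
            continue
        for node in doc.get("@graph", []):
            if not isinstance(node, dict):
                continue
            image = node.get("image")
            if isinstance(image, dict) and image.get("url"):
                return image["url"]
    return None


# Hypothetical detail page, used only to exercise the function offline.
sample_html = """
<html><head>
<script type="application/ld+json">
{"@graph": [{"@type": "Article", "image": {"url": "https://example.com/full/earth.jpg"}}]}
</script>
</head></html>
"""
print(extract_image_url(sample_html))  # prints https://example.com/full/earth.jpg
```

Returning None for a missing or malformed block lets the caller skip that result and continue with the remaining thumbnails.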
# -*- coding: utf-8 -*-
import json
import os

import requests
from bs4 import BeautifulSoup


def getUrl(keyword):
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:53.0) Gecko/20100101 Firefox/53.0'
    results = requests.get("https://nasasearch.nasa.gov/search/images",
                           params={'affiliate': 'nasa', 'query': keyword},
                           headers={'User-Agent': user_agent})
    results.encoding = 'utf-8'
    soup = BeautifulSoup(results.text, 'lxml')
    # Each search result has a <div class="url"> whose text is the detail-page address
    for link in soup.find_all('div', class_='url'):
        # Build the full URL of the detail page
        html = requests.get('https://' + link.text.strip())
        soup1 = BeautifulSoup(html.text, 'lxml')
        # Locate the JSON-LD metadata block
        data = soup1.find('script', attrs={"type": "application/ld+json"})
        if data is None:
            continue  # no metadata on this page, skip it
        # Parse the JSON string into a dict
        jsonobj = json.loads(data.text)
        # Read the image URL from the JSON block
        imageUrl = jsonobj['@graph'][0]['image']['url']
        namelist = imageUrl.split('/')
        # Use the file name without its extension as the image name
        name = namelist[-1].split('.')[0]
        downloadImage(imageUrl, name)


def downloadImage(imageUrl, name):
    path = 'D:/space/'
    os.makedirs(path, exist_ok=True)  # make sure the target folder exists
    print(name)
    if imageUrl is not None:
        try:
            image_file = requests.get(imageUrl, stream=True, timeout=9)
            # Treat 4xx/5xx responses as failures instead of saving an error page
            image_file.raise_for_status()
        except requests.exceptions.RequestException:
            print('Network error or bad HTTP status: ' + imageUrl)
        else:
            image_file_path = '{}{}.jpg'.format(path, name)
            print('Downloading: ' + '{}.jpg'.format(name))
            with open(image_file_path, 'wb') as f:
                f.write(image_file.content)
            print('Download complete!')


if __name__ == "__main__":
    keyword = input()
    getUrl(keyword)
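The script above keeps only the part of the file name before the first dot and always saves with a .jpg extension. As a possible refinement, the sketch below derives the file name from the URL and keeps its original extension. The helper filename_from_url and the example.com URLs are hypothetical and not part of the original script.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(image_url, default="image.jpg"):
    """Return the last path segment of image_url, ignoring query and fragment."""
    # urlsplit separates the path from "?query" and "#fragment" parts
    path = unquote(urlsplit(image_url).path)
    # basename drops any directory components that came from the URL
    name = posixpath.basename(path).replace("\\", "_")
    return name or default


print(filename_from_url("https://example.com/files/pia12345.png?itok=abc"))  # prints pia12345.png
print(filename_from_url("https://example.com/"))  # prints image.jpg
```

The returned name can then be joined to the download folder with os.path.join, so a PNG or TIFF original is not saved under a misleading .jpg name.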
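2. Crawling Google Images
This part mainly relies on an open-source crawler; the author's GitHub repository is https://github.com/YoongiKim/AutoCrawler
The crawler works quite well, and the author explains its usage in detail on the project home page.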