【Question title】: Web Scraping - image(s) from a website
【Posted】: 2018-11-13 14:03:01
【Question】:

I am following a tutorial from here, called

Introduction to Web Scraping (Python) - Lesson 4 (Download Images)

Below is the code I am running on Ubuntu 16.04:

import urllib
from urllib2 import urlopen, build_opener
from bs4 import BeautifulSoup

def make_soup(url):
    thepage = urlopen(url)

    opener = build_opener()
    opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
    response = opener.open('https://www.imdb.com/search/name?gender=male,female&ref_=nv_tp_cel_1')
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

soup = make_soup("https://www.imdb.com/search/name?gender=male,female&ref_=nv_tp_cel_1")

i=1

for img in soup.findAll('img'):
    print(img.get('src'))

    filename=str(i)
    i=i+1

    #urllib.urlretrieve(img.get('src'),filename)
    imagefile = open(filename + ".jpeg", 'wb')
    theLink = urllib.urlopen(img.get('src'))
    imagefile.write(theLink.read())
    imagefile.close()

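It appears to download all the images, but when I try to open any of them, I get:

Could not load image "1.jpeg". Error interpreting JPEG image file (Not a JPEG file: starts with 0x3c 0x21)

If I run less 1.jpeg, I get: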

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML><HEAD><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<TITLE>ERROR: The request could not be satisfied</TITLE>
</HEAD><BODY>
<H1>403 ERROR</H1>
<H2>The request could not be satisfied.</H2>
<HR noshade size="1px">
Bad request.

<BR clear="all">
<HR noshade size="1px">
<PRE>
Generated by cloudfront (CloudFront)
Request ID: 9aEqiCgrzrSAsiL9Q8uvHlgu4SAaDxdBNclFG3AJjxtKn1R7RA35-A==
</PRE>
<ADDRESS>
</ADDRESS>
</BODY></HTML>

My goal is to download all the images from a website. I have tried other websites as well, without success.
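The bytes in the error message already hint at what went wrong: 0x3c 0x21 are the ASCII characters "<!", the start of the "<!DOCTYPE" line that less shows, so the saved file is an HTML error page rather than image data. A minimal sketch of that check, assuming only the standard file signatures for JPEG, PNG and GIF (the helper name sniff_payload is made up for this illustration):

```python
def sniff_payload(data: bytes) -> str:
    """Guess the payload type from its first bytes (hypothetical helper)."""
    if data.startswith(b"\xff\xd8\xff"):          # JPEG start-of-image marker
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):     # PNG signature
        return "png"
    if data.startswith((b"GIF87a", b"GIF89a")):   # GIF signatures
        return "gif"
    if data.lstrip()[:1] == b"<":                 # HTML/XML, e.g. an error page
        return "markup"
    return "unknown"


# The two bytes from the error message decode to "<!".
print(bytes([0x3C, 0x21]))                          # b'<!'
print(sniff_payload(b"<!DOCTYPE HTML PUBLIC ..."))  # markup
```

Running a check like this on the response body before writing it to disk would flag the 403 page instead of saving it as 1.jpeg.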

【Comments】:

Tags: python web-scraping beautifulsoup


【Solution 1】:

The code below may help you:

import requests, urllib.request
from bs4 import BeautifulSoup

# Make HTTP request
url = "https://www.imdb.com/search/name/?gender=male,female&ref_=nv_tp_cel_1"
response = requests.get(url)
print(response.status_code)

# Parse HTML
soup = BeautifulSoup(response.content, 'html.parser')
response.close()

lister_list = soup.find('div',{"class":"lister-list"})
lister_items = lister_list.find_all("div",{"class":"lister-item"})

for i in lister_items:
    image = {}

    # Find image info inside each item
    image['item'] = i.find("div",{"class":"lister-item-image"}).find("img")
    image['alt'] = image['item']['alt']
    image['src'] = image['item']['src']

    # Save image
    urllib.request.urlretrieve(str(image['src']), f"{image['alt']}.jpg")
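One possible hardening of the save step, offered as a sketch rather than a definitive fix. It assumes the 403 in the question may be related to the image requests being sent without a browser-like User-Agent (the question builds an opener with that header but never uses its response), and that an alt text can contain characters that are not valid in a filename. The names safe_stem, fetch and save_image, and the fetcher parameter, are hypothetical; only standard-library calls are used.

```python
import re
import urllib.request
from pathlib import Path

USER_AGENT = "Mozilla/5.0"  # same value the question passes to build_opener


def safe_stem(alt: str, fallback: str) -> str:
    """Turn an alt text into a filename stem; use fallback if nothing is left."""
    stem = re.sub(r"[^\w\- ]+", "_", alt).strip(" _")
    return stem or fallback


def fetch(url: str, timeout: float = 10.0):
    """Return (content_type, body) for url, sending a User-Agent header.

    urlopen raises urllib.error.HTTPError for responses such as 403,
    so an error page is never returned as if it were image data.
    """
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get_content_type(), response.read()


def save_image(url: str, stem: str, out_dir: Path, fetcher=fetch) -> Path:
    """Download url into out_dir; raise ValueError if the body is not an image."""
    content_type, body = fetcher(url)
    if not content_type.startswith("image/"):
        raise ValueError(f"{url} returned {content_type}, not an image")
    suffix = ".png" if content_type == "image/png" else ".jpg"
    path = out_dir / f"{stem}{suffix}"
    path.write_bytes(body)
    return path
```

Inside the loop above, the urlretrieve call could then be swapped for something like save_image(image['src'], safe_stem(image['alt'], "image"), Path(".")). The fetcher argument exists only so the function can be exercised without network access.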

【Comments】:
