【Question title】: How do I obtain the entire code of a website to scrape all images (Python)
【Posted】: 2020-05-21 01:15:16
【Question】:

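I want to build a web scraper in Python to create my own dataset of dog and cat pictures. I want to scrape a certain number of images from the following site: https://unsplash.com/images/animals/dog

The problem I ran into is that the page source does not show all the images, unlike the code from Inspect Element (which contains all the HTML, CSS, and JavaScript). How can I get the complete code so I can scrape all the images? I tried using Selenium and Dryscrape, but without success...

Here is my code: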

#Import
import os
import uuid
import requests
from bs4 import BeautifulSoup
import urllib.request
from google.colab import drive

#Directory
drive.mount('/content/drive')
data_dir = 'drive/My Drive/Colab Notebooks/Web scraper/Images/Dogs'

#Image scraper
url = "https://unsplash.com/images/animals/dog"
source_code = requests.get(url)                  #Requests the page from the website
plain_text = source_code.text                    #Response body as text (static HTML only)
soup = BeautifulSoup(plain_text, "html.parser")  #Parses the HTML of the site

for div in soup.find_all('div', class_="_3oSvn IEpfq"):
  img = div.find_all('img')                               #Finds all img in divs

  for link in img:                                        #Traverses all img
    src = link.get("src")                                 #Gets contents of src from img
    if not src:                                           #Skips img tags without a src
      continue
    img_name = uuid.uuid4().hex                           #Creates a unique name (random.randrange could repeat)
    full_name = os.path.join(data_dir, img_name + ".jpg") #Builds the path inside data_dir
    urllib.request.urlretrieve(src, full_name)            #Fetches the image and saves it into dir

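【Question comments】:

  • Use Selenium WebDriver to execute the JavaScript that dynamically loads the elements.
  • @Barmar I tried that before, but I don't know how to incorporate it into my code... any suggestions?
  • @Barmar I think there would still be some problems even with Selenium, because the images are loaded dynamically as the user scrolls.
  • There is no automatic way to see dynamically loaded content other than simulating the user actions that trigger it.
  • There is no way to know which images might get loaded. It's equivalent to the halting problem.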
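
The scrolling approach discussed in these comments could be sketched roughly as follows. This is only an illustration, not code from the thread: `collect_image_urls`, `scrape_with_selenium`, the JavaScript snippets, and the pause length are all assumptions, and the stop condition (no new images after one scroll) may be too eager on a slow connection. The core loop only relies on the driver's `execute_script` method, so it can be tried with any object that provides it.

```python
import time

# JavaScript snippets run inside the page (assumed to be sufficient for this site).
SCROLL_JS = "window.scrollTo(0, document.body.scrollHeight);"
SRC_JS = "return Array.from(document.images).map(img => img.src);"


def collect_image_urls(driver, wanted, max_scrolls=50, pause=1.0):
    """Scroll the page until `wanted` image URLs are seen or nothing new loads."""
    seen = []
    for _ in range(max_scrolls):
        found_new = False
        # Ask the browser for the src of every <img> currently in the DOM.
        for src in driver.execute_script(SRC_JS):
            if src and src not in seen:
                seen.append(src)
                found_new = True
        if len(seen) >= wanted or not found_new:
            break
        # Simulate the user scrolling to the bottom, then wait for lazy loading.
        driver.execute_script(SCROLL_JS)
        time.sleep(pause)
    return seen[:wanted]


def scrape_with_selenium(url, wanted):
    """Hypothetical wrapper; assumes selenium and a Chrome driver are installed."""
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return collect_image_urls(driver, wanted)
    finally:
        driver.quit()
```

Something like `scrape_with_selenium("https://unsplash.com/images/animals/dog", 100)` would then return up to 100 image URLs, provided the page keeps loading images as it is scrolled.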

Tags: javascript python html web-scraping beautifulsoup


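【Solution 1】:

Sorry for the late reply, I've been a bit busy.

I suggest using their API endpoint, which is intended for developers rather than actual users. The Python code below does exactly that. I have commented it extensively, but feel free to ask if you have any other questions.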

import requests, json

def fetchImages(base_url, maximum, res):
  #create an empty list that will contain the urls
  url_list = []
  #number of photos per page, it seemed to be capped at 30
  chunk_size = 30
  #fetch the images on a given page index using requests
  def fetchChunk(idx):
    #request url for that page
    url = '%s?page=%d&per_page=%d' % (base_url, idx, chunk_size)
    #response text
    return requests.get(url).text
  #parse the received chunk from a string to a dictionary
  def parseChunk(chunk):
    #the json library does the actual parsing
    data = json.loads(chunk)
    #'photos' is the list containing the images
    images = data['photos']
    #loop through each photo from the page and keep only the url
    for img in images:
      #each photo comes with 5 urls, one for each resolution; keep the requested one
      img_url = img['urls'][res]
      #add the url to the list
      url_list.append(img_url)
    #report how many photos that page contained
    return len(images)
  #the current page index
  #although negative indices are valid with that api, I will stick to positive ones by convention
  idx = 0
  #continue fetching pages until there are as many or more images than the max amount
  while len(url_list) < maximum:
    #fetch the chunk
    chunk = fetchChunk(idx)
    #parse it, and stop if the page was empty so the loop cannot run forever
    if parseChunk(chunk) == 0:
      break
    #increase the index
    idx += 1
  #trim the list so it contains the maximum amount
  url_list = url_list[:maximum]
  return url_list

#you can set that to 'cat' in order to fetch pictures of cats instead
animal = 'dog'

#api endpoint for image list
base = 'https://unsplash.com/napi/landing_pages/images/animals/'
url = base + animal

#resolution can be 'full', 'raw', 'regular', 'small' or 'thumb'
resolution = 'regular'

#the number of images to fetch; the website has a seemingly endless amount of dog pictures, but I would recommend not setting that number too high
#from what I've seen, fetching 2500 takes about 20 seconds, so if you plan on fetching a whole lot of photos, I would recommend using a specialized API for that
maximum = 60

#prints the list of urls
print(fetchImages(url, maximum, resolution))

Anyway, good luck with the rest of your project!

If you want to use the code directly, here is a repl.it link without the heavy commenting: https://repl.it/repls/ClosedWarmheartedTheory
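
Since the code in this answer only returns a list of URLs, a minimal sketch of saving them to disk for the dataset might look like the following. The names `fetch_bytes` and `save_images`, the `prefix` parameter, and the `.jpg` extension are assumptions made for illustration; the download step uses the standard library's `urllib.request`, as in the question.

```python
import os
import urllib.request


def fetch_bytes(url):
    """Download a URL and return the raw response body."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def save_images(urls, out_dir, prefix="dog", fetch=fetch_bytes):
    """Save each URL as <prefix>_<index>.jpg in out_dir and return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, url in enumerate(urls):
        # Sequential names avoid the collisions that random names can cause.
        path = os.path.join(out_dir, "%s_%04d.jpg" % (prefix, i))
        with open(path, "wb") as f:
            f.write(fetch(url))
        paths.append(path)
    return paths
```

With a list of URLs in `url_list`, a call such as `save_images(url_list, "Images/Dogs")` would write `dog_0000.jpg`, `dog_0001.jpg`, and so on. The `fetch` argument exists only so the download step can be swapped out, for example in a test.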

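【Comments】:

  • Thank you so much!!! It works perfectly as expected, and it's easy to understand too! I just have one question about the fetchChunk method: where did you get the url from?
  • @alphaverse159 You can read more about the API here: unsplash.com/developers. But when there is no dedicated API, you can use your browser's developer tools to inspect the network requests and try to mimic the relevant ones.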
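
As a rough illustration of mimicking a request found in the browser's network tab, the two steps are rebuilding the request URL and picking the wanted fields out of the JSON response. The sketch below assumes the `page`/`per_page` query parameters and the `photos` → `urls` layout used in the answer above; `build_page_url` and `extract_photo_urls` are hypothetical helper names, and the real response may contain other fields.

```python
import json
from urllib.parse import urlencode


def build_page_url(base_url, page, per_page=30):
    """Rebuild the paginated request URL as seen in the network tab."""
    return base_url + "?" + urlencode({"page": page, "per_page": per_page})


def extract_photo_urls(body, res="regular"):
    """Return the URL of the requested resolution for each photo in a JSON body."""
    data = json.loads(body)
    return [
        photo["urls"][res]
        for photo in data.get("photos", [])
        if res in photo.get("urls", {})
    ]
```

The response text of a request to `build_page_url(base, 1)` can then be passed to `extract_photo_urls` to get the image links for that page.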
【Solution 2】:

If you still need it, try this: https://pypi.org/project/unsplash-get/ Code example:

from unsplash_get import search, save_img

# get list of urls
word = 'orange'
urls = search(word)

# store images if needed
for key, url in enumerate(urls[:10]):
    file = '{}_{:03}.jpg'.format(word, key)
    save_img(url, file)

【Comments】:
