How to create one folder to store images from a website scrape

【Question title】: How to create one folder to store images from a website scrape
【Posted】: 2021-10-03 18:12:13
【Question】:

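I wrote the following code to pull the images for each product from a website scrape. I'm very new to this and not sure how to stop it from creating a new folder for every product. Currently, it creates a new folder called Whiteline Images inside the previous folder, which is also called Whiteline Images - easy enough to fix manually when it's 5 products - not so much when I change it to 500+!! I know where in the code it does this... I'm just not sure how to fix it. Any help is appreciated!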

import requests
from bs4 import BeautifulSoup
import os

def imagedown(url,folder):
    try:
        os.mkdir(os.path.join(os.getcwd(), folder))
    except:
        pass    
    os.chdir(os.path.join(os.getcwd(), folder)) 
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')

    images = soup.findAll('img',{"src":True})

    for index, image in enumerate(images, start=1):
        if(image.get('src').startswith('https://imageapi.partsdb.com.au/api/Image')):
            link = (image.get('src'))
            name = f'{soup.find("div", {"class": "head2BR"}).text} ({index})'

            with open(name + '.jpg','wb') as f:
                im = requests.get(link) 
                f.write(im.content)
                print('Writing:', name)

imagedown('https://www.whiteline.com.au/product_detail4.php?part_number=KBR15', 'whiteline_images')
imagedown('https://www.whiteline.com.au/product_detail4.php?part_number=W13374', 'whiteline_images')
imagedown('https://www.whiteline.com.au/product_detail4.php?part_number=BMR98', 'whiteline_images')
imagedown('https://www.whiteline.com.au/product_detail4.php?part_number=W51210', 'whiteline_images')
imagedown('https://www.whiteline.com.au/product_detail4.php?part_number=W51211', 'whiteline_images')

【Question comments】:

Tags: python image web-scraping directory

【Solution 1】:

When writing the images to the directory, don't change the working directory; build the file path with os.path.join instead:

import requests, os
from bs4 import BeautifulSoup

def imagedown(url, folder):
    if not os.path.isdir(folder):  # cleaner to use os.path.isdir when checking for folder existence
        os.mkdir(folder)
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    for index, image in enumerate(soup.findAll('img', {"src": True}), start=1):
        if image.get('src').startswith('https://imageapi.partsdb.com.au/api/Image'):
            link = image.get('src')
            name = f'{soup.find("div", {"class": "head2BR"}).text} ({index})'
            with open(os.path.join(folder, name + '.jpg'), 'wb') as f:  # join folder name to new image name
                im = requests.get(link)
                f.write(im.content)
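
To see why the question's version nests folders while the os.path.join version does not, here is a minimal offline sketch. It makes no network requests and writes empty files into a throwaway temporary directory. The helper names save_with_chdir, save_with_join and demo are hypothetical, introduced only for this illustration; the first one imitates the os.chdir pattern from the question.

```python
import os
import tempfile


def save_with_chdir(folder, filename):
    """Imitates the question's pattern: create the folder, then chdir into it."""
    try:
        os.mkdir(os.path.join(os.getcwd(), folder))
    except FileExistsError:
        pass
    # The working directory moves one level deeper on every call,
    # so the next call creates `folder` inside `folder`.
    os.chdir(os.path.join(os.getcwd(), folder))
    with open(filename, "wb") as f:
        f.write(b"")
    return os.path.abspath(filename)


def save_with_join(folder, filename):
    """Imitates the answer's pattern: never chdir, build the path instead."""
    if not os.path.isdir(folder):
        os.mkdir(folder)
    path = os.path.join(folder, filename)
    with open(path, "wb") as f:
        f.write(b"")
    return os.path.abspath(path)


def demo(saver):
    """Call saver twice in a temporary directory; return the relative paths written."""
    start = os.getcwd()
    with tempfile.TemporaryDirectory() as tmp:
        root = os.path.realpath(tmp)
        os.chdir(root)
        try:
            paths = [saver("imgs", "a.jpg"), saver("imgs", "b.jpg")]
            return [os.path.relpath(p, root) for p in paths]
        finally:
            os.chdir(start)  # restore the working directory before cleanup


if __name__ == "__main__":
    print(demo(save_with_chdir))  # second file lands in imgs/imgs/
    print(demo(save_with_join))   # both files land in imgs/
```

With the chdir version the second file ends up one folder deeper than the first, which is the behaviour described in the question; with the join version both files share the same folder.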

Edit: updated solution:

import requests, os
from bs4 import BeautifulSoup

def imagedown(url, folder):
    if not os.path.isdir(folder):  # cleaner to use os.path.isdir when checking for folder existence
        os.mkdir(folder)
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    # select only the product images by their CSS classes
    for i, a in enumerate(soup.select('img:is(.mainman, .thumbbot)'), 1):
        name = soup.select_one('div.head2BR').text + f'({i})'
        with open(os.path.join(folder, name + '.jpg'), 'wb') as f:  # join folder name to new image name
            im = requests.get(a['src'])
            f.write(im.content)

imagedown('https://www.whiteline.com.au/product_detail4.php?part_number=KBR15', 'whiteline_images')
imagedown('https://www.whiteline.com.au/product_detail4.php?part_number=W13374', 'whiteline_images')
imagedown('https://www.whiteline.com.au/product_detail4.php?part_number=BMR98', 'whiteline_images')
imagedown('https://www.whiteline.com.au/product_detail4.php?part_number=W51210', 'whiteline_images')
imagedown('https://www.whiteline.com.au/product_detail4.php?part_number=W51211', 'whiteline_images')
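
Since the question mentions scaling from 5 products to 500+, the repeated calls above could be driven from a list of part numbers. The following is only a sketch under assumptions: product_url and download_all are hypothetical helper names, the base URL is copied from the calls above, and the downloader is passed in as a parameter (for example the imagedown function) so that the loop itself makes no assumptions about how a page is scraped.

```python
from urllib.parse import urlencode

# Base URL copied from the example calls above.
BASE_URL = "https://www.whiteline.com.au/product_detail4.php"


def product_url(part_number):
    """Build the product page URL for one part number."""
    return f"{BASE_URL}?{urlencode({'part_number': part_number})}"


def download_all(part_numbers, folder, downloader):
    """Call downloader(url, folder) once per part number.

    Failures are collected and returned instead of stopping the whole run,
    so one bad product page does not abort the remaining downloads.
    """
    failed = []
    for part in part_numbers:
        try:
            downloader(product_url(part), folder)
        except Exception as exc:  # broad on purpose: record the failure and continue
            failed.append((part, exc))
    return failed
```

A possible usage, assuming imagedown is defined as above: `failed = download_all(['KBR15', 'W13374', 'BMR98'], 'whiteline_images', imagedown)`. Every image then goes into the same whiteline_images folder, and failed lists any part numbers worth retrying.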

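【Comments】:

Thanks for your help! This definitely stops the new folders from being created, but now it only downloads one image per product instead of all of them like before. I don't know how to fix that :(
@LyndaHarmer To clarify: when you iterate over soup.findAll('img',{"src":True}), does it yield multiple images? Or is only one image being written to the folder? Also, does your original script still work? My answer only meaningfully changed the part of the code that handles folder and image creation.
My code downloads about 4 images per product. When I changed it to what you suggested, it stopped creating new folders but downloaded only one image per product. I wrote the soup.findAll('img',{"src":True}) part of the code because it was finding every image on the page (Afterpay, the Facebook logo, etc.). That code just strips those out of the results and gives me only the product images.
@LyndaHarmer Please see my latest edit; I added a solution that runs successfully on my machine. Let me know how it works for you.
@LyndaHarmer Glad to help!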
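
The filtering step discussed in the comments (keep only the product images, drop logos and other page furniture) can be checked offline before any download is attempted. The sketch below swaps BeautifulSoup for the standard library's html.parser so that it runs without third-party packages; it is not the code from the answer. ImgSrcCollector, product_image_links and SAMPLE_HTML are hypothetical names, and the sample markup, including its ?id= query strings, is made up for illustration. Only the URL prefix comes from the question.

```python
from html.parser import HTMLParser

# Prefix used in the question to recognise product images.
PRODUCT_PREFIX = "https://imageapi.partsdb.com.au/api/Image"


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag that has one."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def product_image_links(html, prefix=PRODUCT_PREFIX):
    """Return the src of each <img> whose URL starts with prefix, in page order."""
    collector = ImgSrcCollector()
    collector.feed(html)
    return [src for src in collector.sources if src.startswith(prefix)]


# Made-up sample page: two product images, one unrelated logo, one <img> without src.
SAMPLE_HTML = """
<img src="https://imageapi.partsdb.com.au/api/Image?id=1">
<img src="/images/facebook_logo.png">
<img src="https://imageapi.partsdb.com.au/api/Image?id=2">
<img alt="no src here">
"""

if __name__ == "__main__":
    for link in product_image_links(SAMPLE_HTML):
        print(link)
```

Printing the number of links found for a real page before writing any files is a quick way to tell whether a "one image per product" result comes from the filter or from the file-writing step.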