【Question Title】: Python web scraper won't save image files
【Posted】: 2020-11-04 01:01:30
【Question】:

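I started working on a small image-scraping terminal program that is supposed to save images to a specified folder within the program's directory hierarchy. It is based on a basic tutorial I found online. However, whenever I enter a search term in the terminal to start scraping bing.com (yes, I know), the program crashes. The errors I get seem to center on either the image file type not being recognized or the file path for saving the image not being recognized: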

from bs4 import BeautifulSoup
import requests
from PIL import Image
from io import BytesIO

search = input("Search for:")
params = {"q": search}
r = requests.get("http://www.bing.com/images/search", params=params)

soup = BeautifulSoup(r.text, "html.parser")
links = soup.findAll("a", {"class": "thumb"})

for item in links:
    img_obj = requests.get(item.attrs["href"])
    print("Getting", item.attrs["href"])
    title = item.attrs["href"].split("/")[-1]
    img = Image.open(BytesIO(img_obj.content))
    img.save("./scraped_images/" + title, img.format)

This throws: Exception has occurred: FileNotFoundError [Errno 2] No such file or directory: './scraped_images/3849747391_4a7dc3f19e_b.jpg'

I tried adding a file path variable (using pathlib) and concatenating it with the other necessary variables:

from bs4 import BeautifulSoup
import requests
from PIL import Image
from io import BytesIO
from pathlib import Path

image_folder = Path("./scraped_images/")
search = input("Search for:")
params = {"q": search}
r = requests.get("http://www.bing.com/images/search", params=params)

soup = BeautifulSoup(r.text, "html.parser")
links = soup.findAll("a", {"class": "thumb"})

for item in links:
    img_obj = requests.get(item.attrs["href"])
    print("Getting", item.attrs["href"])
    title = item.attrs["href"].split("/")[-1]
    img = Image.open(BytesIO(img_obj.content))
    img.save(image_folder + title, img.format)

This throws: Exception has occurred: TypeError unsupported operand type(s) for +: 'WindowsPath' and 'str'

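I've checked the documentation for PIL, BeautifulSoup, and so on to see whether a recent update might be tripping me up, inspected the elements on Bing to confirm the class is correct, and even tried searching by a different class, but nothing worked. I'm at a loss. Any ideas or guidance are appreciated. Thanks!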

【Question Discussion】:

    Tags: python-3.x beautifulsoup io python-requests python-imaging-library


    【Solution 1】:

    I modified your code slightly:

    from bs4 import BeautifulSoup
    import requests
    from pathlib import Path
    import os

    image_folder = Path("./scraped_images/")
    # Create the target folder first; saving into a missing folder
    # raises FileNotFoundError.
    if not os.path.isdir(image_folder):
        print('Making %s'%(image_folder))
        os.mkdir(image_folder)
    search = input("Search for:")
    params = {"q": search}
    r = requests.get("http://www.bing.com/images/search", params=params)
    
    soup = BeautifulSoup(r.text, "html.parser")
    links = soup.findAll("a", {"class": "thumb"})
    
    for item in links:
        img_obj = requests.get(item.attrs["href"])
        print("Getting", item.attrs["href"])
        # Use the last URL segment as the file name.
        title = item.attrs["href"].split("/")[-1]
        if img_obj.ok:
            # Write the raw response bytes to disk; PIL is not involved.
            with open('%s/%s'%(image_folder, title), 'wb') as file:
                file.write(img_obj.content)
    

    You can use PIL, but you don't need it in this case.
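    The FileNotFoundError in the question most likely comes from the scraped_images folder not existing yet, and the TypeError comes from using + between a Path and a str. As a minimal sketch of the no-PIL approach using only pathlib (the helper name save_bytes is hypothetical and not part of the original code), the folder can be created on demand and the file name joined with the / operator:

```python
from pathlib import Path


def save_bytes(folder, name, data):
    """Write raw bytes to folder/name, creating the folder if needed.

    save_bytes is a hypothetical helper used for illustration only.
    """
    folder = Path(folder)
    # parents=True also creates intermediate directories; exist_ok=True
    # makes the call a no-op when the directory is already there.
    folder.mkdir(parents=True, exist_ok=True)
    # Path objects are joined with "/"; Path + str raises TypeError.
    target = folder / name
    target.write_bytes(data)
    return target


if __name__ == "__main__":
    import tempfile

    # Demonstrate with a temporary directory and a few placeholder bytes.
    with tempfile.TemporaryDirectory() as tmp:
        saved = save_bytes(Path(tmp) / "scraped_images", "example.jpg", b"\xff\xd8\xff")
        print(saved.name, saved.stat().st_size)
```

    In the scraper loop this would be called as save_bytes(image_folder, title, img_obj.content) once img_obj.ok is true.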

    I also reworked the code so it works better with PIL:

    from bs4 import BeautifulSoup
    import requests
    from PIL import Image
    from io import BytesIO
    from pathlib import Path
    
    s = requests.Session()
    image_folder = Path("./scraped_images/")
    # Make sure the target folder exists before saving into it.
    image_folder.mkdir(parents=True, exist_ok=True)
    search = input("Search for:")
    params = {"q": search}
    r = s.get("http://www.bing.com/images/search", params=params)
    
    soup = BeautifulSoup(r.text, "html.parser")
    links = soup.findAll("a", {"class": "thumb"})
    
    for item in links:
        url = item.attrs["href"]
        # Last URL segment without any query string; computed before the
        # try block so the error message below can always use it.
        title = url.split("/")[-1].split("?")[0]
        try:
            # The header value is just the agent string, without the header name.
            img_obj = s.get(url, headers={'User-Agent': 'Mozilla/5.0'})
            if not img_obj.ok:
                continue
            print("Getting", url)
            img = Image.open(BytesIO(img_obj.content))
            img.save(image_folder / title, img.format)
        except OSError:
            # Covers PIL failing to read the image as well as network errors.
            print('\nError downloading %s try to visit'
                  '\n%s\n'
                  'manually and try to get the image manually.\n' %(title, url))
    

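    I use a requests session and added a try/except for cases where PIL can't create the image. I also only try to create the image if the request gets a successful (200) response from the site.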
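    The two ideas above, taking a clean file name from the URL and only using the response when the request succeeded, can be sketched separately from PIL. This is a rough illustration under stated assumptions: filename_from_url and fetch_bytes are hypothetical names, and session is assumed to behave like requests.Session (its get() returns an object with .ok and .content). It relies on requests' exceptions deriving from OSError, the same exception type the code above catches.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def filename_from_url(url, default="image"):
    """Return the last path segment of a URL, ignoring query and fragment.

    Hypothetical helper; falls back to `default` when the path has no name.
    """
    name = PurePosixPath(urlsplit(url).path).name
    return name or default


def fetch_bytes(session, url):
    """Return the response body, or None on a network error or non-OK status.

    `session` is assumed to behave like requests.Session. Exceptions raised
    by requests derive from OSError, so that is what gets caught here.
    """
    try:
        response = session.get(url, headers={"User-Agent": "Mozilla/5.0"})
    except OSError:
        return None
    return response.content if response.ok else None


if __name__ == "__main__":
    # Only the file-name helper is exercised here; it needs no network access.
    print(filename_from_url("https://example.com/img/photo_b.jpg?width=300"))
```

    With a real session, fetch_bytes(s, item.attrs["href"]) would return bytes that can be passed to Image.open(BytesIO(...)) or written straight to disk.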

    【Discussion】:

    • Sorry for the delayed response; this seems to work much better. Thanks for your help!