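How to download an image using Selenium (any version)? Answers

[Question title]: How to download an image using Selenium (any version)?
[Posted]: 2011-10-12 10:11:38
[Question description]:

I would like to know how to download an image from a page using selenium/webdriver. Assume that a user session is required to download the image, so having the plain URL is not enough. Any sample code is much appreciated.

[Question discussion]:

The actual image displayed on the page, not a screenshot of the whole page.
See stackoverflow.com/a/47425305/121322

Tags: selenium webdriver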

[Solution 1]:

How to download to a file, taking the URL from an element's text or attribute

The complete extension code can be found here:

https://github.com/gravity-api/gravity-core/blob/master/src/csharp/Gravity.Core/Gravity.Core/Extensions/WebElementExtensions.cs

If you want to use this method without writing the code yourself, use the NuGet package https://www.nuget.org/packages/Gravity.Core/

Install-Package Gravity.Core -Version 2020.7.5.3

Usage

using OpenQA.Selenium.Extensions;
 
...
 
var driver = new ChromeDriver();
 
// from element attribute
var element = driver.FindElement(By.XPath("//img[@id='my_img']")).DownloadResource(path: @"C:\images\cap_image_01.png", attribute: "src");
 
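// from element text (different variable name, since `element` is already declared above)
var elementFromText = driver.FindElement(By.XPath("//div[1]")).DownloadResource(path: @"C:\images\cap_image_01.png");

Using NuGet is recommended, since the package contains more Selenium tools and extensions.

Using it without NuGet (implementing it yourself)

Extension class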

using System.IO;
using System.Net.Http;
using System.Text.RegularExpressions;
using OpenQA.Selenium;
 
namespace Extensions
{
    public static class WebElementExtensions
    {
        public static IWebElement DownloadResource(this IWebElement element, string path)
        {
            return DoDownloadResource(element, path, "");
        }
 
        public static IWebElement DownloadResource(this IWebElement element, string path, string attribute)
        {
            return DoDownloadResource(element, path, attribute);
        }
 
        private static IWebElement DoDownloadResource(this IWebElement element, string path, string attribute)
        {
            // get resource address
            var resource = (string.IsNullOrEmpty(attribute))
                ? element.Text
                : element.GetAttribute(attribute);
 
            // download resource
            using (var client = new HttpClient())
            {
                // get response for the current resource
                var httpResponseMessage = client.GetAsync(resource).GetAwaiter().GetResult();
 
                // exit condition
                if (!httpResponseMessage.IsSuccessStatusCode) return element;
 
                // create directories path
                Directory.CreateDirectory(path);
 
                // get absolute file name
                var fileName = Regex.Match(resource, @"[^/\\&\?]+\.\w{3,4}(?=([\?&].*$|$))").Value;
                path = (path.LastIndexOf(@"\") == path.Length - 1)
                    ? path + fileName
                    : path + $@"\{fileName}";
 
                // write the file
                File.WriteAllBytes(path, httpResponseMessage.Content.ReadAsByteArrayAsync().GetAwaiter().GetResult());
            }
 
            // keep the fluent
            return element;
        }
    }
}

Usage

using Extensions;
 
...
 
var driver = new ChromeDriver();
 
// from element attribute; the extension above treats `path` as a directory
// and takes the file name from the resource URL
var element = driver.FindElement(By.XPath("//img[@id='my_img']")).DownloadResource(path: @"C:\images", attribute: "src");
 
// from element text
var elementFromText = driver.FindElement(By.XPath("//div[1]")).DownloadResource(path: @"C:\images");
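
The extension above derives the local file name from the resource URL with a regular expression. As a rough Python sketch of the same step (not part of the answer; `target_path` and the `image.bin` fallback are hypothetical names), the standard library can split the URL so that query strings are ignored:

```python
import posixpath
from pathlib import Path
from urllib.parse import unquote, urlsplit


def target_path(directory, url, default_name="image.bin"):
    """Build a local file path from a resource URL (hypothetical helper)."""
    # Use only the path component, so query strings and fragments are ignored.
    name = posixpath.basename(unquote(urlsplit(url).path))
    # Fall back to a default name when the URL path ends with a slash.
    return Path(directory) / (name or default_name)


if __name__ == "__main__":
    print(target_path("images", "https://example.com/a/b/logo.png?v=3"))
```

Unlike the regular expression, this does not require the name to end in a three or four letter extension, which may or may not be what you want.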

[Discussion]:

[Solution 2]:

Although @aboy021's JS code is syntactically correct, I could not get it to run (using Chrome V83.xx).

But this code works (Java):

    String url = "/your-url-goes.here.jpg";
    String imageData = (String) ((JavascriptExecutor) driver).executeAsyncScript(
            "var callback = arguments[0];" + // The callback from ExecuteAsyncScript
                    "var reader;" +
                    "var xhr = new XMLHttpRequest();" +
                    "xhr.onreadystatechange = function() {" +
                    "  if (xhr.readyState == 4) {" +
                        "var reader = new FileReader();" +
                        "reader.readAsDataURL(xhr.response);" +
                        "reader.onloadend = function() {" +
                        "    callback(reader.result);" +
                        "}" +
                    "  }" +
                    "};" +
                    "xhr.open('GET', '" + url + "', true);" +
                    "xhr.responseType = 'blob';" +
                    "xhr.send();");

    String base64Data = imageData.split(",")[1];

    byte[] decodedBytes = Base64.getDecoder().decode(base64Data);
    try (OutputStream stream = new FileOutputStream("c:\\dev\\tmp\\output.jpg")) {
        stream.write(decodedBytes);
    } catch (IOException e) {
        e.printStackTrace();
    }
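
For readers using the Python bindings, the same idea might be sketched as below. This is a port under assumptions, not tested against a real site: `download_image_bytes` is a hypothetical name, and the script relies on Selenium passing the completion callback as the last argument of an async script. Because the request is made by the page itself, it carries the browser's session.

```python
import base64

# JavaScript executed inside the page. arguments[0] is the URL we pass in;
# the last argument is the callback that Selenium appends for async scripts.
FETCH_AS_DATA_URL = """
var url = arguments[0];
var done = arguments[arguments.length - 1];
var xhr = new XMLHttpRequest();
xhr.onload = function () {
  var reader = new FileReader();
  reader.onloadend = function () { done(reader.result); };
  reader.readAsDataURL(xhr.response);
};
xhr.onerror = function () { done(null); };
xhr.open('GET', url);
xhr.responseType = 'blob';
xhr.send();
"""


def download_image_bytes(driver, url):
    """Return the image bytes fetched by the browser, or None on failure."""
    data_url = driver.execute_async_script(FETCH_AS_DATA_URL, url)
    if not data_url:
        return None
    # A data URL looks like "data:image/jpeg;base64,<payload>".
    _, _, payload = data_url.partition(",")
    return base64.b64decode(payload)
```

Passing the URL as a script argument avoids building the JavaScript by string concatenation, so quotes in the URL cannot break the script.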

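[Discussion]:

It took me a lot of time. I hope it will be put to good use some day. :) Maybe someone can tell me why aboy021's version does not work for me.

[Solution 3]:

The only way I found to avoid downloading the image twice is to use the Chrome DevTools Protocol.

In Python, this gives: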

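import base64
import pychrome

# Assumes `tab` is a pychrome tab attached to the same Chrome instance
# that the Selenium `driver` controls; creating them is not shown here.

def save_image(file_content, file_name):
    try:
        file_content = base64.b64decode(file_content)
        with open("C:\\Crawler\\temp\\" + file_name, "wb") as f:
            f.write(file_content)
    except Exception as e:
        print(str(e))

def response_received(requestId, loaderId, timestamp, type, response, frameId, **kwargs):
    # **kwargs absorbs any additional fields the event may carry
    if type == 'Image':
        url = response.get('url')
        print(f"Image loaded: {url}")
        response_body = tab.Network.getResponseBody(requestId=requestId)
        # file name is the last path segment, without the query string
        file_name = url.split('/')[-1].split('?')[0]
        if file_name:
            save_image(response_body['body'], file_name)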


tab.Network.responseReceived = response_received

# start the tab 
tab.start()

# call method
tab.Network.enable()

# get request to target the site selenium 
driver.get("https://www.realtor.com/ads/forsale/TMAI112283AAAA")

# wait for loading
tab.wait(50)
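
The code above always base64-decodes the body. The result of Network.getResponseBody also carries a `base64Encoded` flag, so a slightly more defensive conversion could look like this minimal sketch (`body_to_bytes` is a hypothetical helper, not part of pychrome):

```python
import base64


def body_to_bytes(response_body):
    """Convert a Network.getResponseBody result dict into raw bytes."""
    body = response_body.get("body", "")
    if response_body.get("base64Encoded"):
        # Binary content such as PNG or JPEG is delivered base64-encoded.
        return base64.b64decode(body)
    # Text content (for example an SVG image) arrives as a plain string.
    return body.encode("utf-8")
```

With this helper, the bytes can be written straight to the file without a further decode step.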

[Discussion]:

[Solution 4]:

Try the following:

JavascriptExecutor js = (JavascriptExecutor) driver;                              
String base64string = (String) js.executeScript("var c = document.createElement('canvas');"
                       + " var ctx = c.getContext('2d');"
                       + "var img = document.getElementsByTagName('img')[0];"
                       + "c.height=img.naturalHeight;"
                       + "c.width=img.naturalWidth;"
                       + "ctx.drawImage(img, 0, 0,img.naturalWidth, img.naturalHeight);"
                       + "var base64String = c.toDataURL();"
                       + "return base64String;");
String[] base64Array = base64string.split(",");

String base64 = base64Array[base64Array.length - 1];

// assuming java.util.Base64, as in the other Java answer on this page
byte[] data = Base64.getDecoder().decode(base64);

ByteArrayInputStream memstream = new ByteArrayInputStream(data);
BufferedImage saveImage = ImageIO.read(memstream);

ImageIO.write(saveImage, "png", new File("path"));

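[Discussion]:

This is a gem! Great find!
Side note: of course, this way you lose all the EXIF/metadata of the original photo.
The best! This was the solution for me.

[Solution 5]:

For my use case there were cookies and other issues that made the other approaches here unsuitable.

I ended up using an XMLHttpRequest to populate a FileReader (from "How to convert image into base64 string using javascript"), and then calling it through Selenium's ExecuteAsyncScript (as shown in "Selenium and asynchronous JavaScript calls"). This gave me a Data URL, which is straightforward to parse.

Here is the C# code I use to get the Data URL: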

public string ImageUrlToDataUrl(IWebDriver driver, string imageUrl)
{
  var js = new StringBuilder();
  js.AppendLine("var done = arguments[0];"); // The callback from ExecuteAsyncScript
  js.AppendLine(@"
    function toDataURL(url, callback) {
      var xhr = new XMLHttpRequest();
      xhr.onload = function() {
        var reader = new FileReader();
        reader.onloadend = function() {
          callback(reader.result);
        }
        reader.readAsDataURL(xhr.response);
      };
      xhr.open('GET', url);
      xhr.responseType = 'blob';
      xhr.send();
    }"); // XMLHttpRequest -> FileReader -> DataURL conversion
  js.AppendLine("toDataURL('" + imageUrl + "', done);"); // Invoke the function

  var executor = (IJavaScriptExecutor) driver;
  var dataUrl = executor.ExecuteAsyncScript(js.ToString()) as string;
  return dataUrl;
}
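
The method returns the Data URL but leaves the parsing to the caller. As an illustration only (in Python rather than C#, with hypothetical helper names), the media type and the raw bytes can be separated like this:

```python
import base64
import mimetypes


def parse_data_url(data_url):
    """Split a base64 data URL into (media_type, raw_bytes)."""
    header, sep, payload = data_url.partition(",")
    if not header.startswith("data:") or not sep:
        raise ValueError("not a data URL")
    # The header looks like "data:image/png;base64".
    meta = header[len("data:"):].split(";")
    media_type = meta[0] or "text/plain"
    if "base64" not in meta[1:]:
        raise ValueError("only base64 data URLs are handled in this sketch")
    return media_type, base64.b64decode(payload)


def extension_for(media_type):
    """Guess a file extension for the media type; fall back to .bin."""
    return mimetypes.guess_extension(media_type) or ".bin"
```

FileReader.readAsDataURL produces base64 data URLs, so the base64-only restriction should be enough for this use case.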

[Discussion]:

[Solution 6]:

I prefer to do something like this:

1. Get the SRC attribute of the image.
2. Use ImageIO.read to read the image onto a BufferedImage
3. Save the BufferedImage using ImageIO.write function

For example:

String src = imgElement.getAttribute("src");
BufferedImage bufferedImage = ImageIO.read(new URL(src));
File outputfile = new File("saved.png");
ImageIO.write(bufferedImage, "png", outputfile);

[Discussion]:

[Solution 7]:

This worked for me:

from time import sleep

# open the image in a new tab
driver.execute_script('''window.open("''' + wanted_url + '''","_blank");''')
sleep(2)
driver.switch_to.window(driver.window_handles[1])
sleep(2)

# make screenshot
driver.save_screenshot("C://Folder/" + photo_name + ".jpeg")
sleep(2)

# close the new tab
driver.execute_script('''window.close();''')
sleep(2)

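# back to the original tab
driver.switch_to.window(driver.window_handles[0])

[Discussion]:

Does the new tab make a new request to the source?
I think so, but I did not check it with a sniffer.

[Solution 8]:

Here is a JavaScript solution. It is a bit goofy, and I am wary of hitting the source image's server with too many requests. Can anyone tell me whether fetch() goes through the browser's cache? I don't want to spam the source server.

It attaches a FileReader() to the window, fetches the image, converts it to base64, and tacks that string onto the window.

The driver can then return that window variable.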

// assumes By and until have been imported from 'selenium-webdriver'
export async function scrapePic(driver) {
  // declared outside try so it is still in scope for the final return
  let img64;
  try {
    console.log("waiting for that profile piccah")
    console.log(driver)

    let rootEl = await driver.findElement(By.css('.your-root-element'));
    let imgEl = await rootEl.findElement(By.css('img'))
    // the timeout is an argument of driver.wait, not of the condition
    await driver.wait(until.elementIsVisible(imgEl), 10000);
    console.log('profile piccah found')
    let img = await imgEl.getAttribute('src')
    // attach reader to driver window
    await driver.executeScript(`window.myFileReader = new FileReader();`)
    await driver.executeScript(`
      window.myFileReader.onloadend = function() {
        window['profileImage'] = this.result
      }
      fetch( arguments[0] ).then( res => res.blob() ).then( blob => window.myFileReader.readAsDataURL(blob) )
      `, img)
    // crude wait for the fetch and the FileReader to finish
    await driver.sleep(5000)
    img64 = await driver.executeScript(`return window.profileImage`)
    console.log(img64)
  } catch (e) {
    console.log(e)
  }
  return img64
}

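[Discussion]:

Because the window has already loaded the profile picture on the page, I am hoping that when it runs fetch() it simply hits the cache... I am not entirely sure how to confirm whether that is the case.

[Solution 9]:

The other solutions here do not work in all browsers, do not work on all websites, or both.

This solution should be more robust. It uses the browser to view the image, resizes the browser to fit the image, takes a screenshot, and finally resizes the browser back to its original size.

Python: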

import logging

def get_image(driver, img_url):
    '''Given an image's URL, return a binary screenshot of it in PNG format.'''
    driver.get(img_url)

    # Get the dimensions of the browser and image.
    orig_h = driver.execute_script("return window.outerHeight")
    orig_w = driver.execute_script("return window.outerWidth")
    margin_h = orig_h - driver.execute_script("return window.innerHeight")
    margin_w = orig_w - driver.execute_script("return window.innerWidth")
    new_h = driver.execute_script('return document.getElementsByTagName("img")[0].height')
    new_w = driver.execute_script('return document.getElementsByTagName("img")[0].width')

    # Resize the browser window.
    logging.info("Getting Image: orig %sX%s, marg %sX%s, img %sX%s - %s"%(
      orig_w, orig_h, margin_w, margin_h, new_w, new_h, img_url))
    driver.set_window_size(new_w + margin_w, new_h + margin_h)

    # Get the image by taking a screenshot of the page.
    img_val = driver.get_screenshot_as_png()
    # Set the window size back to what it was.
    driver.set_window_size(orig_w, orig_h)

    # Go back to where we started.
    driver.back()
    return img_val

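One drawback of this solution is that if the image is very small, the browser will not resize that small, and you may end up with a black border around the image.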
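
A different technique from the one in this answer, mentioned here only as a possible way around the border problem: the Selenium Python bindings expose a `screenshot_as_png` property on WebElement, which captures just that element. A minimal sketch, where `save_element_png` is a hypothetical helper:

```python
def save_element_png(element, path):
    """Write a PNG screenshot of a single element to `path`.

    Returns the number of bytes written.
    """
    # screenshot_as_png is a property on Selenium's WebElement.
    png_bytes = element.screenshot_as_png
    with open(path, "wb") as fh:
        fh.write(png_bytes)
    return len(png_bytes)
```

This would be called with something like the `img` element found on the page. Like the window-resizing approach, it yields a re-rendered screenshot, not the original image file.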

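[Discussion]:

An interesting idea, but this does not give you the original image.

[Solution 10]:

Use selenium to get the image src:

elemImg.get_attribute('src')

Then use your programming language to download it. For Python, see this answer: How to save an image locally using Python whose URL address I already know?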
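
For the Python case, a minimal standard-library sketch of the download step could look like this (`save_image` is a hypothetical name). Note that this request is made outside the browser, so it does not carry the browser's session cookies:

```python
import shutil
import urllib.request


def save_image(url, path):
    """Stream the resource at `url` into the file at `path`."""
    with urllib.request.urlopen(url) as response, open(path, "wb") as out:
        # Copy in chunks so large images are not held in memory at once.
        shutil.copyfileobj(response, out)
```

Usage would be along the lines of `save_image(elemImg.get_attribute('src'), 'saved.png')`, where the output file name is just an example.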

[Discussion]:

If you do not want the image to be downloaded twice, disable images in Chrome: set prefs = {"profile.managed_default_content_settings.images": 2} and then call chrome_options.add_experimental_option("prefs", prefs)

[Solution 11]:

I prefer it this way:

 WebElement logo = driver.findElement(By.cssSelector(".image-logo"));
 String logoSRC = logo.getAttribute("src");

 URL imageURL = new URL(logoSRC);
 BufferedImage saveImage = ImageIO.read(imageURL);

 ImageIO.write(saveImage, "png", new File("logo-image.png"));

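[Discussion]:

[Solution 12]:

Another, and arguably the most correct, solution is to download it directly with a simple HTTP request.
You can reuse webDriver's user session, because it stores the cookies.
In my example I only analyze the status code that comes back. If it is 200, the image exists and can be displayed or downloaded. If you really need to download the file itself, you can get all of the image data from the httpResponse entity (use it as a simple input stream).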

// just look at your cookie's content (e.g. using browser)
// and import these settings from it
private static final String SESSION_COOKIE_NAME = "JSESSIONID";
private static final String DOMAIN = "domain.here.com";
private static final String COOKIE_PATH = "/cookie/path/here";

protected boolean isResourceAvailableByUrl(String resourceUrl) {
    HttpClient httpClient = new DefaultHttpClient();
    HttpContext localContext = new BasicHttpContext();
    BasicCookieStore cookieStore = new BasicCookieStore();
    // apply jsessionid cookie if it exists
    cookieStore.addCookie(getSessionCookie());
    localContext.setAttribute(ClientContext.COOKIE_STORE, cookieStore);
    // resourceUrl - is url which leads to image
    HttpGet httpGet = new HttpGet(resourceUrl);

    try {
        HttpResponse httpResponse = httpClient.execute(httpGet, localContext);
        return httpResponse.getStatusLine().getStatusCode() == HttpStatus.SC_OK;
    } catch (IOException e) {
        return false;
    }
}

protected BasicClientCookie getSessionCookie() {
    Cookie originalCookie = webDriver.manage().getCookieNamed(SESSION_COOKIE_NAME);

    if (originalCookie == null) {
        return null;
    }

    // just build new apache-like cookie based on webDriver's one
    String cookieName = originalCookie.getName();
    String cookieValue = originalCookie.getValue();
    BasicClientCookie resultCookie = new BasicClientCookie(cookieName, cookieValue);
    resultCookie.setDomain(DOMAIN);
    resultCookie.setExpiryDate(originalCookie.getExpiry());
    resultCookie.setPath(COOKIE_PATH);
    return resultCookie;
}
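
The same cookie hand-off can be sketched in Python. This is an illustration under assumptions, not a port of the Java code: the helper names are hypothetical, and it sends every cookie the driver returns without checking domain or path.

```python
import urllib.request


def cookie_header(cookies):
    """Join WebDriver cookie dicts into a single Cookie header value."""
    # driver.get_cookies() returns dicts that include 'name' and 'value'.
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


def build_request(url, cookies, user_agent=None):
    """Create a urllib Request carrying the browser's session cookies."""
    headers = {"Cookie": cookie_header(cookies)}
    if user_agent:
        # Some servers tie a session to the user agent that created it.
        headers["User-Agent"] = user_agent
    return urllib.request.Request(url, headers=headers)
```

A call such as `urllib.request.urlopen(build_request(src, driver.get_cookies()))` would then fetch the image with the session of the browser that Selenium controls.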

[Discussion]:

[Solution 13]:

If you need to test whether the image is available and exists, you can do this:

protected boolean isResourceAvailableByUrl(String resourceUrl) {
    // backup current url, to come back to it in future
    String currentUrl = webDriver.getCurrentUrl();
    try {
        // try to get image by url
        webDriver.get(resourceUrl);
        // if the "resource not found" message did not appear, the image exists
        return webDriver.findElements(RESOURCE_NOT_FOUND).isEmpty();
    } finally {
        // back to page
        webDriver.get(currentUrl);
    }
}

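But you need to be sure, before executing this method, that going to currentUrl really does bring you back to the page. In my case it does. If not, you can try using:

webDriver.navigate().back()

Also, unfortunately, there seems to be no way to analyze the response status code. That is why you need to find some specific web element on the NOT_FOUND page, check whether it appears, and then decide that the image does not exist.

This is only a workaround, because I did not find any official way to solve it.

Note: this solution is helpful when you use an authorized session to get the resource and cannot download it simply through ImageIO or strictly through HttpClient.

[Discussion]: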