Inspect URL and download image: answers

【Question title】: Inspect URL and download Image
【Posted】: 2015-01-25 18:36:59
【Question】:

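My goal is to write a Java application that inspects the following URL: https://familysearch.org/pal:/MM9.3.1/TH-1971-28699-12927-58, saves the image (a scan of a page from an old book), navigates to the next page, and repeats the process. The images can be downloaded manually, but I want to automate the task. The problem is that I don't know much about web technologies, so I'm struggling.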

I looked at the resources behind the URL with my browser's network inspector and concluded that the image can be found here: https://familysearch.org/pal:/MM9.3.1/TH-1971-28699-12927-58.jpg.

So I tried the following snippet:

public static void saveImage(String imageUrl, String destinationFile) throws IOException {
    URL url = new URL(imageUrl);
    InputStream is = url.openStream();
    OutputStream os = new FileOutputStream(destinationFile);

    byte[] b = new byte[2048];
    int length;

    while ((length = is.read(b)) != -1) {
        os.write(b, 0, length);
    }

    is.close();
    os.close();
}

public static void main(String[] args) throws Exception {

    String imageUrl = "https://familysearch.org/pal:/MM9.3.1/TH-1971-28699-12927-58.jpg";
    String destinationFile = "./image.jpg";

    saveImage(imageUrl, destinationFile);
}

That didn't really work. I get the following output:

Exception in thread "main" java.io.IOException: Server returned HTTP response code: 500 for URL: https://familysearch.org/pal:/MM9.3.1/TH-1971-28699-12927-58.jpg
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1626)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:254)
at java.net.URL.openStream(URL.java:1037)
at mainpackage.Main.saveImage(Main.java:25)
at mainpackage.Main.main(Main.java:44)

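So I have two questions: first, how do I proceed to download the image, and second, how do I find the URL of the next image, since the URLs don't seem to follow a pattern (for example, a counter)?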

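【Comments】:

It looks like the server is blocking hotlinking. If so, that means they don't want you to do this.
But how can my browser access the image? I can right-click it and choose "Save as".
Well, I don't know. I'm not sure how it works, I've only heard about it. But a 500 code means an internal server error, which shouldn't happen when an image is downloaded by direct access.
The server may be checking your user agent (or other HTTP metadata) and blocking requests based on it. Right now you aren't setting a user agent. (Your browser does set one when it makes the request.)
How can I set it from my code?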

Tags: java image web web-crawler

【Solution 1】:

Here is a working example:

import javax.net.ssl.HttpsURLConnection;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

public class StackOverflowTest {

    public static void saveImage(final String imageUrl, final String destinationFile) throws IOException {
        final URL url = new URL(imageUrl);
        final HttpsURLConnection urlConnection = (HttpsURLConnection) url.openConnection();

        // Send browser-like headers; the server answers 500 when the Accept header is missing.
        urlConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36");
        urlConnection.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
        urlConnection.setInstanceFollowRedirects(true);

        // try-with-resources closes both streams even if the copy fails midway.
        try (InputStream is = urlConnection.getInputStream();
             OutputStream os = new FileOutputStream(destinationFile)) {
            final byte[] b = new byte[2048];
            int length;

            while ((length = is.read(b)) != -1) {
                os.write(b, 0, length);
            }
        }
    }

    public static void main(final String args[]) throws Exception {

        final String imageUrl = "https://familysearch.org/pal:/MM9.3.1/TH-1971-28699-12927-58.jpg";
        final String destinationFile = "./image.jpg";

        saveImage(imageUrl, destinationFile);
    }
}

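The problem is that the web server expects an Accept header and fails when it doesn't find one, returning a 500 response. (In addition, the image URL performs a redirect.)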
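If a download still fails, it can help to look at the status code before reading the body. The following is only a sketch of that idea, not a tested client for this site: the class name `BrowserLikeDownload` and its methods are made up for illustration, and the header values are simply the ones from the answer's code. It sets the headers, checks the HTTP status explicitly, and copies the body with `Files.copy`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper class; the name is an assumption made for this sketch.
public class BrowserLikeDownload {

    // Header values taken from the answer; other realistic browser values may work too.
    static final String USER_AGENT = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36";
    static final String ACCEPT = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";

    // Prepares the request. Nothing is sent over the network at this point.
    public static HttpURLConnection open(String imageUrl) throws IOException {
        HttpURLConnection connection =
                (HttpURLConnection) URI.create(imageUrl).toURL().openConnection();
        connection.setRequestProperty("User-Agent", USER_AGENT);
        connection.setRequestProperty("Accept", ACCEPT);
        connection.setInstanceFollowRedirects(true);
        return connection;
    }

    // Sends the request and saves the body, failing with a readable message on a non-200 status.
    public static void download(String imageUrl, Path destination) throws IOException {
        HttpURLConnection connection = open(imageUrl);
        try {
            int status = connection.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Unexpected HTTP status " + status + " for " + imageUrl);
            }
            try (InputStream in = connection.getInputStream()) {
                Files.copy(in, destination, StandardCopyOption.REPLACE_EXISTING);
            }
        } finally {
            connection.disconnect();
        }
    }
}
```

A call would look like `BrowserLikeDownload.download(imageUrl, Paths.get("image.jpg"))`. Note that `HttpURLConnection` does not follow redirects that switch protocol (for example from HTTPS to HTTP), so a 3xx status reported here would have to be followed by hand.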

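As for finding the next image: that is a more complicated task. If there is no simple way to identify the next image, you may need to look at an XML/HTML parser for Java. Jsoup (http://jsoup.org/) is a good, fast tool.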
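To give an idea of what "identifying the next image" could involve, here is a dependency-free sketch that uses regular expressions instead of a real parser such as Jsoup. It rests on an assumption that may well not hold for this site: that the page HTML contains a link marked `rel="next"`. The class `NextLinkFinder` and its method are hypothetical, and regular expressions are a fragile way to read HTML, so a proper parser is the sturdier choice once the page structure is known.

```java
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; assumes the page marks its "next page" link with rel="next".
public class NextLinkFinder {

    // Captures the attribute text of every <a ...> start tag.
    private static final Pattern ANCHOR =
            Pattern.compile("<a\\s+([^>]*)>", Pattern.CASE_INSENSITIVE);
    private static final Pattern REL_NEXT =
            Pattern.compile("rel\\s*=\\s*[\"']next[\"']", Pattern.CASE_INSENSITIVE);
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns the absolute URL of the first rel="next" link, or empty if there is none.
    // URI.create throws IllegalArgumentException if the href is not a valid URI.
    public static Optional<String> findNextUrl(String html, String pageUrl) {
        Matcher anchor = ANCHOR.matcher(html);
        while (anchor.find()) {
            String attributes = anchor.group(1);
            if (!REL_NEXT.matcher(attributes).find()) {
                continue; // not a "next" link
            }
            Matcher href = HREF.matcher(attributes);
            if (href.find()) {
                // Resolve relative links against the URL of the page they came from.
                return Optional.of(URI.create(pageUrl).resolve(href.group(1)).toString());
            }
        }
        return Optional.empty();
    }
}
```

The returned URL could then be fed back into the download method in a loop until no next link is found. If the page builds its navigation with JavaScript, the link will not be in the raw HTML at all, and the browser's network inspector is the better place to look for how the next page is requested.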

【Comments】:

Thank you very much! Now I'll try to find a solution to the second problem!