How to fetch only HTML-type URLs with jsoup

[Question title]: How to fetch only URLs with an HTML content type using jsoup
[Posted]: 2018-06-04 22:13:42
[Question description]:

I only want to download pages whose content type is "text/html", and skip pdf/mp4/rar... files.

My code currently looks like this:

Connection connection = Jsoup.connect(linkInfo.getLink())
    .followRedirects(false)
    .validateTLSCertificates(false)
    .userAgent(USER_AGENT);
Document htmlDocument = connection.get();
if (!connection.response().contentType().contains("text/html")) {
    return;
}

Is there something like this:

Jsoup.connect(linkInfo.getLink()).contentTypeOnly("text/html");

[Question comments]:

Tags: java web-crawler jsoup

[Solution 1]:

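If you mean that you need a way to tell whether a file is HTML before actually downloading it, you can use a HEAD request. A HEAD request asks for the headers only, so you can check whether the content type is text/html before downloading the body. The approach in your question does not really work, because it downloads the file and parses it as HTML before the check runs. On non-HTML files such as a PDF, jsoup throws an exception at that point (by default an UnsupportedMimeTypeException, unless ignoreContentType(true) is set), so the check is never reached.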

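// Send a HEAD request first: only the response headers are transferred.
Connection connection = Jsoup.connect(linkInfo.getLink())
    .method(Connection.Method.HEAD)
    .validateTLSCertificates(false)
    .followRedirects(false)
    .userAgent(USER_AGENT);

Connection.Response head = connection.execute();
// contentType() returns null when the server sends no Content-Type header.
String contentType = head.contentType();
if (contentType == null || !contentType.contains("text/html")) return;

// The type is HTML, so download and parse the page.
// Jsoup.connect takes a String, hence toExternalForm() on the URL.
Document html = Jsoup.connect(head.url().toExternalForm())
    .validateTLSCertificates(false)
    .followRedirects(false)
    .userAgent(USER_AGENT)
    .get();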
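
Comparing the raw Content-Type header by substring needs some care: the header can be absent, it can carry parameters such as a charset, and media type names are case-insensitive. Below is a minimal sketch of a stricter check using only the JDK. The class name `ContentTypes` and the method `isHtml` are made up for illustration and are not part of jsoup.

```java
import java.util.Locale;

// Hypothetical helper, not part of jsoup: decides whether a raw
// Content-Type header value denotes text/html.
final class ContentTypes {
    private ContentTypes() {}

    static boolean isHtml(String contentTypeHeader) {
        if (contentTypeHeader == null) {
            return false; // the server sent no Content-Type header
        }
        // Drop parameters such as "; charset=UTF-8".
        int semicolon = contentTypeHeader.indexOf(';');
        String mediaType = semicolon >= 0
                ? contentTypeHeader.substring(0, semicolon)
                : contentTypeHeader;
        // Media type names are case-insensitive, so normalise before comparing.
        return mediaType.trim().toLowerCase(Locale.ROOT).equals("text/html");
    }

    public static void main(String[] args) {
        System.out.println(isHtml("text/html; charset=UTF-8")); // true
        System.out.println(isHtml("Text/HTML"));                // true
        System.out.println(isHtml("application/pdf"));          // false
        System.out.println(isHtml(null));                       // false
    }
}
```

With a helper like this, the check on the HEAD response could be written as `if (!ContentTypes.isHtml(head.contentType())) return;`. Whether application/xhtml+xml should also count as HTML is a choice this sketch leaves out.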

[Comments]:

For some websites, using method(Connection.Method.HEAD) throws a 404 error, even though the URL is valid and does not return 404 otherwise.
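
Some servers answer HEAD differently from GET, as this comment reports. One possible workaround is to retry with GET when HEAD returns an error status, and to look only at the response headers before deciding whether to download the page. The sketch below uses the JDK's HttpURLConnection rather than jsoup. `ContentTypeProbe`, `Result`, `Fetcher`, `fetchHeaders`, and `looksLikeHtml` are hypothetical names, and treating only 2xx responses as candidates is an assumption of this sketch, not something the answer specifies.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.Locale;

// Hypothetical sketch: probe the Content-Type with HEAD, fall back to GET on error.
final class ContentTypeProbe {
    private ContentTypeProbe() {}

    // Status code plus the raw Content-Type header (null when absent).
    static final class Result {
        final int status;
        final String contentType;
        Result(int status, String contentType) {
            this.status = status;
            this.contentType = contentType;
        }
    }

    // Seam that lets the fallback logic run without a network.
    interface Fetcher {
        Result fetch(String url, String method) throws IOException;
    }

    // Real fetcher for http/https URLs: reads the status line and headers only.
    static Result fetchHeaders(String url, String method) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            conn.setRequestMethod(method);
            conn.setInstanceFollowRedirects(false);
            conn.setConnectTimeout(10_000);
            conn.setReadTimeout(10_000);
            return new Result(conn.getResponseCode(), conn.getContentType());
        } finally {
            conn.disconnect(); // the body stream is never opened
        }
    }

    static boolean looksLikeHtml(String url, Fetcher fetcher) throws IOException {
        Result r = fetcher.fetch(url, "HEAD");
        if (r.status >= 400) {
            // Some servers reject HEAD (e.g. 404 or 405) but serve GET normally.
            r = fetcher.fetch(url, "GET");
        }
        if (r.status < 200 || r.status >= 300 || r.contentType == null) {
            return false;
        }
        return r.contentType.toLowerCase(Locale.ROOT).startsWith("text/html");
    }

    public static void main(String[] args) throws IOException {
        // Fake server behaviour: HEAD fails with 404, GET succeeds with HTML.
        Fetcher fake = (url, method) -> "HEAD".equals(method)
                ? new Result(404, null)
                : new Result(200, "text/html; charset=UTF-8");
        System.out.println(looksLikeHtml("http://example.invalid/page", fake)); // true
    }
}
```

Against a real site the call would be `ContentTypeProbe.looksLikeHtml(url, ContentTypeProbe::fetchHeaders)`. The GET fallback closes the connection without opening the body stream, though the server may already have started sending data by then.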