How to get absolute URLs from the HTML of a webpage: answers

【Question title】: How to get absolute URLs from HTML of a webpage
【Posted】: 2015-01-02 00:00:09
【Question description】:

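Here is an example. If you open the source (Ctrl + U) of http://superior.edu.pk, you will see that there is no base URL and/or http-equiv etc. When you scroll down to the images, one URL is used to complete their relative paths, whereas when you check the HTML links (e.g. search for AdmissionSchedule.aspx), a different URL is used to resolve the relative path. My question is: how do I get these relative URLs as absolute URLs? I have tried jsoup's abs:href and element.absUrl("href"); both give me empty strings. Setting document.setBaseUri("http://www.example.com"); doesn't work either, because two different URLs are being used as base URLs.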

Any help will be appreciated.

Thanks

【Question comments】:

Tags: java html html-parsing jsoup web-crawler

【Solution 1】:

Why do you say there are two base URLs? All relative links are resolved against http://superior.edu.pk/presentation/user/ (it couldn't be otherwise!).

Try the following code:

    // Needs org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element
    // and org.jsoup.select.Elements.

    // When you fetch from a URL, you don't have to specify a base URL.
    Document doc = Jsoup.connect("http://superior.edu.pk/presentation/user/Default.aspx").get();
    // When you parse a file or a String, you do. The base URL is
    // http://superior.edu.pk/presentation/user/ of course.
    //Document doc = Jsoup.parse(Main.class.getResourceAsStream("page.htm"), "utf-8", "http://superior.edu.pk/presentation/user/");

    // Only an example. You can select any anchors you wish.
    Elements links = doc.select("div.footerMaterial > a");
    for (Element link : links) {
        // absUrl resolves the attribute value against the document's base URI.
        String attr = link.absUrl("href");
        System.out.println(attr);
    }

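You will see all the absolute URLs printed correctly: the relative links resolved against superior.edu.pk, and the absolute links pointing to their own domains (www.digitallibrary.edu.pk and www.google.com).

(EDIT)

You can also test this code: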

    Element link = doc.select(".logo > a:nth-child(1) > img:nth-child(1)").first();
    String attr = link.absUrl("src");
    System.out.println(attr);

It will give you:

http://superior.edu.pk/images/logo.jpg

Which is correct!

The explanation is that the relative URL is ../../images/logo.jpg, i.e. http://superior.edu.pk/presentation/user/../../images/logo.jpg, which resolves to http://superior.edu.pk/images/logo.jpg.

A page can only have one base URL!
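
The resolution described above can also be checked without jsoup. The following is a minimal sketch using only java.net.URI from the standard library; the class name RelativeUrlDemo and the helper resolve are made up for illustration, and the bare href value AdmissionSchedule.aspx is an assumption about how that link appears in the page source.

```java
import java.net.URI;

public class RelativeUrlDemo {

    // Hypothetical helper: resolve a link against the URL of the page it was found on.
    static String resolve(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        // The page URL used in the answer above.
        String page = "http://superior.edu.pk/presentation/user/Default.aspx";

        // Two ".." segments climb out of /presentation/user/.
        System.out.println(resolve(page, "../../images/logo.jpg"));
        // A plain file name stays inside /presentation/user/ (assumed href value).
        System.out.println(resolve(page, "AdmissionSchedule.aspx"));
        // A link that is already absolute is returned unchanged.
        System.out.println(resolve(page, "http://www.google.com"));
    }
}
```

Both relative links are resolved against the same base; the first one only looks like it has a different base because its ../ segments cancel out the /presentation/user/ part of the path.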

【Comments】:

For example, if you check "../../images/logo.jpg", it resolves to superior.edu.pk/images/logo.jpg, and I don't see "/presentation/user/" in that URL. That's why I said the relative URLs are resolved using 2 base URLs.
Well... ../../images/logo.jpg is also relative to superior.edu.pk/presentation/user/!! Check it: superior.edu.pk/presentation/user/../../images/logo.jpg