Crawling a URL in order to extract all the other URLs in that page

【Question title】: Crawling a URL in order to extract all the other URLs in that page
【Posted】: 2015-11-26 10:39:22
【Question】:

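I am trying to crawl a URL in order to extract the other URLs it contains. To do so, I read the page's HTML code line by line, match each line against a pattern, and then extract the parts I need, as shown below: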

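    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class SimpleCrawler {
        // Group 1: text between "https://www." and the domain extension.
        // Group 2: everything after the "/" that follows the extension.
        static String pattern = "https://www\\.([^&]+)\\.(?:com|net|org|)/([^&]+)";
        static Pattern UrlPattern = Pattern.compile(pattern);
        static Matcher UrlMatcher;

        public static void main(String[] args) {
            try {
                URL url = new URL("https://stackoverflow.com/");
                BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()));
                String line; // a variable cannot be declared inside the while condition
                while ((line = br.readLine()) != null) {
                    UrlMatcher = UrlPattern.matcher(line);
                    if (UrlMatcher.find()) {
                        String extractedPath = UrlMatcher.group(1);
                        String extractedPath2 = UrlMatcher.group(2);
                        System.out.println("http://www."+extractedPath+".com"+extractedPath2);
                    }
                }
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }
    }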

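However, there are a few problems I would like to solve:

1. How can I make http and www, or both, optional? I have come across many links that lack one or both of these parts, so the regex does not match them.
2. As my code shows, I use two groups: one from http up to the domain extension, and a second one for everything after it. However, this causes two sub-problems:
   2.1 Because the input is HTML, any HTML tags that follow the URL are extracted along with it.
   2.2 In System.out.println("http://www."+extractedPath+".com"+extractedPath2); I cannot be sure that the printed URL is correct (the previous problems aside), because I do not know which domain extension was matched.
3. Last but not least, how can I match both http and https?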

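【Comments】:

Just a thought: I did something similar recently, but I grabbed the whole tag instead. That worked for what I was doing, because the link, its title, and so on were already included in the data. It might help, depending on what you need to do. That way I got everything, no matter what the URL started or ended with. You could also add a filter to exclude internal page links.
@Dave Why not post that as an answer? I still need to improve this regex, though, so that it matches links with or without http, https, or www. for future analysis.
Because your question is specifically about regex, which I can't really help with, my post is only a suggestion or an idea.
@Dave I guess you would only need to add one line with a condition, so that only the content of the tag is passed to the regex, or something like that.
As @PeeHaa said in another post 20 minutes ago: stop trying to parse HTML with regex. Use an HTML parser instead; you should take a look at the jsoup library.

Tags: java regex web-crawler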

【Solution 1】:

How about:

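// subjectString is the string to test. matches() must cover the whole
// string, so this validates a single URL rather than finding URLs in a line.
try {
    boolean foundMatch = subjectString.matches(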
        "(?imx)^\n" +
        "(# Scheme\n" +
        " [a-z][a-z0-9+\\-.]*:\n" +
        " (# Authority & path\n" +
        "  //\n" +
        "  ([a-z0-9\\-._~%!$&'()*+,;=]+@)?              # User\n" +
        "  ([a-z0-9\\-._~%]+                            # Named host\n" +
        "  |\\[[a-f0-9:.]+\\]                            # IPv6 host\n" +
        "  |\\[v[a-f0-9][a-z0-9\\-._~%!$&'()*+,;=:]+\\])  # IPvFuture host\n" +
        "  (:[0-9]+)?                                  # Port\n" +
        "  (/[a-z0-9\\-._~%!$&'()*+,;=:@]+)*/?          # Path\n" +
        " |# Path without authority\n" +
        "  (/?[a-z0-9\\-._~%!$&'()*+,;=:@]+(/[a-z0-9\\-._~%!$&'()*+,;=:@]+)*/?)?\n" +
        " )\n" +
        "|# Relative URL (no scheme or authority)\n" +
        " ([a-z0-9\\-._~%!$&'()*+,;=@]+(/[a-z0-9\\-._~%!$&'()*+,;=:@]+)*/?  # Relative path\n" +
        " |(/[a-z0-9\\-._~%!$&'()*+,;=:@]+)+/?)                            # Absolute path\n" +
        ")\n" +
        "# Query\n" +
        "(\\?[a-z0-9\\-._~%!$&'()*+,;=:@/?]*)?\n" +
        "# Fragment\n" +
        "(\\#[a-z0-9\\-._~%!$&'()*+,;=:@/?]*)?\n" +
        "$");
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}
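
Note that the expression above validates one complete string. For scanning lines of HTML the way the question does, a much smaller pattern used with find() may be enough. The following is only a sketch under stated assumptions: the names UrlScanSketch, extract and extensionOf are made up for illustration, only .com/.net/.org hosts are considered (as in the question), and regex matching on raw HTML stays fragile. It makes the scheme and www. optional, accepts both http and https, stops the path at quotes, whitespace and angle brackets so that trailing HTML is not captured, and exposes the matched domain extension as its own group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
final class UrlScanSketch {
    // group(1) = optional scheme (http:// or https://)
    // group(2) = optional "www."
    // group(3) = host name without the extension
    // group(4) = domain extension (assumption: only com, net, org)
    // group(5) = optional path, cut off at whitespace, quotes and < >
    private static final Pattern URL_PATTERN = Pattern.compile(
        "(?i)\\b(https?://)?(www\\.)?([a-z0-9-]+(?:\\.[a-z0-9-]+)*)"
        + "\\.(com|net|org)\\b(/[^\\s\"'<>]*)?");

    // Returns every URL-like substring found in one line of text.
    static List<String> extract(String line) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(line);
        while (m.find()) {          // find() scans; matches() would need the whole line
            urls.add(m.group());
        }
        return urls;
    }

    // Returns the domain extension of a single URL, or null if it does not match.
    static String extensionOf(String url) {
        Matcher m = URL_PATTERN.matcher(url);
        return m.matches() ? m.group(4) : null;
    }

    public static void main(String[] args) {
        String line = "<a href=\"https://www.example.com/a/b\">x</a> see www.example.net/page";
        for (String url : extract(line)) {
            System.out.println(url + " (extension: " + extensionOf(url) + ")");
        }
    }
}
```

Using while (m.find()) instead of a single if also picks up several links on the same line, which the loop in the question would miss.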

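【Comments】:

【Solution 2】:

Just use a library. I used HtmlCleaner, and it does the job.

You can find it at: http://htmlcleaner.sourceforge.net/javause.php

Another example, using jsoup (not tested): http://jsoup.org/cookbook/extracting-data/example-list-links

It is quite readable.

You can enhance it to select other tags, other attributes besides HREF, and so on.

Or handle letter case more precisely (HreF, HRef, ...): left as an exercise.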

import java.util.Map;
import java.util.Vector;

import org.htmlcleaner.*;

// The class name is arbitrary; it only makes the method compilable.
public class HtmlLinks
{
public static Vector<String> HTML2URLS(String _source)
{
    Vector<String> result=new Vector<String>();

    HtmlCleaner cleaner = new HtmlCleaner();

    // Principal Node
    TagNode node = cleaner.clean(_source);

    // All nodes
    TagNode[] myNodes =node.getAllElements(true);

    int s=myNodes.length;
    for (int pos=0;pos<s;pos++)
        {
        TagNode tn=myNodes[pos];

        // all attributes
        Map<String,String> mss=tn.getAttributes();

        // Name of tag
        String name=tn.getName();

        // Is there href ?
        String href="";
        if (mss.containsKey("href")) href=mss.get("href");
        if (mss.containsKey("HREF")) href=mss.get("HREF");

        // Keep only <a> tags that actually carry an href,
        // so anchors without one no longer add empty strings
        boolean isAnchor = name.equals("a") || name.equals("A");
        if (isAnchor && href != null && !href.isEmpty()) result.add(href);
        }
    return result;
}
}
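
The href values collected this way can be absolute, protocol-relative (//host/path) or relative to the current page, so they usually need to be resolved against the URL of the page before they can be crawled. A minimal sketch using only java.net.URI follows. The names HrefResolverSketch and toAbsolute are hypothetical, and keeping only http/https links is an assumption about what a crawler wants.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HtmlCleaner or of the original answer.
final class HrefResolverSketch {
    // Resolves each href against baseUrl and keeps only http/https results.
    // Assumption: baseUrl has a path (at least a trailing "/"), which keeps
    // relative resolution well defined.
    static List<String> toAbsolute(String baseUrl, List<String> hrefs) {
        URI base = URI.create(baseUrl);
        List<String> result = new ArrayList<>();
        for (String href : hrefs) {
            if (href == null) continue;
            String trimmed = href.trim();
            // Skip empty values and in-page anchors such as "#top"
            if (trimmed.isEmpty() || trimmed.startsWith("#")) continue;
            try {
                URI resolved = base.resolve(trimmed);
                String scheme = resolved.getScheme();
                if ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme)) {
                    result.add(resolved.toString());
                }
            } catch (IllegalArgumentException ex) {
                // Malformed href (for example, one containing spaces): skip it
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> hrefs = new ArrayList<>();
        hrefs.add("/tags");
        hrefs.add("//example.org/x");
        hrefs.add("mailto:someone@example.com");
        for (String url : toAbsolute("https://stackoverflow.com/questions/1", hrefs)) {
            System.out.println(url);
        }
    }
}
```

The list returned by HTML2URLS could be passed in as the hrefs argument, together with the URL of the page that was downloaded.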

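【Comments】:

Where can I find `org.htmlcleaner`?
I mean the jar for this API.
Follow the "Download" page.
What do you mean by "half"? It gives you every URL that appears as an href attribute inside an A tag, with or without http, https, .dot, .net, ...
No, no, you misunderstood: I want those parts. I mean that my regex should return URLs not only for http but also for https, and not only links with www. but also links without it. I am not trying to strip those parts from the returned links.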