Jsoup tagName() gives wrong tag

【Question title】: Jsoup tagName() gives wrong tag
【Posted】: 2016-05-15 00:30:40
【Question】:

I have the following HTML:

    <p>                         
     <a href="http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959" rel="nofollow"> Jill Martin rescues Savannah Guthrie from her guest room mess </a>   
    <a href="http://www.today.com/video/4-simple-ways-to-clear-your-clutter-this-year-596741699678" rel="nofollow"> 4 simple ways to clear your clutter this year </a>   
    <a href="http://www.today.com/video/staying-home-on-new-years-eve-great-ideas-to-celebrate-at-home-594027587814" rel="nofollow"> Staying home on New Year's Eve? Great ideas to celebrate at home </a>   
    <a href="http://www.today.com/video/heres-how-to-set-a-functional-christmas-table-591622211749" rel="nofollow"> Here's how to set a functional Christmas table </a>    
    </p>

This snippet comes from the page http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861

And this code:

    Document document = Jsoup.connect("http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861").get();
    String tag = null;
    for (Element element : document.select("*")) {
        tag = element.tagName();

        if ("a".equalsIgnoreCase(tag)) {
            LOGGER.info("element : {}; nextElementSibling: {}", element.ownText(), element.nextElementSibling());
        }

        if (StringUtils.containsIgnoreCase(element.ownText(), "Jill Martin rescues Savannah")) {
            LOGGER.info("element : {}; nextElementSibling: {}", element.ownText(), element.nextElementSibling());
            LOGGER.info("tag : {}; nextNodeSibling: {}", tag, element.nextSibling());
            LOGGER.info("element : {}; previousElementSibling: {}", element.ownText(), element.previousElementSibling());
        }
    }

The output I get:

    element : Jill Martin rescues Savannah Guthrie from her guest room mess; nextElementSibling: null
    tag : h2; nextNodeSibling:  
    element : Jill Martin rescues Savannah Guthrie from her guest room mess; previousElementSibling: null

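There are several problems:

In the main HTML source there are many elements tagged a, but none of the elements in the small HTML fragment I am inspecting is reported as one
It seems the <a> is captured as <h2>
element.nextElementSibling() is null in most cases

However, if I test against the small fragment alone, the problems disappear. So Jsoup seems unable to identify tags correctly when they appear inside a larger HTML source.

Any idea why?

Thanks.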

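EDIT 2

The purpose behind the exercise is to clean up a web page. That is why I traverse the whole HTML rather than a specific part, as @Stephan suggested. I only picked out one specific part where I saw a problem.

But after checking the reply from @luksch, I went back to the original HTML and found where the anomaly occurs. The code looks at all tags in general but makes an exception for a. In the main source we have article, then a, figure (which contains i, img, img, small, small), and h2. The problem seems to be that all tags except a are removed (which works as intended), but their text is left behind. That is why I ended up with <a href="http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959" rel="nofollow"> Jill Martin rescues Savannah Guthrie from her guest room mess </a>, which is not in the original HTML source.

"Jill Martin rescues Savannah Guthrie from her guest room mess" is the text from the <h2>, but the <h2> was removed and its text was left behind. Interestingly, Jsoup still reports the text as coming from h2, even though the final output contains no h2.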
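The effect described in this edit can be reproduced in isolation. What follows is only a minimal sketch: it assumes the cleanup step behaves like jsoup's unwrap(), which drops an element but keeps its child nodes (the question does not say which mechanism was actually used). The class name UnwrapKeepsText and the shortened markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class UnwrapKeepsText {
    // Hypothetical, shortened stand-in for the original page: the headline sits in <h2>, inside <a>.
    static final String HTML =
            "<article>"
          + "<a href=\"http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959\">"
          + "<h2>Jill Martin rescues Savannah Guthrie from her guest room mess</h2>"
          + "</a>"
          + "</article>";

    /** Returns the own text of the first <a>, optionally after unwrapping the <h2> nested in it. */
    public static String anchorOwnText(boolean unwrapHeadline) {
        Document doc = Jsoup.parse(HTML);
        Element anchor = doc.select("a").first();
        if (unwrapHeadline) {
            // unwrap() removes the <h2> element itself and moves its text node up into the <a>.
            anchor.select("h2").unwrap();
        }
        return anchor.ownText();
    }

    public static void main(String[] args) {
        System.out.println("before unwrap: [" + anchorOwnText(false) + "]");
        System.out.println("after unwrap:  [" + anchorOwnText(true) + "]");
    }
}
```

Before the unwrap the anchor has no text of its own; afterwards the headline belongs directly to the anchor. If the loop in the question ran against the untouched document while the output shown came from a cleaned copy, that would explain why tagName() still reported h2.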

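【Question comments】:

The snippet is part of a larger piece of code. The original link is http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861. So the larger document should be Document doc = Jsoup.connect("http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861").get();
The URL gives me a 404.
@luksch, it seems something went wrong when I copy-pasted. Here is the call: Jsoup.connect("today.com/home/…;. The word after 'living' is 'luxury', but the copy-paste got it wrong.
Please edit your question and show the error in a reproducible way.

Tags: java html-parsing jsoup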

【Solution 1】:

The URL you provided contains this element:

    <a class="player-tease-link" href="http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959">
    <figure class="player-tease">
      <i class="player-tease-icon icon-video-play"></i>
      <img class="tease-icon-play" src="http://nodeassets.today.com/img/svg/641a740d.video-play-white.svg" alt="Play">
      <img class="tease-image" src="http://media1.s-nbcnews.com/j/MSNBC/Components/Video/__NEW/tdy_guth_clutter_160120.today-vid-post-small-desktop.jpg" title="Jill Martin rescues Savannah Guthrie from her guest room mess" alt="Jill Martin rescues Savannah Guthrie from her guest room mess">
      <small class="tease-sponsored">Sponsored Content</small>
      <small class="tease-playing">Now Playing</small>
    </figure>
    <h2 class="player-tease-headline">Jill Martin rescues Savannah Guthrie from her guest room mess</h2>
    </a>

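It seems you are comparing apples with oranges: the HTML snippet you provided is not part of the original HTML. I guess you used some extraction tool that had already altered the HTML. Note that the a element does not contain any text of its own!

Following @Stephan's advice and learning how to use CSS selectors properly is a good idea. That should be far more efficient than selecting everything and then filtering by hand in program code. Here is an example of what you could do:

    Elements interestingAs = document.select("a:matches(^Jill Martin)");

This selects all a elements whose text (the combined text of the element and its descendants) starts with "Jill Martin".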
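To make the difference between an element's own text and its combined text concrete, here is a minimal sketch run against a shortened copy of the element quoted above. The class name OwnTextVersusText and its helper methods are hypothetical. One caveat under these assumptions: with markup shaped like this, the anchor's combined text begins with the small captions, so a:matches(^Jill Martin) finds nothing, and the sketch matches on own text instead and then walks up to the enclosing link.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class OwnTextVersusText {
    // Shortened copy of the element quoted in this answer (the <i> and <img> children are left out).
    static final String HTML =
            "<a class=\"player-tease-link\" href=\"http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959\">"
          + "<figure class=\"player-tease\">"
          + "<small class=\"tease-sponsored\">Sponsored Content</small>"
          + "<small class=\"tease-playing\">Now Playing</small>"
          + "</figure>"
          + "<h2 class=\"player-tease-headline\">Jill Martin rescues Savannah Guthrie from her guest room mess</h2>"
          + "</a>";

    /** Own text of the <a>: only text nodes that are direct children of the anchor count. */
    public static String anchorOwnText() {
        return Jsoup.parse(HTML).select("a").first().ownText();
    }

    /** Number of anchors whose combined text (descendants included) starts with the headline. */
    public static int anchorsMatchedByCombinedText() {
        return Jsoup.parse(HTML).select("a:matches(^Jill Martin)").size();
    }

    /** Tag name of the element that directly owns the headline text. */
    public static String headlineOwnerTag() {
        Element owner = Jsoup.parse(HTML).select(":matchesOwn(^Jill Martin)").first();
        return owner == null ? null : owner.tagName();
    }

    /** href of the nearest enclosing <a> around the element that owns the headline, or null. */
    public static String enclosingLinkHref() {
        Document doc = Jsoup.parse(HTML);
        Element e = doc.select(":matchesOwn(^Jill Martin)").first();
        // Walk up the ancestor chain until an <a> is reached or the tree ends.
        while (e != null && !"a".equals(e.tagName())) {
            e = e.parent();
        }
        return e == null ? null : e.attr("href");
    }

    public static void main(String[] args) {
        System.out.println("a.ownText(): [" + anchorOwnText() + "]");
        System.out.println("anchors matched by a:matches: " + anchorsMatchedByCombinedText());
        System.out.println("headline owner: " + headlineOwnerTag());
        System.out.println("enclosing link: " + enclosingLinkHref());
    }
}
```

This is consistent with the output in the question: the loop there tests ownText(), so the element that matches is the h2, and an h2 that is the last child of its parent has no next element sibling.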

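【Comments】:

I looked at the source HTML, compared it with the final output I got, and found the anomaly. In short, some tags were removed but their text was left behind. If the parent was not removed, the leftover text is attributed to that parent tag. We end up with final output in which the text sits under the wrong tag.

【Solution 2】:

I think the selector needs to be more specific.

Try document.select("a") instead of document.select("*").

【Comments】:

【Solution 3】:

This is not reproducible for me. The following program prints exactly what you would expect: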

    import org.apache.commons.lang3.StringUtils; // assumed: Apache Commons Lang 3
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class SnippetCheck { // class name chosen for illustration
        public static void main(String[] args) {
            String html = ""
                    + "<p>"
                    + "    <a href=\"http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959\" rel=\"nofollow\"> Jill Martin rescues Savannah Guthrie from her guest room mess </a>  "
                    + "    <a href=\"http://www.today.com/video/4-simple-ways-to-clear-your-clutter-this-year-596741699678\" rel=\"nofollow\"> 4 simple ways to clear your clutter this year </a>  "
                    + "    <a href=\"http://www.today.com/video/staying-home-on-new-years-eve-great-ideas-to-celebrate-at-home-594027587814\" rel=\"nofollow\"> Staying home on New Year's Eve? Great ideas to celebrate at home </a>  "
                    + "    <a href=\"http://www.today.com/video/heres-how-to-set-a-functional-christmas-table-591622211749\" rel=\"nofollow\"> Here's how to set a functional Christmas table </a>   "
                    + "</p>";

            // Parse the snippet on its own instead of fetching the full page.
            Document doc = Jsoup.parse(html);

            for (Element element : doc.select("*")) {
                String tag = element.tagName();

                // Every <a> element: print its own text and its next element sibling.
                if ("a".equalsIgnoreCase(tag)) {
                    System.out.println("element : " + element.ownText() + "; nextElementSibling: " + element.nextElementSibling());
                }
                // The element that directly owns the headline text.
                if (StringUtils.containsIgnoreCase(element.ownText(), "Jill Martin rescues Savannah")) {
                    System.out.println("element : " + element.ownText() + "; nextElementSibling: " + element.nextElementSibling());
                    System.out.println("tag : " + tag + "; nextNodeSibling: " + element.nextSibling());
                    System.out.println("element : " + element.ownText() + "; previousElementSibling: " + element.previousElementSibling());
                }
            }
        }
    }

The result is:

    element : Jill Martin rescues Savannah Guthrie from her guest room mess; nextElementSibling: <a href="http://www.today.com/video/4-simple-ways-to-clear-your-clutter-this-year-596741699678" rel="nofollow"> 4 simple ways to clear your clutter this year </a>
    tag : a; nextNodeSibling:  
    element : Jill Martin rescues Savannah Guthrie from her guest room mess; previousElementSibling: null
    element : 4 simple ways to clear your clutter this year; nextElementSibling: <a href="http://www.today.com/video/staying-home-on-new-years-eve-great-ideas-to-celebrate-at-home-594027587814" rel="nofollow"> Staying home on New Year's Eve? Great ideas to celebrate at home </a>
    element : Staying home on New Year's Eve? Great ideas to celebrate at home; nextElementSibling: <a href="http://www.today.com/video/heres-how-to-set-a-functional-christmas-table-591622211749" rel="nofollow"> Here's how to set a functional Christmas table </a>
    element : Here's how to set a functional Christmas table; nextElementSibling: null

Maybe you are using the wrong JSoup version? The above was run with version 1.8.3.

【Comments】:

The snippet is part of a larger piece of code. I only extracted the part that I believed was not working. In general, I am trying to parse the content of http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861 (which contains the snippet I posted). Instead of Document doc = Jsoup.parse(html); try Document doc = Jsoup.connect("http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861").get();
The earlier copy-paste was broken. The correct call is Jsoup.connect("http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861").get();