【Question title】: Jsoup tagName() gives wrong tag
【Posted】: 2016-05-15 00:30:40
【Question】:

I have the following HTML:

    <p>                         
     <a href="http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959" rel="nofollow"> Jill Martin rescues Savannah Guthrie from her guest room mess </a>   
    <a href="http://www.today.com/video/4-simple-ways-to-clear-your-clutter-this-year-596741699678" rel="nofollow"> 4 simple ways to clear your clutter this year </a>   
    <a href="http://www.today.com/video/staying-home-on-new-years-eve-great-ideas-to-celebrate-at-home-594027587814" rel="nofollow"> Staying home on New Year's Eve? Great ideas to celebrate at home </a>   
    <a href="http://www.today.com/video/heres-how-to-set-a-functional-christmas-table-591622211749" rel="nofollow"> Here's how to set a functional Christmas table </a>    
    </p>                        

This text comes from the web page http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861

And this piece of code:

    Document document = Jsoup.connect("http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861").get(); 
    String tag = null;
    for (Element element : document.select("*") ) { 
        tag = element.tagName();

        if ( "a".equalsIgnoreCase( tag ) ) {
            LOGGER.info("element : {}; nextElementSibling: {}", element.ownText(), element.nextElementSibling() );
        }


        if ( StringUtils.containsIgnoreCase(element.ownText(), "Jill Martin rescues Savannah") ) {
            LOGGER.info("element : {}; nextElementSibling: {}", element.ownText(), element.nextElementSibling() );
            LOGGER.info("tag : {}; nextNodeSibling: {}", tag, element.nextSibling() );
            LOGGER.info("element : {}; previousElementSibling: {}", element.ownText(), element.previousElementSibling() );
        }

    }

The output I get:

    element : Jill Martin rescues Savannah Guthrie from her guest room mess; nextElementSibling: null
    tag : h2; nextNodeSibling:  
    element : Jill Martin rescues Savannah Guthrie from her guest room mess; previousElementSibling: null

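There are several problems:

  1. In the main HTML source there are many elements tagged a, but none of them come from the small HTML snippet I am inspecting
  2. It seems <a> is being captured as <h2>
  3. element.nextElementSibling() is null in most cases

However, if I test against the small snippet alone, the problems disappear. So Jsoup seems unable to identify tags correctly when they appear inside a larger HTML source.

Any idea why?

Thanks.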

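Edit 2

The purpose behind the exercise is to clean up web pages. That is why I go over the whole HTML rather than a specific section, as @Stephan suggested. I only picked one specific section that looked problematic.

But after checking the reply from @luksch, I revisited the original HTML and found where the anomaly occurs. The code looks at all tags in general, but makes an exception for a. In the main source we have article, then a, then figure (which contains i, img, img, small, small), then h2. The issue seems to be that all tags except a are removed (which works as required), but their text is left behind. This is why I ended up with <a href="http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959" rel="nofollow"> Jill Martin rescues Savannah Guthrie from her guest room mess </a>, which is not in the original HTML source.

"Jill Martin rescues Savannah Guthrie from her guest room mess" is the text from <h2>, but <h2> was removed and its text was left behind. Interestingly, Jsoup still identifies the text as coming from h2, even though the final output has no h2.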
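The cleaning code itself is not shown in the question, so the following is only a minimal sketch of how such a result can arise, assuming the cleaner unwraps every tag except a. The class name UnwrapDemo, the helper clean() and the reduced markup are hypothetical; Element.unwrap() is the jsoup call that removes a tag while keeping its children.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class UnwrapDemo {

    // Reduced, hypothetical version of the teaser markup on the live page.
    public static final String ORIGINAL =
            "<a href=\"http://www.today.com/video/example\">"
          + "<figure><img src=\"thumb.jpg\" alt=\"thumbnail\"></figure>"
          + "<h2>Jill Martin rescues Savannah Guthrie from her guest room mess</h2>"
          + "</a>";

    // Assumed cleaning rule: unwrap every element inside <body> except <a>.
    public static Document clean(String html) {
        Document doc = Jsoup.parse(html);
        // select() returns a snapshot list, so the DOM can be modified while looping.
        for (Element el : doc.select("body *")) {
            if (!"a".equals(el.tagName())) {
                el.unwrap(); // drops the tag itself and moves its children up to the parent
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        Document cleaned = clean(ORIGINAL);
        Element link = cleaned.select("a").first();
        System.out.println(cleaned.body().html());
        System.out.println("h2 elements left: " + cleaned.select("h2").size());
        System.out.println("a ownText: " + link.ownText());
    }
}
```

After cleaning, the headline text sits directly inside the a element, which matches the snippet shown at the top of the question. A loop that still reports h2 for that text would then be running over the document as parsed, before the cleaning step.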

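【Comments】:

  • The snippet is part of a larger piece of code. The original link is http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861. So the larger document should be Document doc = Jsoup.connect("http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861").get();
  • The URL gives me a 404
  • @luksch, it seems something went wrong when I copy-pasted. This is the call: Jsoup.connect("today.com/home/…;. The word after 'living' is 'luxury', but the copy-paste got it wrong.
  • Please edit your question and show the error in a reproducible way.

Tags: java html-parsing jsoup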


【Solution 1】:

The URL you provided contains this element:

<a class="player-tease-link" href="http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959">
<figure class="player-tease">
  <i class="player-tease-icon icon-video-play"></i>
  <img class="tease-icon-play" src="http://nodeassets.today.com/img/svg/641a740d.video-play-white.svg" alt="Play">
  <img class="tease-image" src="http://media1.s-nbcnews.com/j/MSNBC/Components/Video/__NEW/tdy_guth_clutter_160120.today-vid-post-small-desktop.jpg" title="Jill Martin rescues Savannah Guthrie from her guest room mess" alt="Jill Martin rescues Savannah Guthrie from her guest room mess">
  <small class="tease-sponsored">Sponsored Content</small>
  <small class="tease-playing">Now Playing</small>
</figure>
<h2 class="player-tease-headline">Jill Martin rescues Savannah Guthrie from her guest room mess</h2>
</a>

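It seems you are comparing apples with oranges: the HTML snippet you provided is not part of the original HTML. I guess you used some extraction tool that had already altered the HTML. Note that the a element does not contain any text of its own!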
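To make the "no text of its own" point concrete, here is a small sketch that compares ownText() and text() on a reduced version of the element above. The class name OwnTextDemo, the helper methods and the shortened markup are assumptions for illustration; the loop in ownerTagOf() mirrors the question's loop but uses String.contains instead of StringUtils.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class OwnTextDemo {

    // Reduced, hypothetical version of the teaser markup quoted above.
    public static final String HTML =
            "<a class=\"player-tease-link\" href=\"http://www.today.com/video/example\">"
          + "<figure class=\"player-tease\"><small class=\"tease-playing\">Now Playing</small></figure>"
          + "<h2 class=\"player-tease-headline\">Jill Martin rescues Savannah Guthrie from her guest room mess</h2>"
          + "</a>";

    // Returns the first element matching the CSS query in the demo markup.
    public static Element first(String css) {
        return Jsoup.parse(HTML).select(css).first();
    }

    // Returns the tag name of the first element whose own text contains the needle, or null.
    public static String ownerTagOf(String needle) {
        Document doc = Jsoup.parse(HTML);
        for (Element element : doc.select("*")) {
            if (element.ownText().contains(needle)) {
                return element.tagName();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Element link = first("a.player-tease-link");
        Element headline = first("h2.player-tease-headline");
        // ownText() covers only text nodes that are direct children of the element.
        System.out.println("a  ownText: '" + link.ownText() + "'");
        // text() also includes the text of all descendants.
        System.out.println("a  text   : '" + link.text() + "'");
        System.out.println("h2 ownText: '" + headline.ownText() + "'");
        System.out.println("owner tag : " + ownerTagOf("Jill Martin rescues Savannah"));
    }
}
```

On this markup the headline is owned by the h2, so a check based on ownText() reports h2 rather than a, which is consistent with the output in the question.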

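Following @Stephan's advice and learning how to use CSS selectors properly is a good idea. It should be far more efficient than selecting everything and then filtering manually in program code. Here is an example of what you could do: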

 Elements interestingAs = document.select("a:matches(^Jill Martin)");

This selects every a element whose text starts with "Jill Martin".
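One detail worth checking before relying on that selector: :matches tests the element's combined text, descendants included, so an anchored pattern behaves differently on the cleaned snippet and on the original teaser markup. The sketch below counts matches for a few selectors; the class name SelectorDemo, the count() helper, the placeholder hrefs and the reduced markup are assumptions for illustration.

```java
import org.jsoup.Jsoup;

public class SelectorDemo {

    // Two links from the cleaned snippet in the question; the short hrefs are placeholders.
    public static final String CLEANED = "<p>"
          + "<a href=\"/video/1\"> Jill Martin rescues Savannah Guthrie from her guest room mess </a>"
          + "<a href=\"/video/2\"> 4 simple ways to clear your clutter this year </a>"
          + "</p>";

    // Reduced, hypothetical version of the teaser markup from the live page.
    public static final String ORIGINAL =
            "<a class=\"player-tease-link\" href=\"/video/1\">"
          + "<figure><small>Now Playing</small></figure>"
          + "<h2>Jill Martin rescues Savannah Guthrie from her guest room mess</h2>"
          + "</a>";

    // Parses the HTML and returns how many elements the CSS query selects.
    public static int count(String html, String css) {
        return Jsoup.parse(html).select(css).size();
    }

    public static void main(String[] args) {
        // The link text itself starts with the phrase, so the anchored pattern matches.
        System.out.println(count(CLEANED, "a:matches(^Jill Martin)"));
        // Here the combined text of the link starts with "Now Playing", so ^ does not match.
        System.out.println(count(ORIGINAL, "a:matches(^Jill Martin)"));
        // :has with :matchesOwn targets links whose h2 child owns the headline text.
        System.out.println(count(ORIGINAL, "a:has(h2:matchesOwn(^Jill Martin))"));
    }
}
```

On these inputs the three counts are 1, 0 and 1. For the live page, a selector built around the h2 headline is therefore likely to be more reliable than an anchored match on the link text.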

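【Comments】:

  • I looked at the source HTML, compared it with the final output I got, and found the anomaly. In short, some tags were removed but their text was left behind. If the parent was not removed, the leftover text is assigned to that parent tag. We end up with final output in which text is attached to the wrong tag.
【Solution 2】:

I think the selector needs to be more specific.

Try document.select("a") instead of document.select("*")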
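As a rough illustration of that suggestion, the sketch below loops over document.select("a") only, so no tag-name check is needed inside the loop. The class name LinkLoopDemo, the describeLinks() helper and the placeholder hrefs are assumptions, not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkLoopDemo {

    // Two links from the question's snippet; the short hrefs are placeholders.
    public static final String HTML = "<p>"
          + "<a href=\"/video/1\" rel=\"nofollow\"> Jill Martin rescues Savannah Guthrie from her guest room mess </a>"
          + "<a href=\"/video/2\" rel=\"nofollow\"> 4 simple ways to clear your clutter this year </a>"
          + "</p>";

    // Builds one line per link: its text and the text of the next sibling element, if any.
    public static List<String> describeLinks(String html) {
        Document doc = Jsoup.parse(html);
        List<String> lines = new ArrayList<>();
        for (Element link : doc.select("a")) {
            Element next = link.nextElementSibling(); // null when the link is the last element child
            lines.add(link.text() + " -> " + (next == null ? "none" : next.text()));
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describeLinks(HTML)) {
            System.out.println(line);
        }
    }
}
```

This narrows the iteration to links, but it does not change which element owns a given piece of text, so it only helps when the text really sits inside the a element.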

【Comments】:

【Solution 3】:

I cannot reproduce this. The following program prints exactly what you would expect:

    String html = ""
            +"<p>"
            +"    <a href=\"http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959\" rel=\"nofollow\"> Jill Martin rescues Savannah Guthrie from her guest room mess </a>  "
            +"    <a href=\"http://www.today.com/video/4-simple-ways-to-clear-your-clutter-this-year-596741699678\" rel=\"nofollow\"> 4 simple ways to clear your clutter this year </a>  "
            +"    <a href=\"http://www.today.com/video/staying-home-on-new-years-eve-great-ideas-to-celebrate-at-home-594027587814\" rel=\"nofollow\"> Staying home on New Year's Eve? Great ideas to celebrate at home </a>  "
            +"    <a href=\"http://www.today.com/video/heres-how-to-set-a-functional-christmas-table-591622211749\" rel=\"nofollow\"> Here's how to set a functional Christmas table </a>   "
            +"</p>";
    
    Document doc = Jsoup.parse(html);
    
    String tag = null;
    for (Element element : doc.select("*") ) { 
        tag = element.tagName();
    
        if ( "a".equalsIgnoreCase( tag ) ) {
            System.out.println("element : "+element.ownText()+"; nextElementSibling: "+element.nextElementSibling()+"" );
    
        }
        if ( StringUtils.containsIgnoreCase(element.ownText(), "Jill Martin rescues Savannah") ) {
            System.out.println("element : "+element.ownText()+"; nextElementSibling: "+element.nextElementSibling()+"" );
            System.out.println("tag : "+tag+"; nextNodeSibling: "+element.nextSibling()+"" );
            System.out.println("element : "+element.ownText()+"; previousElementSibling: "+element.previousElementSibling()+"" );   
        }
    }
    

The result is:

    element : Jill Martin rescues Savannah Guthrie from her guest room mess; nextElementSibling: <a href="http://www.today.com/video/4-simple-ways-to-clear-your-clutter-this-year-596741699678" rel="nofollow"> 4 simple ways to clear your clutter this year </a>
    tag : a; nextNodeSibling:  
    element : Jill Martin rescues Savannah Guthrie from her guest room mess; previousElementSibling: null
    element : 4 simple ways to clear your clutter this year; nextElementSibling: <a href="http://www.today.com/video/staying-home-on-new-years-eve-great-ideas-to-celebrate-at-home-594027587814" rel="nofollow"> Staying home on New Year's Eve? Great ideas to celebrate at home </a>
    element : Staying home on New Year's Eve? Great ideas to celebrate at home; nextElementSibling: <a href="http://www.today.com/video/heres-how-to-set-a-functional-christmas-table-591622211749" rel="nofollow"> Here's how to set a functional Christmas table </a>
    element : Here's how to set a functional Christmas table; nextElementSibling: null
    

Maybe you are using a different jsoup version? The above was run with version 1.8.3.

【Comments】:

  • The snippet is part of a larger piece of code. I only extracted the part that I thought was not working. In general, I am trying to parse the content of http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861 (which contains the snippet I posted). Instead of Document doc = Jsoup.parse(html); try Document doc = Jsoup.connect("http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861").get();
  • The earlier copy-paste was broken. The correct call is Jsoup.connect("http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861").get();