Parsing HTML Element by Element in Java

【Question title】: Parsing HTML Element By Element at Java
【Posted】: 2014-11-13 04:29:29
【Question】:

I have an HTML file:

<div>
   DIV1
    <div>
      DIV2
       <div>
          DIV3
       </div>
    </div>
</div>

I want to parse that HTML, but I don't want the whole parsed HTML back as a single string:

DIV1 DIV2 DIV3

I want to get the values element by element, with none of them repeated. That is, I don't want this:

When you get the value of the first div, it is:

DIV1 DIV2 DIV3

The value of the second div:

DIV2 DIV3

The value of the third div:

DIV3

The result I don't want is:

DIV1 DIV2 DIV3
DIV2 DIV3
DIV3

The result I want is:

DIV1
DIV2
DIV3

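I am going to run some processing on each value, so I don't want duplicated values. I'd like to solve this with a Java parser. I considered Jsoup, but when I use it, it parses the whole HTML.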

【Comments】:

Tags: java html parsing

【Solution 1】:

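It sounds like you want to perform a pre-order depth-first search over all the text nodes in the HTML document. Luckily, most parsing libraries, XML ones included, will hand you all the nodes in pre-order as an iterator.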
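To make the traversal concrete, here is a minimal sketch of a pre-order depth-first walk that collects only the text nodes. It uses the JDK's built-in XML DOM parser, so it assumes the input is well-formed XML (true for the sample in the question, often not true for real-world HTML). The class and method names (`PreOrderText`, `textNodes`, `visit`) are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class PreOrderText {

    // Returns the trimmed content of every non-blank text node, in pre-order.
    // Assumption: the markup is well-formed XML; otherwise parsing fails.
    static List<String> textNodes(String xml) {
        try {
            Node root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            List<String> out = new ArrayList<>();
            visit(root, out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    // Pre-order: handle the node itself first, then its children left to right.
    private static void visit(Node node, List<String> out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                out.add(text);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            visit(child, out);
        }
    }

    public static void main(String[] args) {
        String html = "<div>\n DIV1\n <div>\n DIV2\n <div>\n DIV3\n </div>\n </div>\n</div>";
        for (String text : textNodes(html)) {
            System.out.println(text);
        }
    }
}
```

Because each text node belongs to exactly one element, every value shows up once: for the sample above the loop prints DIV1, DIV2 and DIV3 on separate lines.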

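I suggest using Jericho: call getNodeIterator(), check whether each node is a text node, and print it if it is. Note that the link has sample code, but for your convenience I have pasted it here: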

 for (Iterator<Segment> nodeIterator=segment.getNodeIterator(); nodeIterator.hasNext();) {
   Segment nodeSegment=nodeIterator.next();
   if (nodeSegment instanceof Tag) {
     Tag tag=(Tag)nodeSegment;
     // HANDLE TAG
     // Uncomment the following line to ensure each tag is valid XML:
     // writer.write(tag.tidy()); continue;
   } else if (nodeSegment instanceof CharacterReference) {
     CharacterReference characterReference=(CharacterReference)nodeSegment;
     // HANDLE CHARACTER REFERENCE
     // Uncomment the following line to decode all character references instead of copying them verbatim:
     // characterReference.appendCharTo(writer); continue;
   } else {
     // HANDLE PLAIN TEXT
   }
   // unless specific handling has prevented getting to here, simply output the segment as is:
   //writer.write(nodeSegment.toString());
 }

The // HANDLE CHARACTER REFERENCE and // HANDLE PLAIN TEXT branches are where you would add your string-appending code.
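If Jericho is not on the classpath, the same iterator-style loop can be sketched with the JDK's DOM traversal API instead. This is a substitute technique, not the Jericho code above. It assumes well-formed XML input and assumes the JDK's bundled DOM implementation supports the traversal module (hence the cast to `DocumentTraversal`). The names `TextNodeIteratorDemo` and `ownTexts` are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;
import org.xml.sax.InputSource;

class TextNodeIteratorDemo {

    // Iterates over text nodes in document order and appends each non-blank one.
    static List<String> ownTexts(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Assumption: the Document implementation also implements DocumentTraversal.
            NodeIterator it = ((DocumentTraversal) doc).createNodeIterator(
                    doc.getDocumentElement(), NodeFilter.SHOW_TEXT, null, true);
            List<String> out = new ArrayList<>();
            for (Node n = it.nextNode(); n != null; n = it.nextNode()) {
                // This is the "handle plain text" step: append the value.
                String text = n.getNodeValue().trim();
                if (!text.isEmpty()) {
                    out.add(text);
                }
            }
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(ownTexts("<div>DIV1<div>DIV2<div>DIV3</div></div></div>"));
    }
}
```

The `NodeFilter.SHOW_TEXT` mask plays the role of the `instanceof` checks in the Jericho loop: element nodes are skipped by the iterator, so only the text branch is left to handle.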

【Comments】: