How can I cut an HTML table into separate tables?

【Question title】: How can I cut an HTML table into separate tables?
【Posted】: 2013-06-14 07:32:45
【Question description】:

I have an HTML document parsed by jsoup. It contains a table with several rows:

<table>
    <tbody>
        <tr>...</tr>
        <tr>...</tr>
        <tr>...</tr>
        <tr>...</tr>
        <tr>...</tr>
        <tr>...</tr>
    </tbody>
</table>

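Some of the rows are a kind of header. I find these rows with jsoup's select(...) method, so I have an Elements object containing all the rows that are headers. Suppose it looks like this: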

<table>
    <tbody>
        <tr id="tr1">...</tr> 
        <tr id="tr2">...</tr> // this is header
        <tr id="tr3">...</tr>
        <tr id="tr4">...</tr>
        <tr id="tr5">...</tr> // this is header
        <tr id="tr6">...</tr>
    </tbody>
</table>

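The id attributes are only for this example; in the real case the parsed HTML has no id attributes.

What I need is to get 2 tables (2 Element objects, one per table), one for each header, each containing all the rows below the given header but above the next header. So I expect: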

<table> // Element 1
    <tbody>
        <tr id="tr3">...</tr>
        <tr id="tr4">...</tr>
    </tbody>
</table>

<table> // Element 2
    <tbody>
        <tr id="tr6">...</tr>
    </tbody>
</table>

Can anyone help me with this task?

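【Comments】:

??? What do you mean? I have no influence over the layout of the parsed document; I have to transform it...
jsoup parses HTML. It does not manipulate the structure.

Tags: java html jsoup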

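【Solution 1】:

This is a nice exercise for testing jsoup's ability to manipulate the DOM. Below is the snippet you need. The code is almost self-explanatory (createElement creates an element, and so on), but let me know if you need any clarification: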

Elements tables = new Elements();
for (Element headerTR : headerRows) {
    Element tbody = doc.createElement("tbody");
    Element firstSiblingTR = headerTR.nextElementSibling();
    if (firstSiblingTR != null) {
        Element secondSiblingTR = firstSiblingTR.nextElementSibling();
        tbody.appendChild(firstSiblingTR);
        if (secondSiblingTR != null) {
            tbody.appendChild(secondSiblingTR);
        }
    }
    Element table = doc.createElement("table");
    table.appendChild(tbody);
    tables.add(table);
}

Example usage:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SplitTableExample {
    public static void main(String[] args) {
        Document doc = Jsoup.parse("<html><body>" +
        "<table>" +
        "  <tbody>" +
        "    <tr><td>1</td></tr>" +
        "    <tr class='header'><td>2</td></tr>" + // class added to simulate your list
        "    <tr><td>3</td></tr>" +
        "    <tr><td>4</td></tr>" +
        "    <tr class='header'><td>5</td></tr>" + // class added to simulate your list
        "    <tr><td>6</td></tr>" +
        "  </tbody>" +
        "</table>" +
        "</body></html>");

        Elements headerRows = doc.getElementsByClass("header"); // simulating your list

        Elements tables = new Elements();
        for (Element headerTR : headerRows) {
            Element tbody = doc.createElement("tbody");
            // Takes at most the two rows that follow the header
            Element firstSiblingTR = headerTR.nextElementSibling();
            if (firstSiblingTR != null) {
                // Look up the second sibling before the first one is moved
                Element secondSiblingTR = firstSiblingTR.nextElementSibling();
                tbody.appendChild(firstSiblingTR); // appendChild moves the row out of the original table
                if (secondSiblingTR != null) {
                    tbody.appendChild(secondSiblingTR);
                }
            }
            Element table = doc.createElement("table");
            table.appendChild(tbody);
            tables.add(table);
        }
        System.out.println(tables); // print <table> list
    }
}

Output:

<table>
    <tbody>
        <tr><td>3</td></tr>
        <tr><td>4</td></tr>
    </tbody>
</table>
<table>
    <tbody>
        <tr><td>6</td></tr>
    </tbody>
</table>

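【Comments】:

Thanks, it looks good, but you only check two siblings after each header. I don't know how many rows there are between two consecutive header rows, so I have to loop over each header row's siblings and check whether a given sibling is the next header row...
I found the method I need: Element.elementSiblingIndex(), which returns the element's index in its list of siblings. So now it is easy to compare the indices of the rows.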
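
For readers with the same problem, here is a minimal sketch of one way to handle an unknown number of rows between headers. Instead of comparing elementSiblingIndex() values, it walks forward with nextElementSibling() until it reaches the next header row or the end of the table. The class name TableSplitter, the method splitByHeaders, and the header class in the sample markup are hypothetical names chosen for illustration. Unlike the answer's code, this sketch clones each row, so the source document is left unchanged.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class TableSplitter {

    /** Hypothetical helper (not part of jsoup): builds one new table per header row. */
    public static List<Element> splitByHeaders(Document doc, Elements headerRows) {
        // Identity-based set, so the "is this row a header?" test does not
        // depend on how Element.equals is defined in a given jsoup version.
        Set<Element> headers =
                Collections.newSetFromMap(new IdentityHashMap<Element, Boolean>());
        headers.addAll(headerRows);

        List<Element> tables = new ArrayList<>();
        for (Element header : headerRows) {
            Element tbody = doc.createElement("tbody");
            // Walk the following siblings until the next header row or the end.
            for (Element row = header.nextElementSibling();
                 row != null && !headers.contains(row);
                 row = row.nextElementSibling()) {
                // clone() makes a deep copy, so the original table keeps its rows.
                tbody.appendChild(row.clone());
            }
            Element table = doc.createElement("table");
            table.appendChild(tbody);
            tables.add(table);
        }
        return tables;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(
            "<table><tbody>" +
            "<tr><td>1</td></tr>" +
            "<tr class='header'><td>2</td></tr>" + // class assumed, to mark headers
            "<tr><td>3</td></tr>" +
            "<tr><td>4</td></tr>" +
            "<tr class='header'><td>5</td></tr>" +
            "<tr><td>6</td></tr>" +
            "</tbody></table>");

        Elements headerRows = doc.select("tr.header");
        for (Element table : splitByHeaders(doc, headerRows)) {
            System.out.println(table.outerHtml());
        }
    }
}
```

If moving the rows out of the original table is preferred, as in the answer above, the next sibling has to be looked up before each appendChild call, because appendChild detaches the row from its current parent.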