How to extract an HTML table following a specific heading? Answers

【Question title】: How to extract HTML table following a specific heading?
【Posted】: 2019-01-26 21:24:30
【Question description】:

I am using BeautifulSoup to parse an HTML file. I have an HTML file similar to this:

<h3>Unimportant heading</h3>
<table class="foo">
  <tr>
    <td>Key A</td>
  </tr>
  <tr>
    <td>A value I don't want</td>
  </tr>
</table>


<h3>Unimportant heading</h3>
<table class="foo">
  <tr>
    <td>Key B</td>
  </tr>
  <tr>
    <td>A value I don't want</td>
  </tr>
</table>


<h3>THE GOOD STUFF</h3>
<table class="foo">
  <tr>
    <td>Key C</td>
  </tr>
  <tr>
    <td>I WANT THIS STRING</td>
  </tr>
</table>


<h3>Unimportant heading</h3>
<table class="foo">
  <tr>
    <td>Key A</td>
  </tr>
  <tr>
    <td>A value I don't want</td>
  </tr>
</table>

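I want to extract the string "I WANT THIS STRING". The ideal solution would be to get the first table that follows the h3 heading named "THE GOOD STUFF". I don't know how to do this with BeautifulSoup: I only know how to extract a table with a specific class, or a table nested inside a particular tag, but not a table following a particular tag.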

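I think a fallback solution could use the string "Key C", assuming it is unique (it almost certainly is) and appears only in that table, but I would feel better using the specific h3 heading.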

【Question comments】:

Tags: python python-3.x beautifulsoup html-parsing

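【Solution 1】:

Following the logic of @Zroq's answer to another question, this code gives you the table that comes after the heading you specify ("THE GOOD STUFF"). Note that I simply put all of your HTML in a variable named "html".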

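import re

from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup(html, "lxml")

# Find every <h3> whose text matches the target heading.
# (Older bs4 releases spell the string= argument as text=.)
for header in soup.find_all('h3', string=re.compile('THE GOOD STUFF')):
    nextNode = header
    while True:
        # Walk forward through the heading's siblings.
        nextNode = nextNode.next_sibling
        if nextNode is None:
            break
        # Skip whitespace/text nodes; only tags are of interest.
        if isinstance(nextNode, Tag):
            # Stop once the next heading is reached.
            if nextNode.name == "h3":
                break
            print(nextNode)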

Output:

<table class="foo">
<tr>
<td>Key C</td>
</tr>
<tr>
<td>I WANT THIS STRING</td>
</tr>
</table>
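
If only the first table after the heading is needed, the sibling walk above can probably be shortened with `find_next_sibling`. The following is a minimal sketch, not part of the original answer: the helper name `table_after_heading` and the trimmed-down `html` sample are assumptions made for illustration, and `html.parser` is used so that no extra parser needs to be installed.

```python
import re

from bs4 import BeautifulSoup

# Trimmed-down stand-in for the question's HTML (an assumption for this sketch).
html = """
<h3>Unimportant heading</h3>
<table class="foo"><tr><td>Key B</td></tr><tr><td>A value I don't want</td></tr></table>
<h3>THE GOOD STUFF</h3>
<table class="foo"><tr><td>Key C</td></tr><tr><td>I WANT THIS STRING</td></tr></table>
<h3>Unimportant heading</h3>
<table class="foo"><tr><td>Key A</td></tr><tr><td>A value I don't want</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")


def table_after_heading(soup, heading_text):
    """Hypothetical helper: first <table> sibling after the matching <h3>, or None."""
    header = soup.find("h3", string=re.compile(heading_text))
    if header is None:
        return None
    # find_next_sibling skips text nodes and non-matching tags.
    return header.find_next_sibling("table")


table = table_after_heading(soup, "THE GOOD STUFF")
cells = [td.get_text(strip=True) for td in table.find_all("td")]
wanted = cells[-1]  # the last cell of that table holds the value
print(wanted)
```

Unlike the loop above, this sketch does not stop at the next h3, so if the target heading had no table of its own it would return the table that belongs to a later heading.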

Cheers!

【Comments】:

Thanks! The next sibling is exactly what I was looking for.

【Solution 2】:

The docs state that if you don't want to use find_all, you can do this:

for sibling in soup.a.next_siblings:
    print(repr(sibling))
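
The snippet from the docs iterates over the siblings of the first `a` tag, which the question's HTML does not contain. A rough adaptation to the question might look like the sketch below. It assumes the heading text matches "THE GOOD STUFF" exactly, and the shortened `html` sample and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup, Tag

# Shortened stand-in for the question's HTML (an assumption for this sketch).
html = """<h3>Unimportant heading</h3>
<table class="foo"><tr><td>Key B</td></tr></table>
<h3>THE GOOD STUFF</h3>
<table class="foo"><tr><td>Key C</td></tr><tr><td>I WANT THIS STRING</td></tr></table>
<h3>Unimportant heading</h3>
<table class="foo"><tr><td>Key A</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
header = soup.find("h3", string="THE GOOD STUFF")

first_table = None
for sibling in header.next_siblings:
    # Whitespace between tags appears as text nodes; skip anything that is not a tag.
    if not isinstance(sibling, Tag):
        continue
    # Reaching the next heading means this section has no table.
    if sibling.name == "h3":
        break
    if sibling.name == "table":
        first_table = sibling
        break

rows = first_table.find_all("tr")
value = rows[1].td.get_text(strip=True)  # second row holds the wanted string
print(value)
```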

【Comments】:

【Solution 3】:

I'm sure there are many more efficient ways to do this, but this is what I can think of right now:

from bs4 import BeautifulSoup

# Read the HTML file; the with-block closes it automatically.
with open("/Users/Downloads/train.html", 'r') as f:
    html_data = f.read()
soup = BeautifulSoup(html_data, 'html.parser')
all_td = soup.find_all("td")
flag = 'no_print'
for td in all_td:
    # Print the first cell that follows the 'Key C' cell, then stop.
    if flag == 'print':
        print(td.text)
        break
    if td.text == 'Key C':
        flag = 'print'

Output:

I WANT THIS STRING
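
The flag-based loop can likely be condensed with `find_next`, which returns the next matching element in document order. This is only a sketch of the same "Key C" fallback, under the question's own assumption that "Key C" is unique; the inline `html` sample and the variable names are invented for illustration.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the file contents (an assumption for this sketch).
html = """<table class="foo"><tr><td>Key B</td></tr><tr><td>A value I don't want</td></tr></table>
<table class="foo"><tr><td>Key C</td></tr><tr><td>I WANT THIS STRING</td></tr></table>"""

soup = BeautifulSoup(html, "html.parser")

# Locate the cell whose text is exactly 'Key C'.
key_cell = soup.find("td", string="Key C")

# find_next returns the next <td> in document order, or None if there is none.
value_cell = key_cell.find_next("td") if key_cell is not None else None
result = value_cell.get_text(strip=True) if value_cell is not None else None
print(result)
```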

【Comments】: