Find all tags of a certain class only after a tag with certain text (answers)

【Question title】: Find all tags of certain class only after tag with certain text
【Posted】: 2015-10-02 00:43:32
【Question】:

I have one big, long table in HTML, so the rows are not nested inside one another. It looks like this:

<tr>
    <td>A</td>
</tr>
<tr>
    <td class="x">...</td>
    <td class="x">...</td>
    <td class="x">...</td>
    <td class="x">...</td>
</tr>
<tr>
    <td class ="y">...</td>
    <td class ="y">...</td>
    <td class ="y">...</td>
    <td class ="y">...</td>
</tr>
<tr>
    <td>B</td>
</tr>
<tr>
    <td class="x">...</td>
    <td class="x">...</td>
    <td class="x">...</td>
    <td class="x">...</td>
</tr>
<tr>
    <td class ="y">I want this</td>
    <td class ="y">and this</td>
    <td class ="y">and this</td>
    <td class ="y">and this</td>
</tr>

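So first I want to search the tree to find "B". Then I want to grab the text of every td tag with class y that comes after B but before the next row of the table that starts over with "C".

I tried this: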

results = soup.find_all('td')
for result in results:
    if result.string == "B":
        print(result.string)

This gets me the string B that I am looking for. But now I am trying to find everything after it, and I am not getting what I want.

for results in soup.find_all('td'):
    if results.string == 'B':
        a = results.find_next('td',class_='y')

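This gives me the next td after "B", which is what I want, but I only seem to get the first td tag. I want to get all the tags with class y that come after "B" but before "C" (C is not shown in the HTML, but it follows the same pattern), and I want to add them to a list.

My resulting list would be: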

[['I want this'],['and this'],['and this'],['and this']]

【Comments】:

Tags: python html beautifulsoup html-parsing

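【Solution 1】:

Basically, you need to locate the td element that contains the text B and take its parent tr. This is your starting point.

Then, check each following tr sibling of that row using find_next_siblings():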

start = soup.find("td", text="B").parent
for tr in start.find_next_siblings("tr"):
    # exit if reached C
    if tr.find("td", text="C"):
        break

    # get all tds with a desired class
    tds = tr.find_all("td", class_="y")
    for td in tds:
        print(td.get_text())

Tested on your sample data, it prints:

I want this
and this
and this
and this
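
For anyone who wants to run this end to end and collect the values into a list (as the question asks) rather than print them, here is a minimal self-contained sketch. The sample markup is a trimmed version of the table from the question with a hypothetical "C" section appended, and the helper name collect_after is only illustrative. It uses the string= argument, which assumes Beautiful Soup 4.4.0 or later; older versions call the same argument text=, as in the snippet above.

```python
from bs4 import BeautifulSoup

# Trimmed version of the question's table; the "C" section is a
# hypothetical addition so the stop condition has something to hit.
html = """
<table>
<tr><td>A</td></tr>
<tr><td class="x">a1</td><td class="x">a2</td></tr>
<tr><td class="y">skip</td><td class="y">skip</td></tr>
<tr><td>B</td></tr>
<tr><td class="x">b1</td><td class="x">b2</td></tr>
<tr><td class="y">I want this</td><td class="y">and this</td></tr>
<tr><td>C</td></tr>
<tr><td class="y">not this</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def collect_after(soup, start_text, stop_text, class_name="y"):
    """Collect text of <td class=class_name> cells between two marker rows."""
    start = soup.find("td", string=start_text)
    if start is None:
        return []  # start marker not present

    collected = []
    # Walk the rows that follow the row containing the start marker.
    for tr in start.parent.find_next_siblings("tr"):
        if tr.find("td", string=stop_text):
            break  # reached the next section
        collected.extend(
            td.get_text() for td in tr.find_all("td", class_=class_name)
        )
    return collected


result = collect_after(soup, "B", "C")
print(result)
```

This produces a flat list of strings. If the nested shape shown in the question is really needed, wrapping each item, for example [[text] for text in result], would give it.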

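【Discussion】:

Thanks for the reply. This worked for me. However, I got lucky, because each time the rows I needed were the last of the siblings. Since I don't really know what "C" will be, and would rather have this be dynamic, how can I improve it so that it works no matter what? That is, instead of breaking out of the loop when the text is "C", how can I check for a row whose text is simply not equal to "B"?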
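
One possible way to avoid hard-coding "C" is to stop at the next row that looks like a section header, whatever its text is. The sketch below assumes, based only on the sample markup in the question, that header rows are the ones whose td cells carry no class attribute; is_header_row and collect_section are hypothetical helper names, and the "Anything" header in the sample data is made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample data modelled on the question; the header after "B" is
# deliberately not "C" to show that its text does not matter.
html = """
<table>
<tr><td>A</td></tr>
<tr><td class="y">skip</td><td class="y">skip</td></tr>
<tr><td>B</td></tr>
<tr><td class="x">b1</td><td class="x">b2</td></tr>
<tr><td class="y">I want this</td><td class="y">and this</td></tr>
<tr><td>Anything</td></tr>
<tr><td class="y">not this</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def is_header_row(tr):
    # Assumption: a header row has at least one <td>, and none of its
    # <td> cells has a class attribute (true for the question's sample).
    tds = tr.find_all("td")
    return bool(tds) and all(not td.has_attr("class") for td in tds)


def collect_section(soup, start_text, class_name="y"):
    """Collect <td class=class_name> text from start_text to the next header."""
    start = soup.find("td", string=start_text)
    if start is None:
        return []

    collected = []
    for tr in start.parent.find_next_siblings("tr"):
        if is_header_row(tr):
            break  # next section begins here, regardless of its label
        collected.extend(
            td.get_text() for td in tr.find_all("td", class_=class_name)
        )
    return collected


section_b = collect_section(soup, "B")
print(section_b)
```

If the real table marks headers differently (a th element, a distinct class, a colspan), only is_header_row would need to change. For the last section in the table the loop simply runs to the end, since there is no further header to stop at.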