How to continue filtering beyond a BeautifulSoup find_all ResultSet? Answers

[Question title]: How to continue filtering beyond BeautifulSoup find_all ResultSet?
[Posted]: 2020-06-28 16:38:48
[Question]:

Imagine you are trying to parse something like this with bs4:

<table>
    <tbody>
        <tr>
            <th attr="attr" class="title">
                <a href="link.com/arhwth">Title Text</a>
            </th>

            <th attr="attr" class="title">
                <a href="link.com/dfdsth">Title Text 2</a>
            </th>

            <th attr="attr" class="title">
                <a href="link.com/gsfbf">Title Text 3</a>
            </th>
        </tr>
    </tbody>
    <a href="otherlink.com">Other link to throw you off</a>
</table>

Currently I can get a list of all the th elements with:

import requests
from bs4 import BeautifulSoup

html_content = BeautifulSoup(requests.get("https://parsingwebsite.com").content, "html.parser")
res = html_content.find_all("th", {"attr": "attr"}, class_="title")

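But I only want the title text inside each <a>. (Ideally ["Title Text", "Title Text 2", "Title Text 3"])

Is there a way to keep filtering down through the HTML elements, or otherwise modify the original query, so that it reaches the text inside the links without resorting to regular expressions?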

[Comments]:

Tags: python html python-3.x beautifulsoup

[Solution 1]:

You can use a CSS selector to select the <a> tags under specific <th> tags.

For example, th[attr="attr"].title a selects every <a> tag that sits under a <th> tag with attr="attr" and class="title":

from bs4 import BeautifulSoup

txt = '''<table>
    <tbody>
        <tr>
            <th attr="attr" class="title">
                <a href="link.com/arhwth">Title Text</a>
            </th>

            <th attr="attr" class="title">
                <a href="link.com/dfdsth">Title Text 2</a>
            </th>

            <th attr="attr" class="title">
                <a href="link.com/gsfbf">Title Text 3</a>
            </th>
        </tr>
    </tbody>
    <a href="otherlink.com">Other link to throw you off</a>
</table>'''

soup = BeautifulSoup(txt, 'html.parser')

print([a.text for a in soup.select('th[attr="attr"].title a')])

Prints:

['Title Text', 'Title Text 2', 'Title Text 3']

Or use BeautifulSoup's own API:

print([th.a.text for th in soup.find_all("th", {"attr": "attr"}, class_="title") if th.a])
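The list comprehension above relies on a general pattern that may be worth spelling out. find_all returns a ResultSet, which behaves like a list of Tag objects, so any further filtering is done on each element rather than on the ResultSet itself. The following is a minimal sketch of that pattern, assuming the markup from the question; the helper name titles_from_headers is made up for illustration.

```python
from bs4 import BeautifulSoup

html = '''<table>
    <tbody>
        <tr>
            <th attr="attr" class="title"><a href="link.com/arhwth">Title Text</a></th>
            <th attr="attr" class="title"><a href="link.com/dfdsth">Title Text 2</a></th>
            <th attr="attr" class="title"><a href="link.com/gsfbf">Title Text 3</a></th>
        </tr>
    </tbody>
    <a href="otherlink.com">Other link to throw you off</a>
</table>'''

soup = BeautifulSoup(html, "html.parser")


def titles_from_headers(soup):
    """Hypothetical helper: collect the link text from each matching <th>."""
    headers = soup.find_all("th", {"attr": "attr"}, class_="title")
    titles = []
    for th in headers:
        # Each item in the ResultSet is a Tag, so find() can be called on it.
        link = th.find("a")
        if link is not None:
            titles.append(link.get_text(strip=True))
    return titles


titles = titles_from_headers(soup)
print(titles)
```

Calling find or find_all directly on the ResultSet (rather than on its elements) raises an AttributeError, which is why the loop is needed.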

[Comments]:

Awesome! I wasn't aware that bs4 supported lxml, or that you could filter the way the latter example does.

[Solution 2]:

You can try this:

from bs4 import BeautifulSoup

html = '''<table>
    <tbody>
        <tr>
            <th attr="attr" class="title">
                <a href="link.com">Title Text</a>
            </th>

            <th attr="attr" class="title">
                <a href="link.com">Title Text 2</a>
            </th>

            <th attr="attr" class="title">
                <a href="link.com">Title Text 3</a>
            </th>
        </tr>
    </tbody>
</table>'''


html_code = BeautifulSoup(html, 'html.parser')

a = html_code.find_all('a')
text_a = [i.text for i in a]

print(text_a)

[Comments]:

Sorry, I updated my post because I forgot to include the edge case where the page actually has random links on it.
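To illustrate the edge case raised in this comment, here is a small sketch, assuming markup like the question's (shortened to two header cells, with the stray link included). It compares an unscoped find_all("a") with a CSS selector scoped to the <th> elements; the variable names all_links and scoped_links are illustrative only.

```python
from bs4 import BeautifulSoup

html = '''<table>
    <tbody>
        <tr>
            <th attr="attr" class="title"><a href="link.com/arhwth">Title Text</a></th>
            <th attr="attr" class="title"><a href="link.com/dfdsth">Title Text 2</a></th>
        </tr>
    </tbody>
    <a href="otherlink.com">Other link to throw you off</a>
</table>'''

soup = BeautifulSoup(html, "html.parser")

# Unscoped: picks up every <a> in the document, including the stray one.
all_links = [a.get_text(strip=True) for a in soup.find_all("a")]

# Scoped: only <a> tags that are direct children of the matching <th> tags.
scoped_links = [
    a.get_text(strip=True)
    for a in soup.select('th[attr="attr"].title > a')
]

print(all_links)
print(scoped_links)
```

The unscoped query includes "Other link to throw you off", while the scoped selector returns only the two title links.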