Extract heading from various tags using Beautiful Soup

【Question title】: Extract heading from various tags using Beautiful Soup
【Posted】: 2019-11-07 04:08:40
【Question】:

How can I use Beautiful Soup to extract the table headings from the HTML below, which contains two kinds of table layout?

<body>
    <p>some other data 1</p>
    <p>Table1 heading</p>
    <div></div>
    <div>
        <div><table width="15%"><tbody>
            <tr>
                <td><p>data1_00</p></td>
                <td><p>data1_01</p></td>
            </tr>
            <tr>
                <td><p>data1_10</p></td>
                <td><p>data1_11</p></td>
            </tr>
        </tbody></table></div>
    </div>

    <br><br>

    <div>some other data 2</div>
    <div>Table2 heading</div>
    <div>
        <div><table width="15%"><tbody>
            <tr>
                <td><p>data2_00</p></td>
                <td><p>data2_01</p></td>
            </tr>
            <tr>
                <td><p>data2_10</p></td>
                <td><p>data2_11</p></td>
            </tr>
        </tbody></table></div>
    </div>
</body>

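In the first table the heading is inside a <p> tag, and in the second it is inside a <div> tag. The first table also has an empty <div> tag just above it.
How can I extract both table headings?

Currently I use table.find_previous('div') to search for the previous <div> above the current table, and the text inside it is saved as the heading.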

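from bs4 import BeautifulSoup
import urllib.request

# `url` is the address of the page being scraped (not shown in the question).
htmlpage = urllib.request.urlopen(url)
page = BeautifulSoup(htmlpage, "html.parser")
all_divtables = page.find_all('table')
for table in all_divtables:
    curr_div = table
    while True:
        # Step back to the nearest preceding <div>; <p> tags are never considered.
        curr_div = curr_div.find_previous('div')
        if len(curr_div.find_all('table')) > 0:
            # This <div> wraps a table, so keep looking further back.
            continue
        else:
            heading = curr_div.text.strip()
            print(heading)
            break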

Desired output:
Table1 heading
Table2 heading

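【Comments】:

Can you post your Python code?
@Wonka, code added.
Can you now post your desired output? find_all("tr") seems like a better fit; I'll wait for your desired output to see what you need.
Check @Andrej Kesely's answer, it looks like a good solution.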

Tags: python html python-3.x beautifulsoup scrapy

【Solution 1】:

You can call find_previous() with a lambda argument that selects the nearest preceding tag which does not contain a table and whose text is not empty:

data = '''<body>
    <p>some other data 1</p>
    <p>Table1 heading</p>
    <div></div>
    <div>
        <div><table width="15%"><tbody>
            <tr>
                <td><p>data1_00</p></td>
                <td><p>data1_01</p></td>
            </tr>
            <tr>
                <td><p>data1_10</p></td>
                <td><p>data1_11</p></td>
            </tr>
        </tbody></table></div>
    </div>

    <br><br>

    <div>some other data 2</div>
    <div>Table2 heading</div>
    <div>
        <div><table width="15%"><tbody>
            <tr>
                <td><p>data2_00</p></td>
                <td><p>data2_01</p></td>
            </tr>
            <tr>
                <td><p>data2_10</p></td>
                <td><p>data2_11</p></td>
            </tr>
        </tbody></table></div>
    </div>

    <div>some other data 3</div>
    <div>Table3 heading</div>
    <div>
        <div><table width="15%"><tbody>
            <tr>
                <td><p>data2_00z</p></td>
                <td><p>data2_01z</p></td>
            </tr>
            <tr>
                <td><p>data2_10z</p></td>
                <td><p>data2_11z</p></td>
            </tr>
        </tbody></table></div>
    </div>
    <div>
        <div><table width="15%"><tbody>
            <tr>
                <td><p>data2_00x</p></td>
                <td><p>data2_01x</p></td>
            </tr>
            <tr>
                <td><p>data2_10x</p></td>
                <td><p>data2_11x</p></td>
            </tr>
        </tbody></table></div>
    </div>

</body>'''

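from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')

for table in soup.select('table'):
    # Nearest preceding tag that has text and does not wrap a table.
    heading = table.find_previous(lambda t: not t.find('table') and t.text.strip() != '')
    # A match inside an earlier table is cell data: this table has no heading of its own.
    if heading is None or heading.find_parents('table'):
        continue
    print(heading.text.strip())
    print('*' * 80)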

Prints:

Table1 heading
********************************************************************************
Table2 heading
********************************************************************************
Table3 heading
********************************************************************************
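
To see why the filter has to reject tags that contain a table, it may help to look at what a backwards search from a table actually visits. The sketch below uses a shortened, hypothetical version of the first table's markup and the built-in html.parser; it lists the tags reached through previous_elements, nearest first.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, illustrative markup modelled on the first table in the question.
html = """<body><p>Table1 heading</p><div></div>
<div><div><table><tr><td><p>data1_00</p></td></tr></table></div></div></body>"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Names of the tags met while walking backwards from the table, nearest first.
# Text nodes are skipped so that only tags remain.
visited = [el.name for el in table.previous_elements if isinstance(el, Tag)]
print(visited[:4])  # ['div', 'div', 'div', 'p']
```

The first two entries are the <div> wrappers around the table itself and the third is the empty <div>, so the heading <p> is only reached once wrappers and empty tags are filtered out.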

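【Comments】:

Thanks, it works for all cases except two consecutive tables, where the second table has no heading of its own.
@Shijith Updated for the case of two consecutive tables under the same heading.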
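
The consecutive-table case discussed above can be made explicit by returning one entry per table, with None where no heading is found. This is only a sketch under assumptions: the function name table_headings and the sample markup are made up for illustration, and a table is treated as heading-less whenever the nearest text-bearing tag before it sits inside an earlier table.

```python
from bs4 import BeautifulSoup


def table_headings(html):
    """Return one entry per <table>: its heading text, or None if none is found."""
    soup = BeautifulSoup(html, "html.parser")
    headings = []
    for table in soup.find_all("table"):
        # Nearest preceding tag that has text and does not wrap a table.
        candidate = table.find_previous(
            lambda t: not t.find("table") and t.get_text(strip=True) != ""
        )
        # A candidate inside an earlier table is cell data, not a heading.
        if candidate is None or candidate.find_parent("table") is not None:
            headings.append(None)
        else:
            headings.append(candidate.get_text(strip=True))
    return headings


# Hypothetical sample: one heading followed by two consecutive tables.
sample = """
<div>Table3 heading</div>
<div><div><table><tr><td><p>a</p></td></tr></table></div></div>
<div><div><table><tr><td><p>b</p></td></tr></table></div></div>
"""
print(table_headings(sample))  # ['Table3 heading', None]
```

Keeping the None entries preserves the one-to-one mapping between tables and headings, so a caller can decide whether a heading-less table should inherit the previous heading or be left unlabeled.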

【Solution 2】:

urldata='''<body>
    <p>some other data 1</p>
    <p>Table1 heading</p>
    <div></div>
    <div>
        <div><table width="15%"><tbody>
            <tr>
                <td><p>data1_00</p></td>
                <td><p>data1_01</p></td>
            </tr>
            <tr>
                <td><p>data1_10</p></td>
                <td><p>data1_11</p></td>
            </tr>
        </tbody></table></div>
    </div>

    <br><br>

    <div>some other data 2</div>
    <div>Table2 heading</div>
    <div>
        <div><table width="15%"><tbody>
            <tr>
                <td><p>data2_00</p></td>
                <td><p>data2_01</p></td>
            </tr>
            <tr>
                <td><p>data2_10</p></td>
                <td><p>data2_11</p></td>
            </tr>
        </tbody></table></div>
    </div>
</body>'''

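import re
from bs4 import BeautifulSoup

# Parse the sample markup defined above as `urldata`.
soup = BeautifulSoup(urldata, 'lxml')

# Collect every text node in <body> that contains the word "heading".
results = soup.body.find_all(string=re.compile('heading'))
for result in results:
    print(result)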

Output:

Table1 heading
Table2 heading

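【Comments】:

The heading may or may not contain the text heading.
Replace the text with your own.
I have to use this to scrape web pages, so there is no exact text I'm looking for. All I know is that the tag before the table (if any) contains the heading.
Then you can pass any heading tag to findAll(), such as thead, for example soup.body.findAll('thead'), or any other tag.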
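
The tag-name approach suggested in the last comment only helps when the page marks its headings up inside the table. The markup in the question does not, so the sketch below uses a hypothetical table with <caption> and <thead> elements, and the helper name describe_tables is likewise made up for illustration.

```python
from bs4 import BeautifulSoup


def describe_tables(html):
    """Return (caption text or None, list of header cell texts) for each table."""
    soup = BeautifulSoup(html, "html.parser")
    described = []
    for table in soup.find_all("table"):
        caption = table.find("caption")
        title = caption.get_text(strip=True) if caption else None
        # Header cells, if the table has any <th> elements.
        header_cells = [th.get_text(strip=True) for th in table.find_all("th")]
        described.append((title, header_cells))
    return described


# Hypothetical markup: unlike the question's HTML, this table labels itself.
html = """
<table>
  <caption>Prices</caption>
  <thead><tr><th>Item</th><th>Cost</th></tr></thead>
  <tbody><tr><td>Tea</td><td>2</td></tr></tbody>
</table>
"""
print(describe_tables(html))  # [('Prices', ['Item', 'Cost'])]
```

For pages like the one in the question, where the heading is a plain <p> or <div> before the table, searching backwards from each table remains the more applicable technique.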