How to merge/collapse child DOM node into parent with BeautifulSoup / lxml?

【Question title】: How to merge/collapse child DOM node into parent with BeautifulSoup / lxml?
【Posted】: 2019-01-23 08:47:47
【Question description】:

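I am writing some HTML pre-processing scripts that clean and tag HTML from a web crawler, for use in a later semantic/link analysis step. I have already filtered the unwanted tags out of the HTML and reduced it to just visible text and <div> / <a> elements.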

I am now trying to write a "collapseDOM()" function that walks the DOM tree and does the following:

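(1) Destroy leaf nodes that have no visible text

(2) Collapse any <div>, replacing it with its child, if it (a) directly contains no visible text and (b) has only a single <div> child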
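The two rules can be read as two predicates on a tag. The following is only a sketch of that reading with BeautifulSoup, not code from the thread: the helper names `child_tags`, `is_empty_leaf` and `is_collapsible_div` and the `id` attributes in the sample are made up for illustration, and rule (2)(b) is assumed to mean "the only child element is a <div>". The stdlib `html.parser` backend is used so that lxml is not required.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def child_tags(tag):
    """Child elements of tag, ignoring text nodes."""
    return [c for c in tag.contents if isinstance(c, Tag)]


def is_empty_leaf(tag):
    """Rule (1): no child elements and no non-whitespace text."""
    return not child_tags(tag) and not tag.get_text(strip=True)


def is_collapsible_div(tag):
    """Rule (2): a <div> with no text of its own whose only child element is a <div>."""
    if tag.name != "div":
        return False
    # "Directly contains text" = a non-whitespace string among the immediate children.
    own_text = any(isinstance(c, NavigableString) and c.strip() for c in tag.contents)
    kids = child_tags(tag)
    return not own_text and len(kids) == 1 and kids[0].name == "div"


# Hypothetical sample; the id attributes exist only to look the tags up.
soup = BeautifulSoup(
    "<div id='outer'> <div id='inner'>text</div> </div><a></a>", "html.parser"
)
print(is_collapsible_div(soup.find(id="outer")))  # True
print(is_collapsible_div(soup.find(id="inner")))  # False
print(is_empty_leaf(soup.a))                      # True
```

Keeping the tests separate from the tree surgery makes it easier to check each rule against the examples before any node is removed.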

For example, if I have the following HTML as input:

<html>
<body>
    <div>
        <div>
             <a href="www.foo.com">not collapsed into empty parent: only divs</a>
        </div>
    </div>

    <div>
        <div>
            <div>
                inner div not collapsed because this contains text 
                <div>some more text ...</div>
                but the outer nested divs do get collapsed
            </div>
        </div>
    </div>

    <div>
        <div>This won't be collapsed into parent because </div>
        <div>there are two children ...</div>
    </div>

</body>
</html>

It should be turned into this "collapsed" version:

<html>
<body>
    <div>
         <a href="www.foo.com">not collapsed into empty parent: only divs</a>
    </div>

    <div>
        inner div not collapsed because this contains text 
        <div>some more text ...</div>
        but the outer nested divs do get collapsed
    </div>


    <div>
        <div>This won't be collapsed into parent because </div>
        <div>there are two children ...</div>
    </div>

</body>
</html>

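I have not been able to figure out how to do this. I tried writing a recursive tree-traversal function using BeautifulSoup's unwrap() and decompose() methods, but that modified the DOM while I was iterating over it, and I could not work out how to make it behave...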
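The modify-while-iterating problem is usually avoided by looping over a copy of the child list rather than the live `contents` list. A minimal sketch, assuming a small made-up document and the stdlib `html.parser` backend:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical input: two empty <a> leaves followed by one with text.
soup = BeautifulSoup("<div><a></a><a></a><a>kept</a></div>", "html.parser")
div = soup.div

# Looping over div.contents directly can skip siblings, because
# decompose() deletes items from the very list being iterated.
for child in list(div.contents):  # iterate over a snapshot instead
    if isinstance(child, Tag) and not child.get_text(strip=True):
        child.decompose()

print(soup)  # <div><a>kept</a></div>
```

The same snapshot idea applies to unwrap(): the loop only ever sees the children that existed when it started.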

Is there a simple way to do what I want? I am open to solutions in either BeautifulSoup or lxml. Thanks!

【Question discussion】:

Tags: html dom merge beautifulsoup lxml

【Solution 1】:

You can start from this and adapt it to your own needs:

import re

from bs4 import BeautifulSoup, NavigableString


def stripTagWithNoText(soup):

    def remove(node):
        for index, item in enumerate(node.contents):
            if not isinstance(item, NavigableString):
                # Child tags plus text nodes that are not whitespace-only.
                currentNodes = [text for text in item.contents if not isinstance(text, NavigableString) or (isinstance(text, NavigableString) and len(re.sub(r'\s+', '', text)) > 0)]
                parentNodes = [text for text in item.parent.contents if not isinstance(text, NavigableString) or (isinstance(text, NavigableString) and len(re.sub(r'\s+', '', text)) > 0)]

                # Only act on a tag that has one significant child and the same name as its parent.
                if len(currentNodes) == 1 and item.name == item.parent.name:
                    if len(parentNodes) > 1:
                        continue
                    if item.name == currentNodes[0].name and len(currentNodes) == 1:
                        item.unwrap()  # replaceWithChildren() is the deprecated alias of unwrap()
                    node.unwrap()

    for tag in soup.find_all():
        remove(tag)
    print(soup)


# `data` holds the input HTML string from the question.
soup = BeautifulSoup(data, "lxml")
stripTagWithNoText(soup)

<html> <body> <div> <a href="www.foo.com">not collapsed into empty parent: only divs</a> </div> <div> inner div not collapsed because this contains text <div>some more text ...</div> but the outer nested divs do get collapsed </div> <div> <div>This won't be collapsed into parent because </div> <div>there are two children ...</div> </div> </body> </html>

【Discussion】:

Actually, this test case seems to fail (the nested divs are not collapsed): <html><body><div><div><div>This is deep down in the divs</div></div></div></body></html>
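One possible way to handle this case is a post-order pass: process the children first, then decide about the node itself, so a chain of nested divs collapses from the inside out. The sketch below is not the answerer's code. The names `collapse_dom` and `has_direct_text` are hypothetical, rule (2)(b) is assumed to mean "the only child element is a <div>", and `html.parser` is used so the output contains no extra wrapper tags.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def has_direct_text(tag):
    """True if tag itself (not a descendant) holds non-whitespace text."""
    return any(isinstance(c, NavigableString) and c.strip() for c in tag.contents)


def collapse_dom(node):
    """Post-order pass applying rules (1) and (2) from the question."""
    # Snapshot the children so decompose()/unwrap() cannot disturb the loop.
    for child in list(node.contents):
        if isinstance(child, Tag):
            collapse_dom(child)

    if node.parent is None:  # the BeautifulSoup root object: nothing to decide
        return

    # Recomputed after recursion, since children may have been removed or unwrapped.
    kids = [c for c in node.contents if isinstance(c, Tag)]
    if not kids and not node.get_text(strip=True):
        node.decompose()  # rule (1): leaf without visible text
    elif (node.name == "div" and not has_direct_text(node)
          and len(kids) == 1 and kids[0].name == "div"):
        node.unwrap()  # rule (2): replace the div with its single div child


html = "<html><body><div><div><div>This is deep down in the divs</div></div></div></body></html>"
soup = BeautifulSoup(html, "html.parser")
collapse_dom(soup)
print(soup)  # <html><body><div>This is deep down in the divs</div></body></html>
```

Because each node is examined only after its children, a parent that becomes an empty leaf once its children are destroyed is also removed in the same pass.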