【Question title】: How to merge/collapse child DOM node into parent with BeautifulSoup / lxml?
【Posted】: 2019-01-23 08:47:47
【Question】:

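I'm writing some HTML preprocessing scripts that clean and tag HTML from a web crawler for a later semantic/link analysis step. I have already filtered out the unwanted tags and reduced the HTML to visible text and <div> / <a> elements only.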

I'm now trying to write a "collapseDOM()" function that walks the DOM tree and does the following:

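(1) Destroy any leaf node that has no visible text

(2) Collapse any <div>, replacing it with its child, if it (a) contains no visible text directly and (b) has exactly one <div> child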
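
The two rules can be written down as small predicates before worrying about traversal. This is a minimal sketch with BeautifulSoup; the names has_direct_text, is_empty_leaf and is_collapsible are hypothetical helpers, not library functions, and it assumes comments and scripts were already stripped as described above.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def has_direct_text(tag):
    """True if an immediate child of tag is a non-whitespace text node."""
    return any(isinstance(c, NavigableString) and c.strip() for c in tag.contents)


def is_empty_leaf(tag):
    """Rule (1): tag has no element children and no visible text."""
    return tag.find(True) is None and not tag.get_text(strip=True)


def is_collapsible(tag):
    """Rule (2): a <div> with no direct text whose only element child is a <div>."""
    if tag.name != "div" or has_direct_text(tag):
        return False
    children = [c for c in tag.contents if isinstance(c, Tag)]
    return len(children) == 1 and children[0].name == "div"


# Hypothetical sample markup, only to exercise the predicates.
sample = BeautifulSoup(
    "<div id='outer'><div id='inner'>text</div></div><div id='empty'> </div>",
    "html.parser",
)
print(is_collapsible(sample.find(id="outer")))  # wrapper around a single div
print(is_empty_leaf(sample.find(id="empty")))   # whitespace-only leaf
```

Keeping the checks separate from the tree edits makes each rule easy to test on its own.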

For example, given the following HTML as input:

<html>
<body>
    <div>
        <div>
             <a href="www.foo.com">not collapsed into empty parent: only divs</a>
        </div>
    </div>

    <div>
        <div>
            <div>
                inner div not collapsed because this contains text 
                <div>some more text ...</div>
                but the outer nested divs do get collapsed
            </div>
        </div>
    </div>

    <div>
        <div>This won't be collapsed into parent because </div>
        <div>there are two children ...</div>
    </div>

</body>

it should become this "collapsed" version:

<html>
<body>
    <div>
         <a href="www.foo.com">not collapsed into empty parent: only divs</a>
    </div>

    <div>
        inner div not collapsed because this contains text 
        <div>some more text ...</div>
        but the outer nested divs do get collapsed
    </div>


    <div>
        <div>This won't be collapsed into parent because </div>
        <div>there are two children ...</div>
    </div>

</body>

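I haven't been able to figure out how to do this. I tried writing a recursive tree-traversal function using BeautifulSoup's unwrap() and decompose() methods, but it modified the DOM while iterating over it, and I couldn't work out how to make it work...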
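
One way around the modify-while-iterating problem is to loop over a snapshot of the tags instead of the live tree, and to visit children before parents. The sketch below is an assumption about how that could look, not a tested library recipe: collapse_dom is a hypothetical name, and it relies on find_all() returning a list, so reversing it gives an order in which every descendant comes before its ancestors.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def collapse_dom(soup):
    """Apply rules (1) and (2) in place; a sketch, not a tuned implementation."""
    # find_all() returns a list-like snapshot, so editing the tree inside the
    # loop does not disturb the iteration. Reversing document order means every
    # descendant is handled before its ancestors.
    for tag in reversed(soup.find_all(["div", "a"])):
        if not tag.get_text(strip=True):
            tag.decompose()  # rule (1): nothing visible beneath this tag
            continue
        if tag.name != "div":
            continue
        has_text = any(isinstance(c, NavigableString) and c.strip()
                       for c in tag.contents)
        kids = [c for c in tag.contents if isinstance(c, Tag)]
        if not has_text and len(kids) == 1 and kids[0].name == "div":
            tag.unwrap()  # rule (2): replace the wrapper with its children
    return soup


html = "<body><div><div><div>text</div></div><div></div></div></body>"
print(collapse_dom(BeautifulSoup(html, "html.parser")))
# prints: <body><div>text</div></body>
```

Because empty descendants are removed before their parent is examined, a wrapper that held only empty tags becomes an empty leaf itself by the time it is visited. Whitespace-only text nodes left behind by unwrap() are not cleaned up here.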

Is there a simple way to do what I want? I'm open to solutions in either BeautifulSoup or lxml. Thanks!
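
For the lxml route, the same bottom-up idea can be sketched with lxml.html, whose elements provide drop_tree() (remove the element and its subtree) and drop_tag() (remove the tag but keep its children and text). The function name collapse_tree is hypothetical, and the sketch assumes a full document with <html>/<body> and no comment nodes.

```python
import lxml.html


def collapse_tree(root):
    """Apply rules (1) and (2) to an lxml.html tree in place."""
    # list() snapshots the elements; reversed document order visits children first.
    for el in reversed(list(root.iter("div", "a"))):
        if el.getparent() is None:
            continue  # never drop the root element itself
        if not el.text_content().strip():
            el.drop_tree()  # rule (1): no visible text beneath this element
            continue
        if el.tag != "div":
            continue
        # Direct text lives in el.text and in the .tail of each child.
        direct = (el.text or "") + "".join(c.tail or "" for c in el)
        if not direct.strip() and len(el) == 1 and el[0].tag == "div":
            el.drop_tag()  # rule (2): merge the single child into the parent
    return root


doc = lxml.html.fromstring(
    "<html><body><div><div><div>deep</div></div></div></body></html>"
)
collapse_tree(doc)
print(lxml.html.tostring(doc, encoding="unicode"))
```

In lxml, text that follows a child element is stored on that child's tail attribute, which is why the direct-text check has to look at both el.text and the tails.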

【Question comments】:

    Tags: html dom merge beautifulsoup lxml


    【Solution 1】:

    You can start from this and adapt it to your needs:

    import re

    from bs4 import BeautifulSoup, NavigableString


    def stripTagWithNoText(soup):

        def significant(nodes):
            # Keep tags, plus text nodes that are not whitespace-only.
            return [n for n in nodes
                    if not isinstance(n, NavigableString)
                    or len(re.sub(r'\s+', '', n)) > 0]

        def remove(node):
            for item in node.contents:
                if not isinstance(item, NavigableString):
                    currentNodes = significant(item.contents)
                    parentNodes = significant(item.parent.contents)

                    # Only act on a tag with a single significant child whose
                    # name matches its parent's (e.g. a div inside a div).
                    if len(currentNodes) == 1 and item.name == item.parent.name:
                        if len(parentNodes) > 1:
                            continue  # the parent has other content; leave it
                        if item.name == currentNodes[0].name:
                            item.unwrap()  # replaceWithChildren() is the older alias
                        node.unwrap()

        for tag in soup.find_all():
            remove(tag)
        print(soup)


    # data is assumed to hold the input HTML from the question.
    soup = BeautifulSoup(data, "lxml")
    stripTagWithNoText(soup)
    

    <html> <body> <div> <a href="www.foo.com">not collapsed into empty parent: only divs</a> </div> <div> inner div not collapsed because this contains text <div>some more text ...</div> but the outer nested divs do get collapsed </div> <div> <div>This won't be collapsed into parent because </div> <div>there are two children ...</div> </div> </body> </html>

    【Comments】:

    • Actually, this test case seems to fail (the nested divs are not collapsed): <html><body><div><div><div>This is deep down in the divs</div></div></div></body></html>