How to wrap a custom <root> element around a whole HTML document? Answers

【Question title】: How to wrap custom <root> element around whole HTML document?
【Posted】: 2015-04-20 07:21:28
【Question description】:

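I have a large number of HTML documents that must be converted to XML. They do not all look exactly the same. For example, the sample below ends with an HTML comment tag rather than with the closing HTML tag.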

Note that this question is related to this one.

Here is my code:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<comment>this is an HTML comment</comment>
<comment>this is another HTML comment</comment>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
        ...
        <comment>here is a comment inside the head tag</comment>
</head>
<body>
        ...
        <comment>Comment inside body tag</comment>
<comment>Another comment inside body tag</comment>
<comment>There could be many comments in each file and scattered, not just 1 in the head and three in the body. This is just a sample.</comment>
</body>
</html>
<comment>This comment is the last line of the file</comment>

I would like to wrap the whole document in a custom tag named <root>. So far, the best I have managed is to wrap <root> around <html>.

root_tag = bs4.Tag(name="root")
soup.html.wrap(root_tag)

How do I position the <root> element so that it wraps the whole document?

【Question comments】:

Does each HTML document have its own file? Or is this scraped data held in memory?
Each HTML document is its own file.

Tags: python html xml beautifulsoup

【Solution 1】:

This is a bit crude, since it simply wraps any given file in <root> </root>.

See if it works for your use case:

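def root_wrap(file):
    # Read the whole document first. Writing to a handle opened with 'r+'
    # overwrites existing bytes instead of inserting before them.
    with open(file, 'r') as fin:
        content = fin.read()
    # Rewrite the file with the original content between <root> and </root>.
    with open(file, 'w') as fout:
        fout.write('<root>')
        fout.write(content)
        fout.write('</root>')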
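
Wrapping the entire file this way also places the DOCTYPE inside <root>, and XML only allows a document type declaration before the root element. Below is a minimal sketch of a string-level variant that leaves the prolog outside the wrapper. It does not use BeautifulSoup. The names wrap_markup and _PROLOG are hypothetical, and the pattern assumes the DOCTYPE has no internal subset (no "[...]" part containing ">").

```python
import re

# Optional leading XML declaration and/or DOCTYPE. These must stay outside
# the root element. Assumption: the DOCTYPE contains no '>' before its end.
_PROLOG = re.compile(
    r'\s*(?:<\?xml[^>]*\?>\s*)?(?:<!DOCTYPE[^>]*>\s*)?',
    re.IGNORECASE,
)


def wrap_markup(markup, tag='root'):
    """Return markup with everything after the prolog wrapped in <tag>."""
    # Every part of the pattern is optional, so match() always succeeds.
    prolog_end = _PROLOG.match(markup).end()
    prolog, body = markup[:prolog_end], markup[prolog_end:]
    return '{}<{}>{}</{}>'.format(prolog, tag, body, tag)
```

To apply it per file, read the file's text, pass it through wrap_markup, and write the result back. The leading and trailing <comment> elements then end up inside <root> together with <html>.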

【Comments】: