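【Posted】:2014-02-24 16:45:49
【Question】:
I am trying to write a small function that wraps the implicit sections of an HTML document in section tags. I am trying to do this with lxml.etree.
Say my input is: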
<html>
  <head></head>
  <body>
    <h1>title</h1>
    <p>some text</p>
    <h1>title</h1>
    <p>some text</p>
  </body>
</html>
I want to end up with:
<html>
  <head></head>
  <body>
    <section>
      <h1>title</h1>
      <p>some text</p>
    </section>
    <section>
      <h1>title</h1>
      <p>some text</p>
    </section>
  </body>
</html>
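This is what I have so far:
import re
import lxml.etree

def outline(tree):
    pattern = re.compile(r'^h(\d)')
    section = None
    for child in tree.iterchildren():
        tag = child.tag
        # Skip comment nodes, whose tag is not a string
        if tag is lxml.etree.Comment:
            continue
        match = pattern.match(tag.lower())
        # If a header tag is found
        if match:
            # Heading level (1 for h1, 2 for h2, ...); not used yet
            depth = int(match.group(1))
            # Insert the previous section before this header
            if section is not None:
                child.addprevious(section)
            # Start a new section and move the header into it
            section = lxml.etree.Element('section')
            section.append(child)
        else:
            # Move non-header content into the current section, if any
            if section is not None:
                section.append(child)
            else:
                pass
        # Recurse into the child's own children
        if child is not None:
            outline(child)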
I call it like this:
outline(tree.find('body'))
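For the flat case shown at the top, one possible approach (a minimal sketch, not the code from the question) is to snapshot the children with list() before moving anything, and to insert each new section at the heading's position before filling it. The names wrap_flat_sections and HEADINGS are hypothetical, and the input is parsed as well-formed XML with lxml.etree.fromstring to keep the example self-contained.

```python
import lxml.etree

# Hypothetical helper names; only lxml.etree calls come from the library.
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

def wrap_flat_sections(parent):
    """Wrap each heading and the siblings that follow it in a <section>."""
    section = None
    # list() snapshots the children, so moving them does not disturb the loop.
    for child in list(parent):
        # Comments and processing instructions have a non-string tag.
        is_heading = isinstance(child.tag, str) and child.tag.lower() in HEADINGS
        if is_heading:
            section = lxml.etree.Element("section")
            child.addprevious(section)  # place the wrapper where the heading is
        if section is not None:
            section.append(child)  # append() moves the element into the wrapper

doc = lxml.etree.fromstring(
    "<html><head></head><body>"
    "<h1>title</h1><p>some text</p>"
    "<h1>title</h1><p>some text</p>"
    "</body></html>"
)
body = doc.find("body")
wrap_flat_sections(body)
print(lxml.etree.tostring(body, pretty_print=True, encoding="unicode"))
```

Content that appears before the first heading is left where it is, because no section is open yet at that point.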
But at the moment my function does not work with subheadings, e.g.:
<section>
  <h1>ONE</h1>
  <section>
    <h3>TOO Deep</h3>
  </section>
  <section>
    <h2>Level 2</h2>
  </section>
</section>
<section>
  <h1>TWO</h1>
</section>
Thanks
【Discussion】:
Tags: python html lxml elementtree
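One way to get nesting like the ONE / TOO Deep / Level 2 / TWO example in the question is to keep a stack of open sections keyed by heading level. This is only a sketch under assumptions: outline_nested and HEADING are hypothetical names, only the direct children of the given element are processed (no recursion into other containers), and namespaced tags are not handled.

```python
import re
import lxml.etree

# Hypothetical names for illustration; matches h1..h6 in any letter case.
HEADING = re.compile(r"^h([1-6])$", re.IGNORECASE)

def outline_nested(parent):
    """Nest the direct children of parent into <section>s by heading level."""
    # Stack of (heading depth, container); depth 0 is the parent itself.
    stack = [(0, parent)]
    # list() snapshots the children so they can be moved during the loop.
    for child in list(parent):
        match = HEADING.match(child.tag) if isinstance(child.tag, str) else None
        if match:
            depth = int(match.group(1))
            # Close every open section at the same or a deeper level.
            while stack[-1][0] >= depth:
                stack.pop()
            section = lxml.etree.SubElement(stack[-1][1], "section")
            stack.append((depth, section))
        # Every child is moved, in order, to the end of the innermost open
        # container, so the original document order is preserved.
        stack[-1][1].append(child)

body = lxml.etree.fromstring(
    "<body><h1>ONE</h1><h3>TOO Deep</h3><h2>Level 2</h2><h1>TWO</h1></body>"
)
outline_nested(body)
print(lxml.etree.tostring(body, pretty_print=True, encoding="unicode"))
```

With this input, the h3 opens a subsection under the first h1 even though it skips a level, and the following h2 closes that subsection and opens a sibling one.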