Converting HTML text to plain text by stripping the HTML tags

1. Using lxml:

import lxml.etree
import lxml.html


# To read the HTML from a file instead of the inline sample below, use:
# with open('/tmp/hzh/a.html', 'r') as file:
#     html_str = file.read()
html_str = '<p>hzh。</p>   \n  <p> l1</p>'
root = lxml.html.fromstring(html_str)

# Optionally remove elements that browsers do not normally render:
# comments, <script> and <head>. Append any other tag names you don't want.
# with_tail=False keeps the text that follows a removed element.
lxml.etree.strip_elements(root, lxml.etree.Comment, "script", "head", with_tail=False)

# serialize only the text content, dropping all tags
result_str = lxml.html.tostring(root, method="text", encoding='unicode')
print(result_str)

2. Using XPath's string() function:
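
A minimal sketch of this approach, assuming lxml is again used to parse the HTML and evaluate the XPath expression. The helper name html_to_text is hypothetical, and wrapping string(.) in normalize-space() is an extra step added here to tidy the whitespace; it is not required by the method itself.

```python
import lxml.html


def html_to_text(html_str):
    """Return the text content of an HTML fragment (hypothetical helper)."""
    root = lxml.html.fromstring(html_str)
    # string(.) concatenates every descendant text node of the context node;
    # normalize-space() then trims the ends and collapses whitespace runs.
    return root.xpath('normalize-space(string(.))')


html_str = '<p>hzh。</p>   \n  <p> l1</p>'
result_str = html_to_text(html_str)
print(result_str)
```

Note that string(.) also picks up the text inside elements such as script, so it may be worth stripping those first, as in method 1.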

For reference articles, see 1 and 2.