How to extract text contents without markup tags from an HTML page? [duplicate]

【Question title】: How to extract text contents without markup tags from an HTML page? [duplicate]
【Posted】: 2019-08-20 10:01:15
【Question description】:

I want to extract only the text from an HTML page, without the markup. How can I do this in Python (preferably) or JavaScript?

For the following code:

<div id = #one>
 OneDivision
 <div id = #two>TwoDivision</div>
 <span>SpanElement</span>
</div>

My output should be: OneDivision TwoDivision SpanElement

【Question comments】:

crummy.com/software/BeautifulSoup/bs4/doc can do this job easily.

Tags: javascript python html css

【Solution 1】:

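from bs4 import BeautifulSoup


def extract_text(html):
    """Return the text of an HTML string, or None if it has no <body>."""
    html_doc = BeautifulSoup(html, 'lxml').body

    if html_doc is None:
        return None

    # Remove <script> and <style> elements so their contents
    # do not end up in the extracted text.
    for tag in html_doc.select('script, style'):
        tag.decompose()

    # One text node per line.
    return html_doc.get_text(separator='\n')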
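The same idea as the snippet above (ignore script and style content, keep everything else) can also be sketched with only the Python standard library, for cases where installing BeautifulSoup and lxml is not an option. This is a rough sketch, not a drop-in replacement: the names `TextExtractor`, `extract_text`, and `sample` are made up for illustration, and unlike the snippet above it strips whitespace around each text node and does not restrict itself to the body element.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that is outside <script>/<style>.
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html, separator="\n"):
    """Hypothetical helper: join all collected text nodes with `separator`."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return separator.join(parser.chunks)


# Illustrative input based on the markup in the question.
sample = """<body>
<style>p { color: red; }</style>
<div id="one">
 OneDivision
 <div id="two">TwoDivision</div>
 <span>SpanElement</span>
</div>
<script>console.log("ignored");</script>
</body>"""

print(extract_text(sample, separator=" "))
```

The standard-library parser is less forgiving of badly broken markup than lxml, so BeautifulSoup remains the more robust choice for real-world pages.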

【Comments】:

【Solution 2】:

from bs4 import BeautifulSoup
html = '<div id = #one>OneDivision<div id = #two>TwoDivision</div><span>SpanElement</span></div>'
soup = BeautifulSoup(html,"lxml")
print(soup.get_text(separator=' '))

Output

OneDivision TwoDivision SpanElement
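Note that the HTML string above has no whitespace between tags. With the indented, multi-line markup from the question, `get_text(separator=' ')` would keep the newlines and indentation in the result. A minimal sketch of one way to handle that is to pass `strip=True`, which trims each text node and drops whitespace-only ones. The parser here is swapped to `"html.parser"` (bundled with Python) only so the example runs without lxml; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# The question's markup, written with quoted ids.
html = """<div id="one">
 OneDivision
 <div id="two">TwoDivision</div>
 <span>SpanElement</span>
</div>"""

# "html.parser" ships with Python; "lxml" can be used instead if installed.
soup = BeautifulSoup(html, "html.parser")

# strip=True trims every text node and skips the ones that are only whitespace.
text = soup.get_text(separator=" ", strip=True)
print(text)
```

An equivalent spelling is `" ".join(soup.stripped_strings)`, which makes the "strip, then join" step explicit.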

【Comments】:

【Solution 3】:

Super simple! In JavaScript, use textContent. See the following code:

console.log(document.getElementById("one").textContent);

<div id = "one">
 OneDivision
 <div id = "two">TwoDivision</div>
 <span>SpanElement</span>
</div>

【Comments】: