BeautifulSoup4 get_text still has JavaScript: answers

【Question title】: BeautifulSoup4 get_text still has JavaScript
【Posted】: 2014-05-13 01:25:57
【Question】:

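I'm trying to strip all HTML and JavaScript from a page with bs4, but it doesn't get rid of the JavaScript; I still see it in the extracted text. How can I fix this?

I tried nltk, which works fine, but its clean_html and clean_url functions are going to be removed. Is there a way to get the same result with soup's get_text?

I looked at these other pages: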

BeautifulSoup get_text does not strip all tags and JavaScript

For now I'm using nltk's deprecated functions.

EDIT

Here is an example:

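from urllib.request import urlopen  # Python 3 location of urlopen

from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urlopen(url).read()
# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())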

For CNN, I still see the following in the output:

$j(function() {
"use strict";
if ( window.hasOwnProperty('safaripushLib') && window.safaripushLib.checkEnv() ) {
var pushLib = window.safaripushLib,
current = pushLib.currentPermissions();
if (current === "default") {
pushLib.checkPermissions("helloClient", function() {});
}
}
});

/*globals MainLocalObj*/
$j(window).load(function () {
'use strict';
MainLocalObj.init();
});

How can I remove the JS?

The only other option I have found is:

https://github.com/aaronsw/html2text

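The problem with html2text is that it is sometimes really, really slow and causes noticeable lag, whereas speed is one thing nltk has always been very good at.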

【Comments】:

It would really help if we could see (part of) the HTML that contains the JavaScript.
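One way to answer that request without pasting the whole page is to list which elements the leftover text comes from. The sketch below is only an illustration: the helper name `text_by_parent` and the sample markup are made up for this example, and it uses `find_all(string=True)` to walk every text node.

```python
from collections import Counter

from bs4 import BeautifulSoup


def text_by_parent(html):
    """Count non-blank text nodes per enclosing tag name (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    counts = Counter()
    for node in soup.find_all(string=True):
        if node.strip():  # ignore whitespace-only nodes
            counts[node.parent.name] += 1
    return counts


# Made-up markup standing in for the real page
sample = (
    "<html><body><p>Headline</p>"
    "<script>MainLocalObj.init();</script>"
    "<style>p { margin: 0; }</style>"
    "</body></html>"
)

counts = text_by_parent(sample)
print(counts)
```

If `script` or `style` shows up among the keys, the JavaScript or CSS is ordinary text inside those elements. Whether `get_text()` on the whole document includes that text may depend on the installed Beautiful Soup version, since newer releases appear to treat script and style content specially.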

Tags: python beautifulsoup nltk

【Solution 1】:

Based in part on Can I remove script tags with BeautifulSoup?

from urllib.request import urlopen  # Python 3 location of urlopen

from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly

# kill all script and style elements
for script in soup(["script", "style"]):
    script.decompose()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)
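The same steps can be tried without a network request by wrapping them in a function that takes an HTML string. This is a minimal sketch: the function name `visible_text` and the sample markup are illustrative and not part of the original answer.

```python
from bs4 import BeautifulSoup


def visible_text(html):
    """Return the text of an HTML string with <script>/<style> content removed."""
    soup = BeautifulSoup(html, "html.parser")

    # Remove script and style elements together with everything inside them
    for tag in soup(["script", "style"]):
        tag.decompose()

    text = soup.get_text()
    # Strip each line, split on double spaces, and drop empty pieces
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return "\n".join(chunk for chunk in chunks if chunk)


# Made-up markup standing in for a downloaded page
sample = """
<html><head>
<style>body { color: red; }</style>
<script>var mode = "use strict";</script>
</head><body>
<h1>Top story</h1>
<p>First paragraph.</p>
<script>MainLocalObj.init();</script>
</body></html>
"""

print(visible_text(sample))
```

For the sample above only the heading and the paragraph remain, one per line.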

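【Comments】:

Instead of script.extract(), it is better to use script.decompose(), which simply removes the tag without returning the tag object.
You're building a lot of data structures just so you don't have to write re.sub("[ \n\r\t]{2,}", " ", text) :)
@badp In particular, couldn't you just say soup.get_text(" ", strip=True)?
@CsabaToth, @badp, you don't actually want strip=True, because it causes strings to be joined incorrectly. It's important to keep the whitespace, then use splitlines, then clean up each individual string.
@HughBothwell, this still doesn't strip script tags completely; see here.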
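The alternatives raised in these comments can be compared side by side on a small fragment. The markup below is made up for illustration; the calls themselves (`extract()`, `get_text()` with a separator and `strip=True`, and the regular expression from the comment) are the ones discussed above.

```python
import re

from bs4 import BeautifulSoup

html = "<p>Hello <b>world</b>!</p>\n<script>init();</script><p>Next   story</p>"
soup = BeautifulSoup(html, "html.parser")

# extract() detaches the tag and returns it; decompose() destroys it and returns None
removed = soup.script.extract()
print(removed.name)  # script

plain = soup.get_text()
print(repr(plain))  # 'Hello world!\nNext   story'

# strip=True trims every text node before joining, so inline pieces run together
stripped = soup.get_text(strip=True)
print(repr(stripped))  # 'Helloworld!Next   story'

# Adding a separator avoids that, but puts a space before the "!"
spaced = soup.get_text(" ", strip=True)
print(repr(spaced))  # 'Hello world ! Next   story'

# The regex collapses runs of two or more whitespace characters into one space
collapsed = re.sub(r"[ \n\r\t]{2,}", " ", plain)
print(repr(collapsed))  # 'Hello world!\nNext story'
```

None of the variants is wrong as such; the choice depends on whether line structure or inline spacing matters more for the text being extracted.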

【Solution 2】:

To prevent encoding errors at the end...

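import sys
from urllib.request import urlopen  # Python 3 location of urlopen

from bs4 import BeautifulSoup

url = "http://www.cnn.com"  # same page as in the question
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# split each line on double spaces so that run-together headlines get their own line
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

# write UTF-8 bytes directly so the console encoding cannot raise UnicodeEncodeError
sys.stdout.buffer.write(text.encode('utf-8') + b"\n")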

【Comments】:

This still doesn't strip script tags completely, for example here.
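The page behind that comment is not shown, so the cause is unknown. One possible explanation, assumed here, is that the unwanted text sits in containers other than script and style, such as noscript, template, the document head, or HTML comments. The sketch below filters text nodes by their ancestors instead of deleting tags; the `HIDDEN_ANCESTORS` set and the name `visible_strings` are assumptions made for this example, not something the answers above use.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, Doctype

# Assumption: text inside these containers is never meant for the reader
HIDDEN_ANCESTORS = {"script", "style", "noscript", "template", "head"}


def visible_strings(html):
    """Return non-blank text nodes that have no hidden ancestor (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    result = []
    for node in soup.find_all(string=True):
        # Comments and the doctype are string nodes too, but not page text
        if isinstance(node, (Comment, Doctype)):
            continue
        # Check every ancestor, so nested markup inside <noscript> is skipped as well
        if any(parent.name in HIDDEN_ANCESTORS for parent in node.parents):
            continue
        text = node.strip()
        if text:
            result.append(text)
    return result


# Made-up markup for illustration
sample = """<!DOCTYPE html>
<html><head><title>Front page</title>
<script>$j(window).load(function () { MainLocalObj.init(); });</script>
</head><body>
<!-- ad slot -->
<h1>Top story</h1>
<noscript><p>Please enable <b>JavaScript</b>.</p></noscript>
<p onclick="track()">First paragraph.</p>
</body></html>"""

print("\n".join(visible_strings(sample)))
```

Inline handlers such as the `onclick` attribute are attribute values rather than text nodes, so they never reach the output in either approach.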