jsoup - strip all formatting and link tags, keep text only

【Question title】: jsoup - strip all formatting and link tags, keep text only
【Posted】: 2012-10-08 06:24:11
【Question】:

Let's say I have an HTML fragment like this:

<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>

What I want to extract from it is:

foo bar foobar baz

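So my question is: how can I strip all the wrapping tags from the HTML and get only the text, in the same order as it appears in the HTML? As you can see from the title, I want to use jsoup for the parsing.

Example of HTML with accented characters (note the 'á' character):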

<p><strong>Tarthatatlan biztonsági viszonyok</strong></p>
<p><strong>Tarthatatlan biztonsági viszonyok</strong></p>

What I want:

Tarthatatlan biztonsági viszonyok
Tarthatatlan biztonsági viszonyok

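This HTML is not static; in general I just want all the text of a generic HTML fragment in decoded, human-readable form, with line breaks.

【Comments】:

Have you tried fragment.text()?

Tags: java html jsoup

【Solution 1】:

With Jsoup: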

final String html = "<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>";
Document doc = Jsoup.parse(html);

System.out.println(doc.text());

Output:

foo bar foobar baz

If you want only the text of the p tags, use this instead of doc.text():

doc.select("p").text();

...or only the body:

doc.body().text();

Line breaks:

final String html = "<p><strong>Tarthatatlan biztonsági viszonyok</strong></p>"
        + "<p><strong>Tarthatatlan biztonsági viszonyok</strong></p>";
Document doc = Jsoup.parse(html);

for( Element element : doc.select("p") )
{
    System.out.println(element.text());
    // e.g. you can use a StringBuilder and append lines here ...
}

Output:

Tarthatatlan biztonsági viszonyok  
Tarthatatlan biztonsági viszonyok

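【Discussion】:

Thanks! Now I'm facing another problem: line breaks. It can be solved with new TextNode(elem.html(), "").getWholeText(), but that turns my special accented characters into HTML-encoded characters. How can I get decoded, human-readable characters?
Can you post some sample HTML (what HTML you have and what you need as a result) and/or code?
Thanks! That's what I came up with too, but I guessed there was a smarter way somewhere. This will do for now.

【Solution 2】:

Using a regular expression: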

String str = "<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>";
str = str.replaceAll("<[^>]*>", "");
System.out.println(str);

Output:

  foo   bar  foobar  baz

Using Jsoup:

Document doc = Jsoup.parse(str); 
String text = doc.text();

【Discussion】:

Don't use regular expressions to parse HTML: stackoverflow.com/questions/1732348/…

【Solution 3】:

Actually, the correct way to clean with Jsoup is via a Whitelist:
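
To make that comment concrete, here is a minimal sketch, using only the Java standard library, of two places where the tag-stripping regex from this answer falls short. The class name RegexStripDemo, the helper stripTags, and both sample inputs are made up for illustration; they are not from the question.

```java
import java.util.regex.Pattern;

class RegexStripDemo {
    // Same pattern as in the answer: "<", then anything up to the first ">".
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // 1. Character entities are not decoded; the regex only deletes tags.
        System.out.println(stripTags("<p>biztons&aacute;gi</p>"));
        // prints: biztons&aacute;gi

        // 2. A '>' inside an attribute value ends the match too early,
        //    so part of the tag leaks into the output.
        System.out.println("[" + stripTags("<a title=\"a > b\">link</a>") + "]");
        // prints: [ b">link]
    }
}
```

An HTML parser such as jsoup decodes the entity and reads the quoted attribute correctly, which is why the other answers go through doc.text().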

...
final String html = "<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>";
Document doc = Jsoup.parse(html);
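// Note: newer jsoup releases rename this class to Safelist.
Whitelist wl = Whitelist.none();
String cleanText = Jsoup.clean(doc.html(), wl);

If you want to keep some tags as well:

// relaxed() is a static factory, so no "new" is needed
Whitelist wl = Whitelist.relaxed().removeTags("a");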

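【Discussion】:

Is there a reason not to use the static method, i.e. Jsoup.clean(html, Whitelist.none())?
Small fix: the Jsoup class cannot be instantiated; you need to use Jsoup.clean(html, wl) (you need to pass an HTML string, not a Document).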