How to parse only text from HTML

【Question title】: How to Parse Only Text from HTML
【Posted】: 2010-08-18 07:14:11
【Question】:

How can I use jsoup in Java to parse only the text from a web page?

【Comments on the question】:

【Answer 1】:

From the jsoup cookbook: http://jsoup.org/cookbook/extracting-data/attributes-text-html

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
String text = doc.body().text(); // "An example link."

【Comments】:

How do I exclude invisible elements (e.g. display: none)?

【Answer 2】:

Using classes from the JDK:

import java.io.*;
import java.net.*;
import javax.swing.text.*;
import javax.swing.text.html.*;

class GetHTMLText
{
    public static void main(String[] args)
        throws Exception
    {
        EditorKit kit = new HTMLEditorKit();
        Document doc = kit.createDefaultDocument();

        // The Document class does not yet handle charsets properly.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

        // Create a reader on the HTML content.

        Reader rd = getReader(args[0]);

        // Parse the HTML.

        kit.read(rd, doc, 0);

        //  The HTML text is now stored in the document

        System.out.println( doc.getText(0, doc.getLength()) );
    }

    // Returns a reader on the HTML data. If 'uri' begins
    // with "http:" or "https:", it's treated as a URL;
    // otherwise, it's assumed to be a local filename.

    static Reader getReader(String uri)
        throws IOException
    {
        // Retrieve from Internet.
        if (uri.startsWith("http:") || uri.startsWith("https:"))
        {
            URLConnection conn = new URL(uri).openConnection();
            return new InputStreamReader(conn.getInputStream());
        }
        // Retrieve from file.
        else
        {
            return new FileReader(uri);
        }
    }
}
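
The program above reads from a URL or a file. For HTML that is already held in a string, the same HTMLEditorKit approach might be wrapped roughly as follows. This is only a sketch: the class name `HtmlStringToText` and the method `toPlainText` are made up for illustration and are not part of the answer or of the JDK.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.EditorKit;
import javax.swing.text.html.HTMLEditorKit;

class HtmlStringToText {

    // Hypothetical helper: parses an in-memory HTML string with Swing's
    // HTMLEditorKit and returns the text content of the resulting document.
    static String toPlainText(String html) {
        EditorKit kit = new HTMLEditorKit();
        Document doc = kit.createDefaultDocument();

        // Same workaround as in the answer: ignore charset directives.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

        try (StringReader reader = new StringReader(html)) {
            kit.read(reader, doc, 0);
            // trim() drops line breaks the document may add at either end.
            return doc.getText(0, doc.getLength()).trim();
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException("Could not parse HTML", e);
        }
    }

    public static void main(String[] args) {
        String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
        System.out.println(toPlainText(html));
    }
}
```

Whitespace in the result is decided by the Swing parser, so it may not match what a browser or jsoup would produce for the same markup.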

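【Comments】:

【Answer 3】:

Well, here is a quick method I once threw together. It uses regular expressions to get the job done. Most people would agree this is not a good way to do it. So use it at your own risk.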

public static String getPlainText(String html) {
    String htmlBody = html.replaceAll("<hr>", ""); // one off for horizontal rule lines
    String plainTextBody = htmlBody.replaceAll("<[^<>]+>([^<>]*)<[^<>]+>", "$1");
    plainTextBody = plainTextBody.replaceAll("<br ?/>", "");
    return decodeHtml(plainTextBody);
}

This was originally used in my API wrapper for the Stack Overflow API, so it has only been tested against a small subset of HTML tags.
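
The method above calls `decodeHtml`, which the answer does not show. A minimal stand-in, assuming it only needs to turn a handful of common HTML entities back into characters, could look like this. The class name `HtmlEntities` and the choice of entities are assumptions made for illustration; the original helper may have done more.

```java
class HtmlEntities {

    // Hypothetical stand-in for the decodeHtml(...) call in the answer.
    // Only a few common entities are handled; numeric and other named
    // entities are left untouched.
    static String decodeHtml(String text) {
        return text
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&#39;", "'")
                .replace("&nbsp;", " ")
                // "&amp;" goes last so that "&amp;lt;" becomes the literal
                // text "&lt;" instead of being decoded twice into "<".
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        System.out.println(decodeHtml("a &lt; b &amp;&amp; c &gt; d")); // prints: a < b && c > d
    }
}
```

Decoding `&amp;` last is the one ordering detail that matters here; any other order risks decoding already-escaped text a second time.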

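【Comments】:

Hmm... why don't you just use a simple regular expression: replaceAll("<[^>]+>", "")?
@Crozin, I was teaching myself how to use backreferences, I guess. It looks like yours would probably work too.
This hurts! -> stackoverflow.com/questions/1732348/…
@sleep, I'm well aware that parsing HTML with regular expressions can be a terrible idea. But sometimes it really is a decent option. I did mention that they should use it at their own risk.
@jjnguy: :) - just for fun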
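
The one-line pattern suggested in the comments can be tried out as below. This is a small sketch rather than a recommendation; the class name `StripTags` and the method `stripTags` are invented for the example. The second call shows the kind of input on which a regex-based approach goes wrong.

```java
class StripTags {

    // Removes anything that looks like a tag: '<', one or more
    // characters that are not '>', then '>'.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]+>", "");
    }

    public static void main(String[] args) {
        // Simple markup is handled as expected.
        System.out.println(stripTags(
                "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>"));
        // prints: An example link.

        // A '>' inside an attribute value ends the match too early,
        // so part of the tag leaks into the output.
        System.out.println(stripTags("<a title=\"a > b\">x</a>"));
        // prints:  b">x
    }
}
```

Entities such as `&amp;` are also left as-is by this pattern, so a decoding step would still be needed afterwards.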