How to parse HTML into text in Java: answers

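【Question title】: How to parse HTML into text in Java
【Posted】: 2014-03-13 05:42:29
【Question】:

I have a program that connects to the internet and reads a file, but the result always comes back as HTML. How can I turn it into plain text?

The code is as follows: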

package urlconnectionreader;
import java.net.*;
import java.io.*;

public class URLConnectionReader {

    public static void main(String[] args) throws IOException{
        System.out.println("Hi!");
        URL oracle = new URL("http://www.oracle.com/");
        URLConnection yc = oracle.openConnection();
        yc.setRequestProperty("Content-type", "text/xml");
        yc.setRequestProperty("Accept", "text/xml, application/xml");
        BufferedReader in = new BufferedReader(new InputStreamReader(
                                    yc.getInputStream()));
        String inputLine;
        while ((inputLine = in.readLine()) != null) 
            System.out.println(inputLine);
        in.close();

    }
}

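--Edited--

I want to strip the HTML tags from its output.

【Comments】:

Well, HTML is text, ultimately. What exactly are you trying to do?
I mean, I want to display it without the HTML tags.

Tags: java html text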

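【Solution 1】:

You can use the jsoup library. It is easy to use, well documented, and optimized for fetching and parsing HTML files.

【Comments】:

【Solution 2】:

You can use Jsoup for this: http://jsoup.org/

To fetch the whole document as an object model: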

Document doc = Jsoup.connect("http://www.oracle.com/").get();

If you only want to parse HTML that your code has already read:

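// "in" is the BufferedReader from the question's code.
StringBuilder html = new StringBuilder();
String inputLine;
while ((inputLine = in.readLine()) != null) {
    html.append(inputLine).append('\n'); // keep line breaks so adjacent lines don't run together
}

// Parse the collected markup and keep only the visible text of <body>.
String text = Jsoup.parse(html.toString()).body().text();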

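【Comments】:

【Solution 3】:

Your code works fine: System.out.println(inputLine); really does print the HTML as text. You can collect the output into a String so that you can process it afterwards.

Like this: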

    package urlconnectionreader;
    import java.net.*;
    import java.io.*;

    public class URLConnectionReader {

        public static void main(String[] args) throws IOException{
            System.out.println("Hi!");
            URL oracle = new URL("http://www.sarawak.gov.my/");
            URLConnection yc = oracle.openConnection();
            yc.setRequestProperty("Content-type", "text/html");
            yc.setRequestProperty("Accept", "text/html, application/html");
            BufferedReader in = new BufferedReader(new InputStreamReader(
                                        yc.getInputStream()));
            String inputLine;
            String strtempHtml = "";
            while ((inputLine = in.readLine()) != null) {

                strtempHtml = strtempHtml+inputLine;
            }
            in.close();
            String noHTMLString = strtempHtml.replaceAll("\\<.*?>",""); 
            System.out.println(noHTMLString ); //html tag removed
        }

    }
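
The replaceAll call above is the core of this approach. As a rough sketch of how it could be made a little more forgiving using only the JDK, the version below also drops the contents of script and style blocks, removes comments, and decodes a few common entities. The class name TagStripper and the method stripTags are hypothetical names chosen for this illustration; they are not part of the answer or of any library. Regex-based stripping is still fragile (for instance, a ">" inside an attribute value will confuse it), so a real parser such as jsoup is likely the safer choice for arbitrary pages.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not taken from the original answer.
public class TagStripper {

    // <script>...</script> and <style>...</style>, including their contents.
    // (?i) ignores case, (?s) lets '.' match line breaks.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");

    // An HTML comment (which may span several lines) or any single tag.
    private static final Pattern COMMENT_OR_TAG =
            Pattern.compile("(?s)<!--.*?-->|<[^>]+>");

    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    public static String stripTags(String html) {
        // Replace markup with a space so that text from neighbouring
        // elements does not get glued together.
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = COMMENT_OR_TAG.matcher(text).replaceAll(" ");

        // Decode only a handful of common entities. "&amp;" goes last so
        // that "&amp;lt;" ends up as "&lt;" rather than "<".
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");

        // Collapse runs of whitespace (including line breaks) to one space.
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p { color: red; }</style></head>"
                + "<body><p>Hello,\n<b>world</b> &amp; friends</p>"
                + "<script>var x = 1 < 2;</script></body></html>";
        System.out.println(stripTags(html)); // prints: Hello, world & friends
    }
}
```

Because every tag is replaced by a space, inline markup in the middle of a word (for example "wor<b>ld</b>") comes out split in two; that trade-off seemed preferable here to gluing separate paragraphs together.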

【Comments】:

I mean, I want to get rid of the HTML tags.
String noHTMLString = htmlString.replaceAll("\\<.*?>","");