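【Question title】: Using Java to pull data from a webpage?
【Posted】: 2023-03-09 20:11:01
【Question description】:

I'm trying to write my first program in Java. The goal is a program that browses a website and downloads files for me. However, I don't know how to use Java to interact with the internet. Can anyone tell me which topics to look up or read about, or recommend some good resources?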

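【Question comments】:

You can use Apache's HttpClient. There is a somewhat similar answer here

Tags: java

【Solution 1】:

The simplest solution (one that doesn't depend on any third-party library or platform) is to create a URL instance pointing to the web page or link you want to download, and read the content using streams.

For example: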

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;


public class DownloadPage {

    public static void main(String[] args) throws IOException {

        // Make a URL to the web page
        URL url = new URL("http://stackoverflow.com/questions/6159118/using-java-to-pull-data-from-a-webpage");

        // Get the input stream through URL Connection
        URLConnection con = url.openConnection();
        InputStream is = con.getInputStream();

        // Once you have the Input Stream, it's just plain old Java IO stuff.

        // For this case, since you are interested in getting plain-text web page
        // I'll use a reader and output the text content to System.out.

        // For binary content, it's better to directly read the bytes from stream and write
        // to the target file.

        // try-with-resources closes the reader (and the underlying stream) when done.
        // UTF-8 is assumed here; a page may declare a different charset.
        try (BufferedReader br = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8))) {
            String line;

            // read each line and write to System.out
            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}

Hope this helps.

【Comments】:

Hi, when I implement this, I get the HTML file in the console. How can I get a specific value from the website?
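
One way to approach the question in the comment above, as a minimal sketch: once the HTML is in a string, a regular expression can pick out a simple, well-delimited value such as the page title. The class name TitleExtractor, the method extractTitle and the sample HTML are made up for illustration. Regular expressions are brittle on real-world HTML, so for anything beyond trivial cases an HTML parser such as JSoup is probably the safer choice.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtractor {

    // Matches the text between <title> and </title>, ignoring case and line breaks.
    private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed title text, or an empty Optional if the page has no <title>.
    public static Optional<String> extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical sample; in practice this would be the text read from the URL.
        String html = "<html><head><title> Example Page </title></head><body>Hi</body></html>";
        System.out.println(extractTitle(html).orElse("(no title)"));
    }
}
```

The same pattern-and-group approach can be adapted to other values, as long as the surrounding markup is stable enough to match reliably.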

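【Solution 2】:

The basics

Have a look at these to build a solution more or less from scratch:

Start from the basics: The Java Tutorial's chapter on Networking, including Working With URLs
Make things easier for yourself: Apache HttpComponents (including HttpClient)

The easy glue-and-stitch stuff

You always have the option of calling external tools from Java using exec() and similar methods. For instance, you could use wget or cURL.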
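
As a rough sketch of that approach, the following uses ProcessBuilder (a more flexible alternative to Runtime.exec()) to run cURL. It assumes a curl executable is available on the PATH; the class name, the helper curlCommand, the URL and the output file name are illustrative only.

```java
import java.io.IOException;
import java.util.List;

public class CurlDownload {

    // Builds the argument list for: curl --fail --location --silent --output <file> <url>
    public static List<String> curlCommand(String url, String outputFile) {
        return List.of("curl", "--fail", "--location", "--silent", "--output", outputFile, url);
    }

    // Runs curl and returns its exit code (0 means success).
    public static int download(String url, String outputFile)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(curlCommand(url, outputFile))
                .inheritIO() // show curl's own messages in this program's console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical URL and file name, for illustration only.
        int exitCode = download("https://example.com/", "example.html");
        System.out.println("curl exited with code " + exitCode);
    }
}
```

Passing the arguments as a list rather than one command string avoids quoting problems when the URL or file name contains spaces or shell metacharacters.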

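The hardcore stuff

Then, if you want to go for something more fully fledged, thankfully the need for automated web testing has given us very practical tools for this. Look at:

HtmlUnit (powerful and simple)
Selenium, Selenium-RC
WebDriver/Selenium2 (still in the works)
JBehave with JBehave Web

Some other libraries were written specifically with web scraping in mind: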

JSoup
Jaunt

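Some workarounds

Java is a language, but also a platform, with many other languages running on it. Some of them integrate great syntactic sugar or libraries that make it easy to build scrapers.

Check out:

Groovy (and its XmlSlurper)
or Scala (with great XML support, as presented here and here)

If you know of a great library for Ruby (JRuby, with an article on scraping with JRuby and HtmlUnit) or Python (Jython), or you simply prefer these languages, then give their JVM ports a chance.

Some supplements

Some other similar questions:

【Comments】:

One thing I didn't write in that answer: I really don't recommend doing this sort of thing in Java (of course, you may not have a choice, but I'm just pointing it out). It's doable, and there are plenty of tools for it, but Java's inherent verbosity makes experimenting with scraping web services less friendly. Usually I'd rather do this from a dynamic language with a REPL, or directly from the browser's console, and so on. But of course, nothing prevents you from starting that way and then implementing the solution in Java, or in another JVM-based language!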

【Solution 3】:

Here is my solution, using URL and a try-with-resources statement to catch the exceptions.

/**
 * Created by mona on 5/27/16.
 */
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ReadFromWeb {
    public static void readFromWeb(String webURL) throws IOException {
        // Build the URL and open the stream inside the try block so that a
        // malformed URL is caught below and the stream is always closed.
        // UTF-8 is assumed; a page may declare a different charset.
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new URL(webURL).openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }
        }
        catch (MalformedURLException e) {
            e.printStackTrace();
            throw new MalformedURLException("URL is malformed: " + webURL);
        }
        catch (IOException e) {
            e.printStackTrace();
            // Rethrow the original exception so its message and cause are not lost.
            throw e;
        }

    }
    public static void main(String[] args) throws IOException {
        String url = "https://madison.craigslist.org/search/sub";
        readFromWeb(url);
    }

}

You can also save it to a file as needed, or parse it with an XML or HTML library.
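
Saving to a file could look like the sketch below, which copies the raw bytes of the response straight to disk with Files.copy, so it works for binary content as well as text. The class name SaveToFile, the method download and the target file name are assumptions chosen for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class SaveToFile {

    // Copies everything the URL returns into the target file and
    // returns the number of bytes written.
    public static long download(String webURL, Path target) throws IOException {
        try (InputStream in = new URL(webURL).openStream()) {
            // Bytes are copied unchanged; an existing target file is overwritten.
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical target file name, for illustration only.
        long bytes = download("https://madison.craigslist.org/search/sub", Paths.get("page.html"));
        System.out.println("Saved " + bytes + " bytes");
    }
}
```

Because no Reader is involved, no character decoding takes place, which is what you want for images, archives and other non-text downloads.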

【Comments】:

【Solution 4】:

Since Java 11, the most convenient way is to use java.net.http.HttpClient from the standard library.

Example:

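import java.net.Authenticator;
import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpClient.Redirect;
import java.net.http.HttpClient.Version;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpResponse.BodyHandlers;
import java.time.Duration;

public class HttpClientExample {
    public static void main(String[] args) throws Exception {
        // The proxy and authenticator lines are optional; drop them if you use neither.
        // Authenticator.getDefault() is null unless Authenticator.setDefault(...) was called.
        HttpClient client = HttpClient.newBuilder()
             .version(Version.HTTP_1_1)
             .followRedirects(Redirect.NORMAL)
             .connectTimeout(Duration.ofSeconds(20))
             .proxy(ProxySelector.of(new InetSocketAddress("proxy.example.com", 80)))
             .authenticator(Authenticator.getDefault())
             .build();

        HttpRequest request = HttpRequest.newBuilder()
             .uri(URI.create("https://foo.com/"))
             .timeout(Duration.ofMinutes(2))
             .GET()
             .build();

        HttpResponse<String> response = client.send(request, BodyHandlers.ofString());

        System.out.println(response.statusCode());

        System.out.println(response.body());
    }
}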
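
Since the question is about downloading files, here is a hedged variation on the same API that streams the response body straight to a file via BodyHandlers.ofFile. The class name, the helper buildRequest, the URL and the target path are placeholders, not part of the original answer.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpResponse.BodyHandlers;
import java.nio.file.Path;
import java.time.Duration;

public class HttpClientDownload {

    // Builds a GET request with a two-minute timeout for the given address.
    public static HttpRequest buildRequest(String address) {
        return HttpRequest.newBuilder()
                .uri(URI.create(address))
                .timeout(Duration.ofMinutes(2))
                .GET()
                .build();
    }

    // Sends the request and writes the response body to the target file.
    public static HttpResponse<Path> download(String address, Path target)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .connectTimeout(Duration.ofSeconds(20))
                .build();
        return client.send(buildRequest(address), BodyHandlers.ofFile(target));
    }

    public static void main(String[] args) throws Exception {
        // Placeholder URL and file name, for illustration only.
        HttpResponse<Path> response = download("https://example.com/", Path.of("example.html"));
        System.out.println(response.statusCode() + " -> " + response.body());
    }
}
```

The file handler writes whatever body the server returns, including error pages, so it is worth checking statusCode() before trusting the saved file.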

【Comments】:

【Solution 5】:

My API uses the following code:

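// Requires imports from java.io, java.net and java.nio.charset.StandardCharsets.
// The reader decodes the bytes as UTF-8 (an assumed charset) and
// try-with-resources closes the stream when done.
try (Reader content = new InputStreamReader(
        new URL("https://stackoverflow.com/questions/6159118/using-java-to-pull-data-from-a-webpage").openStream(),
        StandardCharsets.UTF_8)) {
    int c;
    while ((c = content.read()) != -1) System.out.print((char) c);
} catch (MalformedURLException e) {
    e.printStackTrace();
} catch (IOException ie) {
    ie.printStackTrace();
}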

You can capture the characters and convert them into a string.
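
A minimal sketch of that idea, assuming Java 9 or later and UTF-8 content: read all bytes first and decode them in one step, which avoids mangling multi-byte characters the way a per-byte cast to char would. The class and method names are illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StreamToString {

    // Reads the whole stream and decodes it as UTF-8 (an assumed charset).
    public static String readAll(InputStream in) throws IOException {
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for url.openStream() here.
        byte[] sample = "<p>caf\u00e9</p>".getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(sample)) {
            String html = readAll(in);
            System.out.println(html);
        }
    }
}
```

For very large responses, reading line by line or copying to a file is likely the better fit, since readAllBytes holds the entire body in memory.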

【Comments】: