[Posted]: 2018-02-18 20:17:48
[Question]:
I'm trying to read the title of the page at https://www.groupon.pl/deals/ga-hotel-alpin-17 (the problem is specific to this particular site):
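import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.Scanner;

String address = "https://www.groupon.pl/deals/ga-hotel-alpin-17";
URL url = new URL(address);
URLConnection httpcon = url.openConnection();
// Fail after 5 seconds instead of blocking indefinitely.
httpcon.setConnectTimeout(5000);
httpcon.setReadTimeout(5000);
httpcon.addRequestProperty("User-Agent", "Mozilla/4.0");
InputStream response = httpcon.getInputStream();
// The "\\A" delimiter makes the Scanner return the whole stream as one token.
Scanner scanner = new Scanner(response);
String responseBody = scanner.useDelimiter("\\A").next();
// Take the text between <title> and </title>, matched case-insensitively.
String title = responseBody.substring(responseBody.toUpperCase().indexOf("<TITLE>") + 7, responseBody.toUpperCase().indexOf("</TITLE>"));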
I get either a 403 or a SocketTimeoutException:
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
at sun.security.ssl.InputRecord.read(InputRecord.java:503)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:940)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1569)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:254)
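Fetching this page works without any problem using, for example, a plain wget command.
I suspect the server doesn't want to be queried from Java, but why doesn't setting the User-Agent help? What else can be done to mimic real browser behavior? Any ideas?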
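One thing that might be worth trying is sending a fuller set of browser-like headers instead of only a short User-Agent. The sketch below is an illustration under assumptions, not a confirmed fix: the class name BrowserLikeRequest, the helper open, and the specific header values (a Firefox-style User-Agent string, Accept, Accept-Language) are all made up for the example, and whether this site accepts them is unknown.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

class BrowserLikeRequest {

    // Hypothetical helper: opens a connection and sets headers that a desktop
    // browser would typically send. Nothing goes over the network until the
    // caller asks for the response (e.g. via getResponseCode()).
    static HttpURLConnection open(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        // Illustrative values only; any real browser's headers could be copied here.
        conn.setRequestProperty("User-Agent",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0");
        conn.setRequestProperty("Accept",
                "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        conn.setRequestProperty("Accept-Language", "pl-PL,pl;q=0.9,en;q=0.8");
        return conn;
    }

    public static void main(String[] args) {
        try {
            HttpURLConnection conn = open("https://www.groupon.pl/deals/ga-hotel-alpin-17");
            // getResponseCode() sends the request and reads the status line.
            System.out.println("HTTP status: " + conn.getResponseCode());
            conn.disconnect();
        } catch (IOException e) {
            System.out.println("Request failed: " + e);
        }
    }
}
```

Comparing these headers with what wget or a browser actually sends (for instance with wget --debug) may show which one the server reacts to.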
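[Comments]:
- There is no such exception as ReadTimeoutException. Read the stack trace. Your read timeout is too short. That much is obvious.
- Not quite... If I don't set a timeout, I end up waiting far too long. I tried 60 seconds and still got the same problem...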
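Since the question mentions both a 403 and a read timeout, it may help to tell the two outcomes apart before tuning timeouts further. The following is a minimal sketch, assuming a hypothetical helper diagnose in a made-up class FetchDiagnosis; it only reports how the request ended and does not try to work around the block.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.SocketTimeoutException;
import java.net.URL;

class FetchDiagnosis {

    // Hypothetical helper: returns a short label describing how the request ended.
    static String diagnose(String address, int connectTimeoutMs, int readTimeoutMs) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
            conn.setConnectTimeout(connectTimeoutMs);
            conn.setReadTimeout(readTimeoutMs);
            // Sends the request and waits for the status line. Unlike
            // getInputStream(), this does not throw for a 4xx/5xx status.
            int status = conn.getResponseCode();
            conn.disconnect();
            return "HTTP " + status;
        } catch (SocketTimeoutException e) {
            // The message usually says whether connecting or reading timed out.
            return "timeout: " + e.getMessage();
        } catch (IOException e) {
            return "I/O error: " + e;
        }
    }

    public static void main(String[] args) {
        // 60 s read timeout, matching the value tried in the comment above.
        System.out.println(diagnose("https://www.groupon.pl/deals/ga-hotel-alpin-17", 5000, 60000));
    }
}
```

If this prints an HTTP status such as 403, the server answered and rejected the request. If it prints a read timeout even with a long limit, the server accepted the connection but never sent a response, which a longer timeout alone is unlikely to fix.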
Tags: java user-agent urlconnection