【Question title】: How to read HTML content from this page with Java?
【Posted】: 2016-11-21 17:54:54
【Question description】:

My Java application is trying to read content from the following URL: https://www.iplocation.net/?query=62.92.63.48

I used the following method:

  StringBuffer readFromUrl(String Url)
  {
    StringBuffer sb=new StringBuffer();
    BufferedReader in=null;
    
    try
    {
      in=new BufferedReader(new InputStreamReader(new URL(Url).openStream()));
      String inputLine;
    
      while ((inputLine=in.readLine()) != null) sb.append(inputLine+"\n");
      in.close();
    }
    catch (Exception e) { e.printStackTrace(); }
    finally 
    {
      try 
      {
        if (in!=null)
        {
          in.close();
          in=null;
        }
      }
      catch (Exception ex) { ex.printStackTrace(); }
    }
    return sb;
  }

It usually works for other URLs, but for this one the result differs from what the browser shows. It looks like this:

<html>
<head>
<META NAME="robots" CONTENT="noindex,nofollow">
<script>
(function(){function getSessionCookies(){var cookieArray=new Array();var cName=/^\s?incap_ses_/;var c=document.cookie.split(";");for(var i=0;i<c.length;i++){var key=c[i].substr(0,c[i].indexOf("="));var value=c[i].substr(c[i].indexOf("=")+1,c[i].length);if(cName.test(key)){cookieArray[cookieArray.length]=value}}return cookieArray}function setIncapCookie(vArray){var res;try{var cookies=getSessionCookies();var digests=new Array(cookies.length);for(var i=0;i<cookies.length;i++){digests[i]=simpleDigest((vArray)+cookies[i])}res=vArray+",digest="+(digests.join())}catch(e){res=vArray+",digest="+(encodeURIComponent(e.toString()))}createCookie("___utmvc",res,20)}function simpleDigest(mystr){var res=0;for(var i=0;i<mystr.length;i++){res+=mystr.charCodeAt(i)}return res}function createCookie(name,value,seconds){var expires="";if(seconds){var date=new Date();date.setTime(date.getTime()+(seconds*1000));var expires="; expires="+date.toGMTString()}document.cookie=name+"="+value+expires+"; path=/"}function test(o){var res="";var vArray=new Array();for(var j=0;j<o.length;j++){var test=o[j][0];switch(o[j][1]){case"exists":try{if(typeof(eval(test))!="undefined"){vArray[vArray.length]=encodeURIComponent(test+"=true")}else{vArray[vArray.length]=encodeURIComponent(test+"=false")}}catch(e){vArray[vArray.length]=encodeURIComponent(test+"=false")}break;case"value":try{try{res=eval(test);if(typeof(res)==="undefined"){vArray[vArray.length]=encodeURIComponent(test+"=undefined")}else if(res===null){vArray[vArray.length]=encodeURIComponent(test+"=null")}else{vArray[vArray.length]=encodeURIComponent(test+"="+res.toString())}}catch(e){vArray[vArray.length]=encodeURIComponent(test+"=cannot evaluate");break}break}catch(e){vArray[vArray.length]=encodeURIComponent(test+"="+e)}case"plugin_extentions":try{var extentions=[];try{i=extentions.indexOf("i")}catch(e){vArray[vArray.length]=encodeURIComponent("plugin_ext=indexOf is not a function");break}try{var num=navigator.plugins.length 
if(num==0||num==null){vArray[vArray.length]=encodeURIComponent("plugin_ext=no plugins");break}}catch(e){vArray[vArray.length]=encodeURIComponent("plugin_ext=cannot evaluate");break}for(var i=0;i<navigator.plugins.length;i++){if(typeof(navigator.plugins[i])=="undefined"){vArray[vArray.length]=encodeURIComponent("plugin_ext=plugins[i] is undefined");break}var filename=navigator.plugins[i].filename var ext="no extention";if(typeof(filename)=="undefined"){ext="filename is undefined"}else if(filename.split(".").length>1){ext=filename.split('.').pop()}if(extentions.indexOf(ext)<0){extentions.push(ext)}}for(i=0;i<extentions.length;i++){vArray[vArray.length]=encodeURIComponent("plugin_ext="+extentions[i])}}catch(e){vArray[vArray.length]=encodeURIComponent("plugin_ext="+e)}break}}vArray=vArray.join();return vArray}var o=[["navigator","exists"],["navigator.vendor","value"],["navigator.appName","value"],["navigator.plugins.length==0","value"],["navigator.platform","value"],["navigator.webdriver","value"],["platform","plugin_extentions"],["ActiveXObject","exists"],["webkitURL","exists"],["_phantom","exists"],["callPhantom","exists"],["chrome","exists"],["yandex","exists"],["opera","exists"],["opr","exists"],["safari","exists"],["awesomium","exists"],["puffinDevice","exists"],["navigator.cpuClass","exists"],["navigator.oscpu","exists"],["navigator.connection","exists"],["window.outerWidth==0","value"],["window.outerHeight==0","value"],["window.WebGLRenderingContext","exists"],["document.documentMode","value"],["eval.toString().length","value"]];try{setIncapCookie(test(o));document.createElement("img").src="/_Incapsula_Resource?SWKMTFSR=1&e="+Math.random()}catch(e){img=document.createElement("img");img.src="/_Incapsula_Resource?SWKMTFSR=1&e="+e}})();
</script>
<script>
(function() { 
var z="";var b="7472797B766172207868723B76617220743D6E6577204461746528292E67657454696D6528293B766172207374617475733D2273746128......6F6465555249436F6D706F6E656E74287374617475732B222028222B74696D696E672E6A6F696E28292B222922297D3B";for (var i=0;i<b.length;i+=2){z=z+parseInt(b.substring(i, i+2), 16)+",";}z = z.substring(0,z.length-1); eval(eval('String.fromCharCode('+z+')'));})();
</script></head>
<body>
<iframe style="display:none;visibility:hidden;" src="//content.incapsula.com/jsTest.html" id="gaIframe"></iframe>
</body></html>

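So what is the correct way to read the HTML content that the browser displays in this case?

Edit: after reading the suggestions, I updated my program as follows: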

StringBuilder response=new StringBuilder();
String USER_AGENT="Mozilla/5.0",inputLine;
BufferedReader in=null;    

try
{
  HttpURLConnection con=(HttpURLConnection)new URL(Url).openConnection();
  con.setRequestMethod("GET");
  con.setRequestProperty("Accept-Charset","UTF-8");
  con.setRequestProperty("User-Agent",USER_AGENT);                         // Add request header

  int responseCode=con.getResponseCode();
  in=new BufferedReader(new InputStreamReader(con.getInputStream()));
  while ((inputLine=in.readLine())!=null) { response.append(inputLine); }
  in.close();
}
catch (Exception e) { e.printStackTrace(); }
finally 
{
  try { if (in!=null) in.close(); }
  catch (Exception ex) { ex.printStackTrace(); }
}
return response.toString();

It still doesn't work. The response I get looks like this:

<html style="height:100%"><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"><meta name="format-detection" content="telephone=no"><meta name="viewport" content="initial-scale=1.0"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"></head><body style="margin:0px;height:100%"><iframe src="/_Incapsula_Resource?CWUDNSAI=24&xinfo=8-75933493-0 0NNN RT(1479758027223 127) q(0 -1 -1 -1) r(0 -1) B12(4,315,0) U10000&incident_id=516000100118713619-514529209419563176&edet=12&cinfo=04000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 516000100118713619-514529209419563176</iframe></body></html>

Can someone show some sample code that works?

Thanks to @thatguy, I have modified my program as follows:

import java.util.*;
import java.util.concurrent.*;
import java.io.*;
import java.net.*;
import java.util.Map.Entry;

public class Read_From_Url_Runner implements Callable<String[]>
{
  int Id;
  String Read_From_Url_Result[]=null,IP_Location_Url="https://www.iplocation.net/?query=[IP]",IP="62.92.63.48",Cookie,Result[],A_Url;
  
  public Read_From_Url_Runner(int Id)
  {
    this.Id=Id;
    
    A_Url=IP_Location_Url.replace("[IP]",IP);
    Cookie=getIncapsulaCookie(A_Url);
    Out("Cookie = [ "+Cookie+" ]");
    
    try
    {
      Result=call();
//      for (int i=0;i<Result.length;i++) Out(Result[i]);
    }
    catch (Exception e) { e.printStackTrace(); }
  }
  
  public String[] call() throws InterruptedException
  {
    String Text;
    
    try
    {
      Text=readUrl(A_Url,Cookie);
      Out(Text);
    }
    catch (Exception e)
    {
      Out(" --> Error in data : IP = "+IP);
//    e.printStackTrace();
    }
    return Read_From_Url_Result;
  }
  
  public static String readUrl(String url,String incapsulaCookie)
  {
    StringBuilder response=new StringBuilder();
    String USER_AGENT="Mozilla/5.0",inputLine;
    BufferedReader in=null;

    try
    {
      HttpURLConnection connection=(HttpURLConnection)new URL(url).openConnection();
      connection.setRequestMethod("GET");
      connection.setRequestProperty("Accept","text/html; charset=UTF-8");
      connection.setRequestProperty("User-Agent",USER_AGENT);
      connection.setDoInput(true);
      connection.setDoOutput(true);
      connection.setRequestProperty("Cookie",incapsulaCookie);                           // Set the Incapsula cookie
      Out(connection.getRequestProperty("Cookie"));

      in=new BufferedReader(new InputStreamReader(connection.getInputStream()));
      while ((inputLine=in.readLine())!=null) { response.append(inputLine+"\n"); }
      in.close();
    }
    catch (Exception e) { e.printStackTrace(); }
    finally
    {
      try { if (in!=null) in.close(); }
      catch (Exception ex) { ex.printStackTrace(); }
    }
    return response.toString();
  }
  
  public static String getIncapsulaCookie(String url)
  {
    String USER_AGENT="Mozilla/5.0",incapsulaCookie=null,visid=null,incap=null;          // Cookies for Incapsula, preserve order
    BufferedReader in=null;

    try
    {
      HttpURLConnection cookieConnection=(HttpURLConnection)new URL(url).openConnection();
      cookieConnection.setRequestMethod("GET");
      cookieConnection.setRequestProperty("Accept","text/html; charset=UTF-8");
      cookieConnection.setRequestProperty("User-Agent",USER_AGENT);
      cookieConnection.connect();
      
      for (Entry<String,List<String>> header : cookieConnection.getHeaderFields().entrySet())
      {
        if (header.getKey()!=null && header.getKey().equals("Set-Cookie"))               // Incapsula gives you the required cookies
        {
          for (String cookieValue : header.getValue())                                   // Search for the desired cookies
          {
            if (cookieValue.contains("visid")) visid=cookieValue.substring(0,cookieValue.indexOf(";")+1);
            if (cookieValue.contains("incap_ses")) incap=cookieValue.substring(0,cookieValue.indexOf(";"));
          }
        }
      }
      incapsulaCookie=visid+" "+incap;
      cookieConnection.disconnect();
    }
    catch (Exception e) { e.printStackTrace(); }
    finally
    {
      try { if (in!=null) in.close(); }
      catch (Exception ex) { ex.printStackTrace(); }
    }
    return incapsulaCookie;
  }
  
  private static void out(String message) { System.out.print(message); }
  private static void Out(String message) { System.out.println(message); }
  
  public static void main(String[] args)
  {
    final Read_From_Url_Runner demo=new Read_From_Url_Runner(0);
  }
}

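But this only gets the first part of the response, as shown below:

What I actually want to get is the following:

This result was obtained by running my program from How to shut down Javafx?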

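【Comments】:

  • Essentially, you need to make the same request as the browser does. You can probably work out by trial and error which header causes the markup to change.
  • I would imagine this is probably a user-agent check.

Tags: java html url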


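【Solution 1】:

The problem you are facing is probably down to HTTP request headers that you did not set explicitly. Websites often deliver different representations depending on attributes in the HTTP headers (and payload), so that desktop and mobile clients are each served appropriately. Your code sets nothing, so it sends whatever default headers the library supplies. If you inspect the exact HTTP headers your browser sends, there will most likely be differences (for example the User-Agent or the encoding). If you rebuild those headers in your code, the result should be the same.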

In addition, you can use HttpURLConnection, which lets you set and read the relevant HTTP headers easily, as in this SO post. Otherwise, for URLConnection, see here.
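As a minimal sketch of rebuilding browser-like headers with HttpURLConnection: the class name BrowserLikeRequest, its method names, and the concrete header values below are assumptions made up for illustration. Copy the real values from your browser's network inspector.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

class BrowserLikeRequest {

    // Hypothetical header values; replace them with what your browser really sends.
    static Map<String, String> browserHeaders() {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent", "Mozilla/5.0");
        headers.put("Accept", "text/html");
        headers.put("Accept-Language", "en-US,en;q=0.5");
        return headers;
    }

    // Prepares a GET connection carrying the given headers.
    // Nothing is sent yet: the request only goes out once the response is read.
    static HttpURLConnection open(String url, Map<String, String> headers) {
        try {
            HttpURLConnection con =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            con.setRequestMethod("GET");
            for (Map.Entry<String, String> header : headers.entrySet()) {
                con.setRequestProperty(header.getKey(), header.getValue());
            }
            return con;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling BrowserLikeRequest.open(url, BrowserLikeRequest.browserHeaders()) and then reading con.getInputStream() sends the request with those headers. Adding or removing one entry at a time is a simple way to find out which header changes the response.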

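Further investigation

Your approach returns a special error page, which shows that the website uses an additional security feature from Incapsula. The page you get looks like this:

While investigating the headers, I noticed that two cookie strings need to be present so that you reach the website directly instead of the security check: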

visid_incap_...=...
incap_ses_..._...=...

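You can obtain them with a single request that lands on the error page; its Set-Cookie headers give you both cookie strings. You can then request the website directly with the Cookie header set to visid_incap_...=...; incap_ses_..._...=.... You can repeat the request as many times as you like until the cookies expire, which you can detect simply by checking for the error page. Here is working code. It obviously lacks style and extra checks, but it solves your problem. The rest is up to you.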

public static String getIncapsulaCookie(String url) {

    String USER_AGENT = "Mozilla/5.0";
    BufferedReader in = null;

    String incapsulaCookie = null;

    try {

        HttpURLConnection cookieConnection =
                (HttpURLConnection) new URL(url).openConnection();
        cookieConnection.setRequestMethod("GET");
        cookieConnection.setRequestProperty("Accept",
                "text/html; charset=UTF-8");
        cookieConnection.setRequestProperty("User-Agent", USER_AGENT);

        // Disable 'keep-alive'
        cookieConnection.setRequestProperty("Connection", "close");

        // Cookies for Incapsula, preserve order
        String visid = null;
        String incap = null;

        cookieConnection.connect();

        for (Entry<String, List<String>> header : cookieConnection
                .getHeaderFields().entrySet()) {

            // Incapsula gives you the required cookies
            if (header.getKey() != null
                    && header.getKey().equals("Set-Cookie")) {

                // Search for the desired cookies
                for (String cookieValue : header.getValue()) {
                    if (cookieValue.contains("visid")) {
                        visid = cookieValue.substring(0,
                                cookieValue.indexOf(";") + 1);
                    }
                    if (cookieValue.contains("incap_ses")) {
                        incap = cookieValue.substring(0,
                                cookieValue.indexOf(";"));
                    }
                }
            }
        }

        incapsulaCookie = visid + " " + incap;

        // Explicitly disconnect, also essential in this method!
        cookieConnection.disconnect();

    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            if (in != null)
                in.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }

    return incapsulaCookie;

}

This method extracts the Incapsula cookies for you. Here is a modified version of your method that uses the cookie:

public static String readUrl(String url, String incapsulaCookie) {

    StringBuilder response = new StringBuilder();
    String USER_AGENT = "Mozilla/5.0", inputLine;
    BufferedReader in = null;

    try {

        HttpURLConnection connection =
                (HttpURLConnection) new URL(url).openConnection();
        connection.setRequestMethod("GET");
        connection.setRequestProperty("Accept", "text/html; charset=UTF-8");
        connection.setRequestProperty("User-Agent", USER_AGENT);

        // Set the Incapsula cookie
        connection.setRequestProperty("Cookie", incapsulaCookie);

        in = new BufferedReader(
                new InputStreamReader(connection.getInputStream()));

        while ((inputLine = in.readLine()) != null) {
            response.append(inputLine);
        }

        in.close();

    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            if (in != null)
                in.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
    return response.toString();

}
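The cookie extraction inside getIncapsulaCookie(...) can also be factored into a pure helper that is easy to check without touching the network. This is only a sketch: the class name IncapsulaCookies and the method buildCookieHeader are hypothetical, and it assumes the two cookie names start with visid_incap_ and incap_ses_, as in the headers shown above.

```java
import java.util.List;

class IncapsulaCookies {

    // Builds a Cookie request-header value from raw Set-Cookie header values.
    // Only the leading name=value part of each cookie is kept; attributes such as
    // "path" or "expires" are dropped. Returns null if either cookie is missing,
    // instead of producing a string that contains the text "null".
    static String buildCookieHeader(List<String> setCookieValues) {
        String visid = null;
        String incap = null;
        for (String raw : setCookieValues) {
            int end = raw.indexOf(';');
            String pair = (end >= 0 ? raw.substring(0, end) : raw).trim();
            if (pair.startsWith("visid_incap_")) {
                visid = pair;
            } else if (pair.startsWith("incap_ses_")) {
                incap = pair;
            }
        }
        if (visid == null || incap == null) {
            return null;
        }
        // Same order as above: visid_incap_... first, then incap_ses_...
        return visid + "; " + incap;
    }
}
```

You would pass it the list stored under the "Set-Cookie" key of cookieConnection.getHeaderFields() and check for null before calling readUrl(...).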

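As far as I have observed, the user agent and the other properties do not seem to matter. You can now call getIncapsulaCookie(String url) once, or whenever you need a new cookie, to obtain the cookie, and then call readUrl(String url, String incapsulaCookie) as many times as you like to request the page until the cookie expires. The result is the complete HTML page, part of which is shown in the image below.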
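The "fetch a new cookie when the old one has expired" step could look roughly like this. It is a sketch under assumptions: the class IncapsulaRetry and its method names are hypothetical, and the expiry check relies only on the "Incapsula incident ID" text seen in the error page from the question, which may not cover every kind of block page.

```java
import java.util.function.BiFunction;
import java.util.function.Function;

class IncapsulaRetry {

    // Assumption: a blocked response contains the marker text from the error
    // page quoted in the question. A null response is treated as blocked too.
    static boolean isBlockPage(String html) {
        return html == null || html.contains("Incapsula incident ID");
    }

    // Reads the page with the current cookie. If the block page comes back,
    // fetches a fresh cookie once and tries again.
    // cookieSource maps url -> cookie; pageReader maps (url, cookie) -> html.
    static String fetchWithRefresh(String url, String cookie,
                                   Function<String, String> cookieSource,
                                   BiFunction<String, String, String> pageReader) {
        String html = pageReader.apply(url, cookie);
        if (isBlockPage(html)) {
            String freshCookie = cookieSource.apply(url);
            html = pageReader.apply(url, freshCookie);
        }
        return html;
    }
}
```

The two function parameters are meant to be method references to the getIncapsulaCookie and readUrl methods above. Passing them in keeps the retry logic testable without a network connection.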

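Important detail: the getIncapsulaCookie(...) method contains two essential statements, cookieConnection.setRequestProperty("Connection", "close"); and cookieConnection.disconnect();. Both are required if you want to call readUrl(...) immediately afterwards. If you omit them, the HTTP connection stays alive on the server side after you receive the cookies, and the next call to readUrl(...) returns the wrong page. You can try this by omitting the statements, calling getIncapsulaCookie(...), waiting 5 to 65 seconds, and then calling readUrl(...). You will see that this works too, because the connection times out automatically. See also here.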

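【Comments】:

  • Updated my answer regarding HttpURLConnection and URLConnection.
  • I updated my code, but it still doesn't work. Any sample code?
  • Added a solution with working code, although there is a strange encoding bug. It solves your problem and explains the cause.
  • Thanks for the detailed answer. I tried your approach, but it only gets the first part of the page, not the content of the result. See my edited question.
  • I tested it again and got the complete HTML site with all the results. The page starts with <!DOCTYPE html><html> <head> <!-- Google Page-level ads --><script async... . You only need to copy and paste the code. Your question was edited 21 hours ago and my answer 15 hours ago, and I don't see that you have tried my example.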