【Question title】: Downloading page with HtmlUnit throws exception
【Posted】: 2013-01-13 03:46:56
【Question description】:

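I am trying to download this page using HtmlUnit 2.11: http://ekw.ms.gov.pl/pdcbdkw/pdcbdkw.html. On this page, after the initial GET, a JS script computes and sets a cookie and then submits a form. I made the headers look exactly like the ones my FF 18 sends. I checked, and both the headers and the cookie computed by the JavaScript running in HtmlUnit are correct. I verified this by downloading the page with the script and changing the form's action so that it posts to my own script, which prints the request headers and body. I use the following code: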

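import java.io.IOException;

import com.gargoylesoftware.htmlunit.HttpMethod;
import com.gargoylesoftware.htmlunit.HttpWebConnection;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.WebResponse;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

WebClient client = new WebClient();

// Adjust Cache-Control per request so the headers match what Firefox 18 sends:
// no Cache-Control on POST, "max-age=0" on everything else.
client.setWebConnection(new HttpWebConnection(client) {
    @Override
    public WebResponse getResponse(WebRequest request) throws IOException {
        if (request.getHttpMethod() == HttpMethod.POST) {
            request.removeAdditionalHeader("Cache-Control");
        } else {
            request.setAdditionalHeader("Cache-Control", "max-age=0");
        }
        return super.getResponse(request);
    }
});

// Default headers copied from a Firefox 18 request.
client.addRequestHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0");
client.addRequestHeader("Host", "ekw.ms.gov.pl");
client.addRequestHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
client.addRequestHeader("Accept-Language", "en-gb,en;q=0.5");
client.addRequestHeader("Accept-Encoding", "gzip, deflate");
client.addRequestHeader("Connection", "keep-alive");

// The page's onload script submits a form; this is the call that throws.
HtmlPage html = client.getPage("http://ekw.ms.gov.pl/pdcbdkw/pdcbdkw.html");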

Executing the code throws the following exception (in short, the root cause is a connection reset):

Exception in thread "main" ======= EXCEPTION START ========
Exception class=[java.lang.RuntimeException]
com.gargoylesoftware.htmlunit.ScriptException: Exception invoking submit
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:663)
    at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:559)
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:525)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:594)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:569)
    at com.gargoylesoftware.htmlunit.html.HtmlPage.executeJavaScriptFunctionIfPossible(HtmlPage.java:996)
    at com.gargoylesoftware.htmlunit.javascript.host.EventListenersContainer.executeEventHandler(EventListenersContainer.java:208)
    at com.gargoylesoftware.htmlunit.javascript.host.EventListenersContainer.executeBubblingListeners(EventListenersContainer.java:227)
    at com.gargoylesoftware.htmlunit.javascript.host.Node.fireEvent(Node.java:813)
    at com.gargoylesoftware.htmlunit.javascript.host.Node.fireEvent(Node.java:737)
    at com.gargoylesoftware.htmlunit.html.HtmlElement$1.run(HtmlElement.java:853)
    at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:559)
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:525)
    at com.gargoylesoftware.htmlunit.html.HtmlElement.fireEvent(HtmlElement.java:858)
    at com.gargoylesoftware.htmlunit.html.HtmlPage.executeEventHandlersIfNeeded(HtmlPage.java:1259)
    at com.gargoylesoftware.htmlunit.html.HtmlPage.initialize(HtmlPage.java:227)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:440)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:308)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:373)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:358)
    at cenatorium.kwtest.App.main(App.java:49)
Caused by: java.lang.RuntimeException: Exception invoking submit
    at net.sourceforge.htmlunit.corejs.javascript.MemberBox.invoke(MemberBox.java:163)
    at net.sourceforge.htmlunit.corejs.javascript.FunctionObject.call(FunctionObject.java:452)
    at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1473)
    at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:815)
    at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:109)
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:415)
    at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:274)
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3132)
    at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:107)
    at com.gargoylesoftware.htmlunit.javascript.host.EventHandler.call(EventHandler.java:80)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$4.doRun(JavaScriptEngine.java:587)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:651)
    ... 20 more
Caused by: java.lang.RuntimeException: java.net.SocketException: Connection reset
    at com.gargoylesoftware.htmlunit.WebClient.download(WebClient.java:2233)
    at com.gargoylesoftware.htmlunit.javascript.host.html.HTMLFormElement.submit(HTMLFormElement.java:310)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at net.sourceforge.htmlunit.corejs.javascript.MemberBox.invoke(MemberBox.java:137)
    ... 31 more
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:189)
    at java.net.SocketInputStream.read(SocketInputStream.java:121)
    at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:166)
    at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:90)
    at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:281)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:92)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:62)
    at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
    at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
    at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
    at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
    at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
    at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:712)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:517)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
    at com.gargoylesoftware.htmlunit.HttpWebConnection.getResponse(HttpWebConnection.java:171)
    at cenatorium.kwtest.App$1.getResponse(App.java:38)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseFromWebConnection(WebClient.java:1484)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponse(WebClient.java:1402)
    at com.gargoylesoftware.htmlunit.WebClient.download(WebClient.java:2226)
    ... 37 more
Enclosed exception: 
java.lang.RuntimeException: Exception invoking submit
    at net.sourceforge.htmlunit.corejs.javascript.MemberBox.invoke(MemberBox.java:163)
    at net.sourceforge.htmlunit.corejs.javascript.FunctionObject.call(FunctionObject.java:452)
    at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1473)
    at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:815)
    at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:109)
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:415)
    at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:274)
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3132)
    at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:107)
    at com.gargoylesoftware.htmlunit.javascript.host.EventHandler.call(EventHandler.java:80)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$4.doRun(JavaScriptEngine.java:587)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:651)
    at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:559)
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:525)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:594)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:569)
    at com.gargoylesoftware.htmlunit.html.HtmlPage.executeJavaScriptFunctionIfPossible(HtmlPage.java:996)
    at com.gargoylesoftware.htmlunit.javascript.host.EventListenersContainer.executeEventHandler(EventListenersContainer.java:208)
    at com.gargoylesoftware.htmlunit.javascript.host.EventListenersContainer.executeBubblingListeners(EventListenersContainer.java:227)
    at com.gargoylesoftware.htmlunit.javascript.host.Node.fireEvent(Node.java:813)
    at com.gargoylesoftware.htmlunit.javascript.host.Node.fireEvent(Node.java:737)
    at com.gargoylesoftware.htmlunit.html.HtmlElement$1.run(HtmlElement.java:853)
    at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:559)
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:525)
    at com.gargoylesoftware.htmlunit.html.HtmlElement.fireEvent(HtmlElement.java:858)
    at com.gargoylesoftware.htmlunit.html.HtmlPage.executeEventHandlersIfNeeded(HtmlPage.java:1259)
    at com.gargoylesoftware.htmlunit.html.HtmlPage.initialize(HtmlPage.java:227)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:440)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:308)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:373)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:358)
    at cenatorium.kwtest.App.main(App.java:49)
Caused by: java.lang.RuntimeException: java.net.SocketException: Connection reset
    at com.gargoylesoftware.htmlunit.WebClient.download(WebClient.java:2233)
    at com.gargoylesoftware.htmlunit.javascript.host.html.HTMLFormElement.submit(HTMLFormElement.java:310)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at net.sourceforge.htmlunit.corejs.javascript.MemberBox.invoke(MemberBox.java:137)
    ... 31 more
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:189)
    at java.net.SocketInputStream.read(SocketInputStream.java:121)
    at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:166)
    at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:90)
    at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:281)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:92)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:62)
    at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
    at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
    at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
    at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
    at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
    at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:712)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:517)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
    at com.gargoylesoftware.htmlunit.HttpWebConnection.getResponse(HttpWebConnection.java:171)
    at cenatorium.kwtest.App$1.getResponse(App.java:38)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseFromWebConnection(WebClient.java:1484)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponse(WebClient.java:1402)
    at com.gargoylesoftware.htmlunit.WebClient.download(WebClient.java:2226)
    ... 37 more
== CALLING JAVASCRIPT ==
function () {
    [native code, arity=0]
}

======= EXCEPTION END ========

Here are the headers sent in the POST (the same as in FF18, but in a different order, which should not matter):

User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0
Accept-Language: en-gb,en;q=0.5
Host: localhost
Referer: http://localhost/test.php
Accept-Encoding: gzip, deflate
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Content-Length: 69
Content-Type: application/x-www-form-urlencoded
Cookie: TS37535a_75=a538065ddccf975d9ae56e368a94eea0:hklh:781te27c:861731662
----------------- POST content
TS37535a_id=3&TS37535a_md=1&TS37535a_rf=0&TS37535a_ct=0&TS37535a_pd=0 

The cookie value is correct; I checked it.

So what is wrong? How can I get this page? It should return an HTML form, as you can see when you type http://ekw.ms.gov.pl/pdcbdkw/pdcbdkw.html into a browser.

【Question discussion】:

    Tags: java javascript http http-headers htmlunit


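    【Solution 1】:

    After 2 evenings spent writing my own HTTP client and sending raw requests, I found that:

    1. In the POST request, the data was sent in a different order from the order in which the fields appear in the HTML form:

      key2=val2&key1=val1&key3=val3

    should be

      key1=val1&key2=val2&key3=val3

    This should not be a problem, but for this server it is.

    2. After the second request to the server, I always get a connection reset exception on my socket, so I have to reconnect and retry the request.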
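
The two findings above can be sketched with the standard library alone. This is a minimal illustration, not the answerer's actual client: the class name `OrderedFormPost` and the helpers `encodeInOrder` and `retryOnReset` are hypothetical names invented for this example, and the field names are taken from the POST body shown in the question. A `LinkedHashMap` keeps the fields in insertion order, and the retry helper simply reruns the request when a `SocketException` such as "Connection reset" is thrown.

```java
import java.io.UnsupportedEncodingException;
import java.net.SocketException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;

// Hypothetical helper class; none of these names come from HtmlUnit.
class OrderedFormPost {

    /** Encodes the fields as application/x-www-form-urlencoded, in insertion order. */
    static String encodeInOrder(Map<String, String> fields) throws UnsupportedEncodingException {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(URLEncoder.encode(field.getKey(), "UTF-8"))
                .append('=')
                .append(URLEncoder.encode(field.getValue(), "UTF-8"));
        }
        return body.toString();
    }

    /** Runs the request, retrying only when the socket fails (e.g. "Connection reset"). */
    static <T> T retryOnReset(Callable<T> request, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        SocketException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return request.call();
            } catch (SocketException e) {
                last = e; // remember the failure and try again on a fresh call
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // LinkedHashMap preserves the order in which the form fields were added.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("TS37535a_id", "3");
        fields.put("TS37535a_md", "1");
        fields.put("TS37535a_rf", "0");
        fields.put("TS37535a_ct", "0");
        fields.put("TS37535a_pd", "0");
        System.out.println(encodeInOrder(fields));
    }
}
```

In a real client, the `Callable` passed to `retryOnReset` would open a new connection and send the POST each time it is called, which corresponds to the "reconnect and retry" step described above.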

    【Discussion】:
