【Question Title】: URLConnection parsing with BufferedReader prints gibberish; trying to fix it makes URLConnection.getInputStream throw FileNotFoundException
【Posted】: 2016-02-02 06:28:42
【Question】:

I have the following Java code to parse a website's source:

URL url = new URL(urlToParse);
URLConnection con = url.openConnection();
InputStream is =con.getInputStream(); 
BufferedReader br = new BufferedReader(new InputStreamReader(is));

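urlToParse is passed to this function as a parameter and equals "http://www.omegatiming.com/file/download/?id=00010F0200FFFFFFFFFFFFFFFFFFFF03".
The code is taken from here.
The output is gibberish: full of question marks and unknown characters.

I tried adding these 5 lines after the openConnection() line: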

con.setRequestMethod("GET");
con.setDoOutput(true);
con.setReadTimeout(2000);
con.setChunkedStreamingMode(0);
con.connect();  

These lines come from the solution provided here, but then I got this exception:
Exception in thread "main" java.io.FileNotFoundException: http://www.omegatiming.com/file/download/?id=00010F0200FFFFFFFFFFFFFFFFFFFF03
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1835)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1440)
It is thrown from the line InputStream is =con.getInputStream();

Pasting this link into a browser takes me to the site, so the URL itself cannot be invalid, yet calling con.getResponseCode() returns 404.

When I read the error from getErrorStream(), it prints:

<!DOCTYPE html>
<html>
    <head>
    <title>The resource cannot be found.</title>
    <meta name="viewport" content="width=device-width" />
    <style>
     body {font-family:"Verdana";font-weight:normal;font-size: .7em;color:black;} 
     p {font-family:"Verdana";font-weight:normal;color:black;margin-top: -5px}
     b {font-family:"Verdana";font-weight:bold;color:black;margin-top: -5px}
     H1 { font-family:"Verdana";font-weight:normal;font-size:18pt;color:red }
     H2 { font-family:"Verdana";font-weight:normal;font-size:14pt;color:maroon }
     pre {font-family:"Consolas","Lucida Console",Monospace;font-size:11pt;margin:0;padding:0.5em;line-height:14pt}
     .marker {font-weight: bold; color: black;text-decoration: none;}
     .version {color: gray;}
     .error {margin-bottom: 10px;}
     .expandable { text-decoration:underline; font-weight:bold; color:navy; cursor:hand; }
     @media screen and (max-width: 639px) {
      pre { width: 440px; overflow: auto; white-space: pre-wrap; word-wrap: break-word; }
     }
     @media screen and (max-width: 479px) {
      pre { width: 280px; }
     }
    </style>
</head>

<body bgcolor="white">

        <span><H1>Server Error in '/' Application.<hr width=100% size=1 color=silver></H1>

        <h2> <i>The resource cannot be found.</i> </h2></span>

        <font face="Arial, Helvetica, Geneva, SunSans-Regular, sans-serif ">

        <b> Description: </b>HTTP 404. The resource you are looking for (or one of its dependencies) could have been removed, had its name changed, or is temporarily unavailable. &nbsp;Please review the following URL and make sure that it is spelled correctly.
        <br><br>

        <b> Requested URL: </b>/file/download/<br><br>

        <hr width=100% size=1 color=silver>

        <b>Version Information:</b>&nbsp;Microsoft .NET Framework Version:4.0.30319; ASP.NET Version:4.0.30319.34248

        </font>

</body>  

 HttpException:  A public action method &#39;download&#39; was not found on controller     &#39;SwissTiming.DocMgmt.DMSWeb.Controllers.FileController&#39;.
at System.Web.Mvc.Controller.HandleUnknownAction(String actionName)
at System.Web.Mvc.Controller.<BeginExecuteCore>b__1d(IAsyncResult asyncResult, ExecuteCoreState innerState)
at System.Web.Mvc.Async.AsyncResultWrapper.WrappedAsyncVoid`1.CallEndDelegate(IAsyncResult asyncResult)
at System.Web.Mvc.Async.AsyncResultWrapper.WrappedAsyncResultBase`1.End()
at System.Web.Mvc.Controller.EndExecuteCore(IAsyncResult asyncResult)
at System.Web.Mvc.Controller.<BeginExecute>b__15(IAsyncResult asyncResult, Controller controller)
at System.Web.Mvc.Async.AsyncResultWrapper.WrappedAsyncVoid`1.CallEndDelegate(IAsyncResult asyncResult)
at System.Web.Mvc.Async.AsyncResultWrapper.WrappedAsyncResultBase`1.End()
at System.Web.Mvc.Controller.EndExecute(IAsyncResult asyncResult)
at System.Web.Mvc.Controller.System.Web.Mvc.Async.IAsyncController.EndExecute(IAsyncResult asyncResult)
at System.Web.Mvc.MvcHandler.<BeginProcessRequest>b__5(IAsyncResult asyncResult, ProcessRequestState innerState)
at System.Web.Mvc.Async.AsyncResultWrapper.WrappedAsyncVoid`1.CallEndDelegate(IAsyncResult asyncResult)
at System.Web.Mvc.Async.AsyncResultWrapper.WrappedAsyncResultBase`1.End()
at System.Web.Mvc.MvcHandler.EndProcessRequest(IAsyncResult asyncResult)
at System.Web.Mvc.MvcHandler.System.Web.IHttpAsyncHandler.EndProcessRequest(IAsyncResult result)
at System.Web.HttpApplication.CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
at System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously)
--><!-- 
This error page might contain sensitive information because ASP.NET is configured to show verbose error messages using &lt;customErrors mode="Off"/&gt;. Consider using &lt;customErrors mode="On"/&gt; or &lt;customErrors mode="RemoteOnly"/&gt; in production environments.-->  
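
For reference, the status code and the error page quoted above can be retrieved roughly as follows. This is a minimal sketch rather than the code actually used in the question: the class name ErrorBodyDump and the helper readFully are hypothetical, and the body is assumed to be UTF-8. It relies on the fact that, for a 404 response, HttpURLConnection.getInputStream() throws FileNotFoundException while the server's page stays readable through getErrorStream().

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ErrorBodyDump {

    // Drains a stream into a string, decoding as UTF-8 (an assumption);
    // returns "" when the stream is null, which getErrorStream() may return.
    public static String readFully(InputStream in) throws IOException {
        if (in == null) {
            return "";
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ErrorBodyDump <url>");
            return;
        }
        URL url = new URL(args[0]);
        // The cast is needed: getResponseCode() and getErrorStream()
        // are declared on HttpURLConnection, not on URLConnection.
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setReadTimeout(2000);

        int status = con.getResponseCode();
        System.out.println("HTTP " + status + " " + con.getResponseMessage());
        if (status >= 400) {
            // getInputStream() would throw here; the server's explanation
            // is delivered on the error stream instead.
            System.out.println(readFully(con.getErrorStream()));
        }
        con.disconnect();
    }
}
```

Printing the status line before touching the body makes it easier to see whether a given combination of connection settings changes what the server answers.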

This is basically where I am stuck; I cannot work out what the problem is. I do not even know where ASP.NET comes from.

Other attempts to work around the problem, none of which solved it:
1. Adding
httpConnection.setRequestProperty("User-Agent","Mozilla/5.0 ( compatible )");
httpConnection.setRequestProperty("Accept","*/*");
as suggested here. I also tried, as suggested here, using the userAgent from this.
I still get a FileNotFoundException in getInputStream().
2. Adding System.setProperty("http.agent", "");
as described here.
3. Going back to the original problem (the gibberish output), I tried changing the call to InputStreamReader this way:
new InputStreamReader(new URL("www.website.com").openStream(), "UTF-8"), as mentioned in a comment here, but it changed nothing.
4. Adding the lines:
con.setRequestMethod("POST"); con.setDoInput(true);
I still get a FileNotFoundException.

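I am confused.

I am not even sure whether I have an encoding problem (before I tried to fix things by adding settings to the connection there was no exception, "just" wrong output),
or whether something else is wrong with my connection so that I cannot get input from it. If so, what is special about this particular site, given that the page that led me to it, e.g. http://www.omegatiming.com/Competition?id=00010F0200FFFFFFFFFFFFFFFFFFFFFF&sport=AQ&year=2015, can be parsed without a problem?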
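
One way to tell an encoding problem from a binary response is to look at the first bytes of the body before wrapping the stream in a Reader. The sketch below checks for the "%PDF-" signature that PDF files start with, since the workaround in the answer treats this download as a PDF. The class name PdfSniffer and the method startsWithPdfMagic are hypothetical, and the sample bytes in main stand in for a real response body.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PdfSniffer {

    private static final byte[] PDF_MAGIC =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Peeks at the first bytes without consuming them, so the caller can
    // still read the stream from the start. Requires mark/reset support.
    public static boolean startsWithPdfMagic(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException(
                    "wrap the stream in a BufferedInputStream first");
        }
        in.mark(PDF_MAGIC.length);
        byte[] head = new byte[PDF_MAGIC.length];
        int read = 0;
        while (read < head.length) {
            int n = in.read(head, read, head.length - read);
            if (n == -1) {
                break; // body shorter than the signature
            }
            read += n;
        }
        in.reset();
        return read == PDF_MAGIC.length && Arrays.equals(head, PDF_MAGIC);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for con.getInputStream(); a real call would pass
        // new BufferedInputStream(con.getInputStream()) instead.
        byte[] sample = "%PDF-1.4 sample".getBytes(StandardCharsets.US_ASCII);
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(sample))) {
            if (startsWithPdfMagic(in)) {
                System.out.println("Binary PDF: do not decode it with InputStreamReader");
            } else {
                System.out.println("Not a PDF: decoding as text may be reasonable");
            }
        }
    }
}
```

If the check succeeds, changing the charset passed to InputStreamReader cannot help, because the bytes were never text in the first place.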

[here][1]: Using Java to pull data from a webpage?
[here][2]: Trying to read from a URL(in Java) produces gibberish on certain occaisions
[here][3]: URLConnection FileNotFoundException for non-standard HTTP port sources
[here][4]: Setting "User-Agent" parameters for URLConnection for querying Google from a Java application
[here][5]: Setting user agent of a java URLConnection
[here][6]: Trying to read from a URL(in Java) produces gibberish on certain occaisions

[this][1]: http://www.whatsmyuseragent.com/

【Comments】:

    Tags: asp.net-mvc character-encoding html-parsing inputstream filenotfoundexception


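    【Solution 1】:

    I managed to get around the need to parse the file directly from the web.

    I got pdfbox by adding the dependency given here to my pom.xml and running mvn clean install.
    Then I downloaded the file to my computer, using the information in this post.
    Then (now that I have pdfbox) I added these 3 lines: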

     PDDocument pdf = PDDocument.load(new File("sample.pdf"));
     PDFTextStripper stripper = new PDFTextStripper();
     String plainText = stripper.getText(pdf);
    

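    as described here.

    This is not a perfect solution. It uses storage on my computer to hold the downloaded file (perhaps only one file at a time, deleted afterwards; I have not checked this yet), and it may use too much memory, since the program has to parse the whole file in the getText() call. Still, it solves my problem of how to parse this particular site, which matters to my program only as a source of text to extract.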
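
The "store one file at a time and delete it" idea mentioned above could look roughly like this. It is a sketch under assumptions, not the code from this answer: the class TempDownload, the callback type FileAction and the method withDownloadedFile are hypothetical names, and the URL is assumed to be readable with a plain URL.openStream() call.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class TempDownload {

    // Callback that is allowed to throw IOException (hypothetical helper type).
    public interface FileAction<T> {
        T apply(Path file) throws IOException;
    }

    // Downloads the URL to a temporary file, hands the file to the callback,
    // and deletes the file afterwards, even if the callback throws.
    public static <T> T withDownloadedFile(URL url, FileAction<T> action) throws IOException {
        Path tmp = Files.createTempFile("download-", ".pdf");
        try {
            try (InputStream in = url.openStream()) {
                // createTempFile already created the file, so overwrite it.
                Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            }
            return action.apply(tmp);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java TempDownload <url>");
            return;
        }
        long size = withDownloadedFile(new URL(args[0]), Files::size);
        System.out.println("downloaded " + size + " bytes");
    }
}
```

With PDFBox 2.x on the classpath, the callback could wrap the three lines above, for example by loading the document with PDDocument.load(file.toFile()) in a try-with-resources block and returning the result of PDFTextStripper.getText. Only one file then exists on disk at any time, although getText() still builds the whole text in memory.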

    [here][1]: http://pdfbox.apache.org/2.0/getting-started.html
    [here][2]: http://blog.e-zest.net/extracting-text-from-a-pdf-file/

    [this][1]: How to download a PDF from a given URL in Java?

    【Comments】:
