Asp.net Crawler Webresponse Operation Timed Out: Answers

【Question Title】: Asp.net Crawler Webresponse Operation Timed out
【Posted】: 2011-02-20 17:44:33
【Question】:

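Hi, I have built a simple thread-pool-based web crawler inside my web application. Its job is to crawl its own application space and build a Lucene index of every valid web page and its meta content. Here is the problem. When I run the crawler from a debug server instance of Visual Studio Express and give it the IIS URL as the starting point, it works fine. However, when I do not give it the IIS instance and it uses its own URL to start the crawl (that is, it crawls its own domain space), I get an "operation timed out" exception on the WebResponse statement. Could someone please guide me on what I should or should not be doing here? Here is my code for fetching a page. It is executed in a multithreaded environment.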

private static string GetWebText(string url)
{
    string htmlText = "";

    HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(url);
    request.UserAgent = "My Crawler";

    using (WebResponse response = request.GetResponse())
    {
        using (Stream stream = response.GetResponseStream())
        {
            using (StreamReader reader = new StreamReader(stream))
            {
                htmlText = reader.ReadToEnd();
            }
        }
    }
    return htmlText;
}

And here is my stack trace:

at System.Net.HttpWebRequest.GetResponse()
   at CSharpCrawler.Crawler.GetWebText(String url) in c:\myAppDev\myApp\site\App_Code\CrawlerLibs\Crawler.cs:line 366
   at CSharpCrawler.Crawler.CrawlPage(String url, List`1 threadCityList) in c:\myAppDev\myApp\site\App_Code\CrawlerLibs\Crawler.cs:line 105
   at CSharpCrawler.Crawler.CrawlSiteBuildIndex(String hostUrl, String urlToBeginSearchFrom, List`1 threadCityList) in c:\myAppDev\myApp\site\App_Code\CrawlerLibs\Crawler.cs:line 89
   at crawler_Default.threadedCrawlSiteBuildIndex(Object threadedCrawlerObj) in c:\myAppDev\myApp\site\crawler\Default.aspx.cs:line 108
   at System.Threading.QueueUserWorkItemCallback.WaitCallback_Context(Object state)
   at System.Threading.ExecutionContext.runTryCode(Object userData)
   at System.Runtime.CompilerServices.RuntimeHelpers.ExecuteCodeWithGuaranteedCleanup(TryCode code, CleanupCode backoutCode, Object userData)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean ignoreSyncCtx)
   at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading._ThreadPoolWaitCallback.PerformWaitCallback()

Thanks and cheers, Leon.

【Comments】:

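Maybe it is not finding the URL you pass? Have you checked it? If you pass an address that cannot be located, it waits to connect until it times out.
Hi Aristos, yes, I have confirmed that the pages I pass are "browsable". However, as I said, while the crawler is running against its own server space, I also become unable to access any site on that server; I get a "403 too many users" message. That said, the problem does not reproduce when I run the crawler against another server instance rather than its own.
Hi @Leon, I am facing the same issue. Did you find a solution?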

Tags: c# asp.net web-crawler httpwebresponse

【Solution 1】:

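How many concurrent requests is your crawler making? You could easily be starving the thread pool, especially if the crawler runs inside the website's own code.

Each request you make like this will use 2 threads from the pool: one to process the request and another to wait for the response.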
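
As a rough illustration of keeping the number of in-flight requests small, here is a minimal sketch that wraps the asker's GetWebText in a semaphore and sets explicit timeouts. The class name ThrottledFetcher, the CreateRequest helper, the cap of 4 and the 15-second timeout are all assumptions made for this example, not values from the question, and this is not a confirmed fix for the timeout.

```csharp
using System;
using System.IO;
using System.Net;
using System.Threading;

public static class ThrottledFetcher
{
    // Hypothetical cap on simultaneous outgoing requests; tune it for your server.
    private const int MaxConcurrentRequests = 4;

    // Each fetch takes one slot and gives it back when it finishes.
    private static readonly Semaphore Gate =
        new Semaphore(MaxConcurrentRequests, MaxConcurrentRequests);

    public static HttpWebRequest CreateRequest(string url, int timeoutMs)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = "My Crawler";
        request.Timeout = timeoutMs;          // upper bound for GetResponse()
        request.ReadWriteTimeout = timeoutMs; // upper bound for reads on the response stream
        return request;
    }

    public static string GetWebText(string url)
    {
        Gate.WaitOne(); // block until one of the slots is free
        try
        {
            HttpWebRequest request = CreateRequest(url, 15000); // assumed 15 s timeout
            using (WebResponse response = request.GetResponse())
            using (Stream stream = response.GetResponseStream())
            using (StreamReader reader = new StreamReader(stream))
            {
                return reader.ReadToEnd();
            }
        }
        finally
        {
            Gate.Release(); // always return the slot, even when the request throws
        }
    }
}
```

With this in place the crawler threads would call ThrottledFetcher.GetWebText(url) instead of the original method. The point of the cap is that a crawler hosted inside the site it crawls competes with that site's own request handling, so fewer simultaneous requests leave more capacity for the server to answer them.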

【Discussion】:

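I have now gone ahead and tried raising the maximum number of threads and disposing of everything I use, including the HttpWebResponse. No luck. I cannot understand why my crawler can crawl the very same site when it runs under the Visual Studio debugger and targets the IIS instance, yet cannot crawl "itself", that is, crawl the debug server instance when run from the debugger, or the IIS instance when the crawler is run from IIS.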
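
One way to narrow this down might be to record how each request fails, since a timeout and a 403 response point in different directions. The sketch below is only a diagnostic aid under that assumption; FetchDiagnostics and Describe are hypothetical names, not part of the asker's code.

```csharp
using System;
using System.Net;

public static class FetchDiagnostics
{
    // Turns a WebException into a short line suitable for a log file.
    public static string Describe(string url, WebException ex)
    {
        // For ProtocolError the server did answer, so report its HTTP status code.
        HttpWebResponse http = ex.Response as HttpWebResponse;
        if (ex.Status == WebExceptionStatus.ProtocolError && http != null)
        {
            return url + ": server answered HTTP " + (int)http.StatusCode;
        }

        // Otherwise report the transport-level status, e.g. Timeout or ConnectFailure.
        return url + ": " + ex.Status;
    }
}
```

It would be called from a catch (WebException ex) block around request.GetResponse(), logging the returned line for each failed URL.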