【Question title】: Cannot get HTML of page
【Posted】: 2012-04-15 17:15:46
【Question】:

I want to fetch the HTML of the following page using HttpWebRequest:

http://inkdispatch.com/brother

This is what I am currently using:

 public static string getHTML(string url)
    {
        string responseData = "";
        try
        {
            //    System.Threading.Thread.Sleep(1000 * 1);
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            request.Accept = "application/x-ms-application, image/jpeg, application/xaml+xml, image/gif, image/pjpeg, application/x-ms-xbap, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*";
            request.UserAgent = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)";
            request.Timeout = 60000;
            request.AllowAutoRedirect = false;
            request.Method = "GET";
            request.Referer = "inkdispatch.com";
            request.CookieContainer = yummycookies;
            request.KeepAlive = true;

            HttpWebResponse response = (HttpWebResponse)request.GetResponse();
            if (response.StatusCode == HttpStatusCode.OK)
            {
                Stream responseStream = response.GetResponseStream();
                StreamReader myStreamReader = new StreamReader(responseStream);
                responseData = myStreamReader.ReadToEnd();
            }
            foreach (Cookie cook in response.Cookies)
            {
                yummycookies.Add(cook);
            }
            response.Close();
        }
        catch (Exception e)
        {
            responseData = "An error occurred: " + e.Message;
        }

        return responseData;

    }

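But I don't get any response body. There is no error; the status just says "Moved Permanently". The same link works when I open it in a browser. The link has a token appended to it, which I do obtain from the home page, but the problem stays the same. Any help is appreciated.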

Update

I just set:

 request.AllowAutoRedirect = true;

But now I get this error:

    Too many automatic redirections were attempted.
   at System.Net.HttpWebRequest.GetResponse()
   at inkdispatchcomScraper.Program.getHTML(String url) 

I opened Fiddler, and it shows the same link being requested over and over:

    #   Result  Protocol  Host             URL                                              Body  Caching                                                                                                 Content-Type  Process
    72  301     HTTP      inkdispatch.com  /brother?zenid=00810c6a184e63149cdca848c7f02871  0     no-store, no-cache, must-revalidate, post-check=0, pre-check=0; Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html     inkdispatchcomscraper.vshost:4612
    73  301     HTTP      inkdispatch.com  /brother?zenid=32cf6d38541a90658d39785b6cd64fbc  0     no-store, no-cache, must-revalidate, post-check=0, pre-check=0; Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html     inkdispatchcomscraper.vshost:4612
    74  301     HTTP      inkdispatch.com  /brother?zenid=70d0d5eaa10175d74933ba00d47876f8  0     no-store, no-cache, must-revalidate, post-check=0, pre-check=0; Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html     inkdispatchcomscraper.vshost:4612
    75  301     HTTP      inkdispatch.com  /brother?zenid=fa45c256a07a9450274269cfa4a4e64a  0     no-store, no-cache, must-revalidate, post-check=0, pre-check=0; Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html     inkdispatchcomscraper.vshost:4612
    76  301     HTTP      inkdispatch.com  /brother?zenid=1fb7677a7e6ae0ca32a154ebcc42e043  0     no-store, no-cache, must-revalidate, post-check=0, pre-check=0; Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html     inkdispatchcomscraper.vshost:4612
    77  301     HTTP      inkdispatch.com  /brother?zenid=39923f8100276b1c0fa5ccfb1f8d222c  0     no-store, no-cache, must-revalidate, post-check=0, pre-check=0; Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html     inkdispatchcomscraper.vshost:4612
    78  301     HTTP      inkdispatch.com  /brother?zenid=fef228719b375ac012c4755793a0027a  0     no-store, no-cache, must-revalidate, post-check=0, pre-check=0; Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html     inkdispatchcomscraper.vshost:4612
    79  301     HTTP      inkdispatch.com  /brother?zenid=5c2babf5e6b9b0834f605734441ba208  0     no-store, no-cache, must-revalidate, post-check=0, pre-check=0; Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html     inkdispatchcomscraper.vshost:4612
    80  301     HTTP      inkdispatch.com  /brother?zenid=711bdefa3ca7cccebf63b9b8a3734be1  0     no-store, no-cache, must-revalidate, post-check=0, pre-check=0; Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html     inkdispatchcomscraper.vshost:4612
    81  301     HTTP      inkdispatch.com  /brother?zenid=c55d1b6166994be1436c9473a1519abe  0     no-store, no-cache, must-revalidate, post-check=0, pre-check=0; Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html     inkdispatchcomscraper.vshost:4612
    83  301     HTTP      inkdispatch.com  /brother?zenid=cc66424548f23c3c64b2e0054289283f  0     no-store, no-cache, must-revalidate, post-check=0, pre-check=0; Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html     inkdispatchcomscraper.vshost:4612
    84  301     HTTP      inkdispatch.com  /brother?zenid=6f05f06093cd345d10ca729117994ac0  0     no-store, no-cache, must-revalidate, post-check=0, pre-check=0; Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html     inkdispatchcomscraper.vshost:4612
    85  301     HTTP      inkdispatch.com  /brother?zenid=4a2ab4d3824c4850f544f28cd71bc1bb  0     no-store, no-cache, must-revalidate, post-check=0, pre-check=0; Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html     inkdispatchcomscraper.vshost:4612
    86  301     HTTP      inkdispatch.com  /brother?zenid=6c9d0acd69fc22821014c7e3263da7b6  0     no-store, no-cache, must-revalidate, post-check=0, pre-check=0; Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html     inkdispatchcomscraper.vshost:4612
    87  301     HTTP      inkdispatch.com  /brother?zenid=fff05b8df3a1488add36591a2687a830  0     no-store, no-cache, must-revalidate, post-check=0, pre-check=0; Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html     inkdispatchcomscraper.vshost:4612
    88  301     HTTP      inkdispatch.com  /brother?zenid=b10facbe8bc9b9a355fe648649067f98  0     no-store, no-cache, must-revalidate, post-check=0, pre-check=0; Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html     inkdispatchcomscraper.vshost:4612
    89  301     HTTP      inkdispatch.com  /brother?zenid=8b767c98491178e54d12b4e85ff02b2e  0     no-store, no-cache, must-revalidate, post-check=0, pre-check=0; Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html     inkdispatchcomscraper.vshost:4612
    90  301     HTTP      inkdispatch.com  /brother?zenid=9f0b8cb119fee9a4e276bcae5f13772d  0     no-store, no-cache, must-revalidate, post-check=0, pre-check=0; Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html     inkdispatchcomscraper.vshost:4612
    91  301     HTTP      inkdispatch.com  /brother?zenid=943076fabf058eb1316cfa86aadb1dec  0     no-store, no-cache, must-revalidate, post-check=0, pre-check=0; Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html     inkdispatchcomscraper.vshost:4612
    92  301     HTTP      inkdispatch.com  /brother?zenid=8bd0335032a58b9c399706cd9c695901  0     no-store, no-cache, must-revalidate, post-check=0, pre-check=0; Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html     inkdispatchcomscraper.vshost:4612
    93  301     HTTP      inkdispatch.com  /brother?zenid=a1ba5e21f0af2750d398484e063e8303  0     no-store, no-cache, must-revalidate, post-check=0, pre-check=0; Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html     inkdispatchcomscraper.vshost:4612
    94  301     HTTP      inkdispatch.com  /brother?zenid=e704b2951b1d136c195fd02ad4abec93  0     no-store, no-cache, must-revalidate, post-check=0, pre-check=0; Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html     inkdispatchcomscraper.vshost:4612
    95  301     HTTP      inkdispatch.com  /brother?zenid=6d606d0785f19c17ccb1868577a9d546  0     no-store, no-cache, must-revalidate, post-check=0, pre-check=0; Expires: Thu, 19 Nov 1981 08:52:00 GMT  text/html     inkdispatchcomscraper.vshost:4612

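Another update

I noticed that when I open the page in IE, it redirects once to /brother. From my code, however, each request gets a new zenid and is redirected again, and this keeps happening.

【Question discussion】:

    Tags: c# .net exception-handling httpwebrequest web-scraping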


    【Solution 1】:

    Set request.AllowAutoRedirect = true;

    Edit

    For your second problem, declare yummycookies as shown below.

    public static string getHTML(string url)
    {
       CookieContainer yummycookies = new CookieContainer();
       ...
    }
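
To illustrate why the cookie container matters here, the following is a minimal sketch, not the site's confirmed behavior. It assumes the server tracks its session through a cookie named zenid (the name comes from the redirect URLs in the question; the placeholder value and the class and method names are hypothetical). It shows that a CookieContainer which has received a cookie for the host sends it back on later requests to that host, which is what lets a server stop issuing fresh session IDs.

```csharp
using System;
using System.Net;

static class SessionCookieDemo
{
    // One container for the whole process, so cookies survive between calls.
    static readonly CookieContainer Jar = new CookieContainer();

    // Stands in for a Set-Cookie header from the server.
    // The cookie value is a placeholder, not a real session ID.
    public static void SimulateServerCookie()
    {
        Jar.SetCookies(new Uri("http://inkdispatch.com/"), "zenid=placeholder; path=/");
    }

    // Returns the Cookie header that a request using Jar would send to the given URL.
    public static string CookieHeaderFor(string url)
    {
        return Jar.GetCookieHeader(new Uri(url));
    }
}
```

Calling SessionCookieDemo.CookieHeaderFor("http://inkdispatch.com/brother") before and after SimulateServerCookie() shows the difference between an empty container and one that already holds the session cookie. In real code the container is filled automatically once it is assigned to request.CookieContainer.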
    

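    【Discussion】:

    • But it works fine on my machine. (I only had to declare static CookieContainer yummycookies = new CookieContainer();)
    • Which link are you requesting? It works fine for inkdispatch.com, but not for any other link.
    • No other link? Are you sure? It works for me with http://inkdispatch.com/brother, http://www.google.com, http://stackoverflow.com and so on.
    • I meant that no other link on that site works. Every link other than the home page keeps requesting itself over and over until I get the error above. Only the home page returns HTML correctly. :( Any suggestions?
    • Could it be that they check the IP address to block certain countries? But then again, I do get the HTML in IE, so it is very confusing.

    【Solution 2】: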

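    When I first tested your code it failed, and in another test I got the error "Too many automatic redirections were attempted."

    After updating your code and testing again, it worked fine on the URL you provided and the HTML was fetched correctly. Here is the code.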

    public static string GetHtml2(string urlAddr)
    {
        if (string.IsNullOrEmpty(urlAddr))
        {
            throw new ArgumentNullException("urlAddr");
        }

        // 1. Create the request object.
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlAddr);
        //request.AllowAutoRedirect = true;
        //request.MaximumAutomaticRedirections = 200;
        request.Proxy = null;
        request.UseDefaultCredentials = true;

        // 2. Create the cookie container for this request.
        CookieContainer cc = new CookieContainer();

        // 3. A cookie container must be assigned; without one, cookies set
        //    by the server are not stored or sent back on redirects.
        request.CookieContainer = cc;

        // Dispose the response and the reader once the body has been read.
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader sr = new StreamReader(response.GetResponseStream()))
        {
            return sr.ReadToEnd();
        }
    }
    

    Hope this helps.
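
When a redirect loop like the one in the Fiddler trace needs to be inspected from code, one option is to follow the redirects by hand. The sketch below is a diagnostic aid under stated assumptions, not the fix used in this answer: the class name RedirectTracer, its methods and the maxHops parameter are hypothetical. It disables automatic redirects, reuses one cookie container for every hop, prints each hop, and gives up after a fixed number of redirects.

```csharp
using System;
using System.IO;
using System.Net;

static class RedirectTracer
{
    // A Location header may be absolute or relative; resolve it against the current URL.
    public static Uri ResolveNext(Uri current, string location)
    {
        return new Uri(current, location);
    }

    // Follows redirects manually so that every hop can be logged.
    public static string Fetch(string url, CookieContainer cookies, int maxHops)
    {
        Uri current = new Uri(url);
        for (int hop = 0; hop <= maxHops; hop++)
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(current);
            request.AllowAutoRedirect = false;  // handle 3xx responses ourselves
            request.CookieContainer = cookies;  // same container on every hop

            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                int status = (int)response.StatusCode;
                string location = response.Headers["Location"];
                if (status >= 300 && status < 400 && !string.IsNullOrEmpty(location))
                {
                    Console.WriteLine("{0} {1} -> {2}", status, current, location);
                    current = ResolveNext(current, location);
                    continue;
                }

                using (StreamReader reader = new StreamReader(response.GetResponseStream()))
                {
                    return reader.ReadToEnd();
                }
            }
        }
        throw new InvalidOperationException("Gave up after " + maxHops + " redirects.");
    }
}
```

A call such as RedirectTracer.Fetch("http://inkdispatch.com/brother", new CookieContainer(), 10) prints one line per redirect, which makes it easy to see whether the zenid value changes on each hop.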

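    【Discussion】:

    • Thank you. I had the same problem; I think it may be caused by some Internet settings on my side. It worked when I did this: System.Net.ServicePointManager.Expect100Continue = false; WebHeaderCollection myWebHeaderCollection = request.Headers; // Add the Accept-Language header (for Danish) to the request. myWebHeaderCollection.Add("Accept-Language:en-US"); myWebHeaderCollection.Add("Accept-Encoding", "gzip, deflate"); myWebHeaderCollection.Add("Cookie", "zenid=9ea4d211ba2aa64cbaa148df5de4ab10");
    • In most cases, reading the response directly will not work properly if you send "Accept-Encoding: gzip, deflate", especially when the target site has that encoding enabled.
    • Yes, I normally don't use it at all, but in this case I had to replicate everything before it would work.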
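
On the Accept-Encoding point above: adding that header by hand means the body may arrive compressed and ReadToEnd() would then return raw compressed bytes. A minimal sketch of one way around this is to let HttpWebRequest handle compression through its AutomaticDecompression property, which advertises gzip/deflate and decompresses the response stream transparently. The class and method names below are hypothetical.

```csharp
using System.IO;
using System.Net;

static class CompressedFetch
{
    // Builds a request that accepts gzip/deflate and decompresses the body automatically.
    public static HttpWebRequest CreateRequest(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
        return request;
    }

    // Reads the (already decompressed) response body as text.
    public static string ReadBody(HttpWebRequest request)
    {
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

With this in place there is no need to add the Accept-Encoding header manually through request.Headers.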