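Access the Contents of a Web Page with C#: Answers

【Question title】: Access the Contents of a Web Page with C#
【Posted】: 2010-11-10 16:14:37
【Question description】:

I am trying to access the contents of a web page with C#. For example, I want to grab the text of the body of the Google home page.

I know this is possible in C# through its WebBrowser control, but I couldn't find a good, simple example. All the resources I found online involve creating Forms and a GUI, which I don't need; I just need a good old console application.

If anyone can provide a simple console-based code snippet that does the above, it would be greatly appreciated.

【Question discussion】:

【Solution 1】:

Actually, WebBrowser is a GUI control, used when you want to display a web page (it embeds and manages Internet Explorer in your Windows application). If you only need to fetch the contents of a web page, you can use the WebClient class: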

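using System;
using System.Net;

class Program
{
    static void Main(string[] args)
    {
        // WebClient is disposable, so wrap it in a using block.
        using (var client = new WebClient())
        {
            // Download the raw HTML of the page as a string.
            var contents = client.DownloadString("http://www.google.com");
            Console.WriteLine(contents);
        }
    }
}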

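【Discussion】:

This won't work if the site is generated dynamically with JavaScript (i.e., if the HTML source is just .js files), right?
@Saobi, you're right, JavaScript is not executed with this technique. You only get the plain-text representation of the web page.
I basically want to send a query to a website and get the returned results, but the site is written entirely in JavaScript, so parsing the HTML source as one would for Google doesn't help. How can I: 1) send the query without knowing what the request URL is, and 2) parse the content of the JavaScript-generated page? Do I have to simulate keystrokes and send them?
JavaScript or not, I still think this is the right approach. If that means you need to understand how the JavaScript works so that you can do the conversion yourself, then so be it.
@Darin: What about dynamically generated elements? Any ideas?

【Solution 2】:

If you only want the content and not an actual browser, you can use HttpWebRequest.

Here is a code example: http://www.c-sharpcorner.com/Forums/ShowMessages.aspx?ThreadID=58261

【Discussion】:

But this does not give the complete content, meaning the content produced by JavaScript.

【Solution 3】:

You can do something like this: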

// Requires: using System; using System.IO; using System.Net;
Uri u = new Uri(@"http://launcher.worldofwarcraft.com/alert");
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(u);
// Dispose the response and the reader once the body has been read.
using (HttpWebResponse res = (HttpWebResponse)req.GetResponse())
using (StreamReader sr = new StreamReader(res.GetResponseStream()))
{
    string body = sr.ReadToEnd();
    Console.WriteLine(body);
}

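The code above displays the maintenance message for World of Warcraft US (if any message has been posted).

【Discussion】:

【Solution 4】:

You can also use the WatiN library to load and manipulate web pages easily. It was designed as a testing library for web UIs. To use it, get the latest version from the official site, http://watin.sourceforge.net/. For C#, the following code in a console application will give you the HTML of the Google home page (it is adapted from the getting-started example on the WatiN site). The library also contains many more useful methods for getting and setting parts of the page, performing actions, and checking results.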

    using System;
    using WatiN.Core;

    namespace Test
    {
      class WatiNConsoleExample
      {
        [STAThread]
        static void Main(string[] args)
        {
          // Open a new Internet Explorer window and
          // go to the Google website.
          IE ie = new IE("http://www.google.com");

          // Write out the text of the body
          Console.WriteLine(ie.Text);

          // Close Internet Explorer.
          ie.Close();

          // Keep the console window open until a key is pressed.
          Console.ReadKey();
        }
      }
    }

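【Discussion】:

【Solution 5】:

Ten years on, Microsoft no longer recommends WebClient, which the originally accepted answer uses, for new development. The current recommendation is to use HttpClient from the System.Net.Http namespace.

The current example from https://docs.microsoft.com/en-us/dotnet/api/system.net.http.httpclient?view=netcore-3.1 is: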

// Requires: using System; using System.Net.Http; using System.Threading.Tasks;
// HttpClient is intended to be instantiated once per application, rather than per-use. See Remarks.
static readonly HttpClient client = new HttpClient();

static async Task Main()
{
  // Call asynchronous network methods in a try/catch block to handle exceptions.
  try   
  {
     HttpResponseMessage response = await client.GetAsync("http://www.contoso.com/");
     response.EnsureSuccessStatusCode();
     string responseBody = await response.Content.ReadAsStringAsync();
     // Above three lines can be replaced with new helper method below
     // string responseBody = await client.GetStringAsync(uri);

     Console.WriteLine(responseBody);
  }
  catch(HttpRequestException e)
  {
     Console.WriteLine("\nException Caught!");  
     Console.WriteLine("Message :{0} ",e.Message);
  }
}

【Discussion】:

【Solution 6】:

HTML Agility Pack may be what you need. It provides access to an HTML page through a DOM and XPath.
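
As a rough illustration of the DOM and XPath access mentioned above, here is a minimal sketch. It assumes the third-party HtmlAgilityPack NuGet package is installed. The PageParser class, its two helper methods, and the sample HTML string are hypothetical names made up for this example; in practice the HTML would be the string returned by WebClient or HttpClient.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package "HtmlAgilityPack"

// Hypothetical helper class, not part of the library.
static class PageParser
{
    // Returns the text content of the <body> element, or "" if there is none.
    public static string BodyText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
        return body == null ? "" : body.InnerText;
    }

    // Returns the href value of every <a> element that has one.
    public static List<string> LinkTargets(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var targets = new List<string>();
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors != null)
        {
            foreach (HtmlNode a in anchors)
            {
                targets.Add(a.GetAttributeValue("href", ""));
            }
        }
        return targets;
    }
}

class Program
{
    static void Main()
    {
        // Illustrative HTML only; substitute the downloaded page source.
        string html = "<html><body><h1>Hello</h1><p>See <a href=\"/about\">about</a>.</p></body></html>";

        Console.WriteLine(PageParser.BodyText(html));
        foreach (string href in PageParser.LinkTargets(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

Like the other download-based approaches in this thread, this only parses the HTML it is given; it does not execute JavaScript, so dynamically generated elements will not appear in the parsed document.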

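【Discussion】:

【Solution 7】:

Search for "screen scraping" and, as mentioned above, use HttpWebRequest. While you are doing this, I recommend using Fiddler to help you figure out what is actually going on.

【Discussion】: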