Automatically Extracting and Analyzing Information from Web Pages

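This article is based on 摩诘's blog.
Today I ran into the following problem: given a key value, KeyData, extract the related data from a government website.
The problem can be solved in three parts:
1) Work out how the browser interacts with the government website;
2) Use HttpWebResponse, with the appropriate request method, to fetch the data;
3) Analyze the data that comes back.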

Part 1: Capture the browser's interaction with the website, using the tool ieHTTPHeadersSetup.exe
The captured request looks like this:
GET /search.asp?key=2006002995&ys_type=hy&imageField2.x=32&imageField2.y=20 HTTP/1.1
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*
Accept-Language: zh-cn
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)
Host: www.suzhou-logistics.com
Connection: Keep-Alive

From this we can see:
URL: http://www.suzhou-logistics.com/search.asp
Data: key=2006002995&ys_type=hy&imageField2.x=32&imageField2.y=20

The two can also be combined into a single URL: http://www.suzhou-logistics.com/search.asp?key=2006002995&ys_type=hy&imageField2.x=32&imageField2.y=20
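As an illustration of how the path and the data combine into one GET URL, here is a minimal sketch. The class name QueryUrl and the method Build are hypothetical and not part of the original post. It uses generics, so it assumes a newer .NET than the VS2003 setup mentioned later.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper (not from the original post).
public static class QueryUrl
{
    // Appends name/value pairs to baseUrl as a query string,
    // percent-encoding every name and value.
    public static string Build(string baseUrl, IEnumerable<KeyValuePair<string, string>> parameters)
    {
        StringBuilder sb = new StringBuilder(baseUrl);
        char separator = '?';
        foreach (KeyValuePair<string, string> p in parameters)
        {
            sb.Append(separator)
              .Append(Uri.EscapeDataString(p.Key))
              .Append('=')
              .Append(Uri.EscapeDataString(p.Value));
            separator = '&'; // every pair after the first is joined with '&'
        }
        return sb.ToString();
    }
}
```

Calling Build with http://www.suzhou-logistics.com/search.asp and the four pairs key, ys_type, imageField2.x and imageField2.y from the capture should reproduce the combined URL, since none of those names or values contain characters that need escaping.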

Part 2: Based on the analysis in Part 1, fetch the HTML through HttpWebResponse
Here is a general-purpose function for doing so:

public static string GetPage(string url, string postData, string encodeType, out string err)
{
    // body not preserved in the source
}
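Since only the signature of GetPage is available, the following is a minimal sketch of what such a function might look like, not the author's implementation. It assumes that an empty postData means a plain GET, that encodeType is an encoding name such as "utf-8" or "gb2312", and that failures are reported through err with a null return value. The class name PageFetcher is hypothetical. On current .NET, HttpClient is the recommended API and code-page encodings such as gb2312 may need to be registered first; the sketch keeps HttpWebRequest to match the article.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

// Hypothetical wrapper class (not from the original post).
public static class PageFetcher
{
    // Returns the response body as text, or null with err set on any failure.
    // Assumption: null or empty postData means GET; otherwise a form POST is sent.
    public static string GetPage(string url, string postData, string encodeType, out string err)
    {
        err = null;
        try
        {
            Encoding encoding = Encoding.GetEncoding(encodeType);
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)";

            if (!string.IsNullOrEmpty(postData))
            {
                byte[] body = encoding.GetBytes(postData);
                request.Method = "POST";
                request.ContentType = "application/x-www-form-urlencoded";
                request.ContentLength = body.Length;
                using (Stream requestStream = request.GetRequestStream())
                {
                    requestStream.Write(body, 0, body.Length);
                }
            }

            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream(), encoding))
            {
                return reader.ReadToEnd();
            }
        }
        catch (Exception ex)
        {
            // Bad URL, unknown encoding, network error, HTTP error status, ...
            err = ex.Message;
            return null;
        }
    }
}
```

For the request captured in Part 1, the combined URL would be passed as url with postData left empty, because the site is queried with GET.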

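Part 3: Analyze the HTML data. Two open-source libraries are available:
SgmlReader and HtmlAgilityPack20. My machine only has VS2003, so I cannot use HtmlAgilityPack20, which targets VS2005; the analysis below therefore uses SgmlReader. SgmlReader parses HTML into well-formed, XML-like data, which can then be queried with XPath to extract the data we want.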
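A minimal sketch of that conversion step, assuming the SgmlReader library (namespace Sgml) is referenced by the project. The class name HtmlToXml is hypothetical; DocType, CaseFolding and InputStream are SgmlReader properties as I understand its API, so check them against the version you have.

```csharp
using System.IO;
using System.Xml;
using Sgml; // third-party SgmlReader library, not part of the .NET class library

// Hypothetical helper (not from the original post).
public static class HtmlToXml
{
    // Parses loosely formed HTML text into an XmlDocument via SgmlReader.
    public static XmlDocument Load(string html)
    {
        SgmlReader sgml = new SgmlReader();
        sgml.DocType = "HTML";                  // use the built-in HTML DTD
        sgml.CaseFolding = CaseFolding.ToLower; // normalize tag names to lower case
        sgml.InputStream = new StringReader(html);

        XmlDocument doc = new XmlDocument();
        doc.Load(sgml); // SgmlReader is an XmlReader, so it can be loaded directly
        return doc;
    }
}
```

Because tag names are folded to lower case, XPath expressions against the result can be written in lower case, for example doc.SelectNodes("//table//tr").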
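How the resulting XML is analyzed depends on the format of the page that comes back. For the page I fetched, I mainly used two DataTables: one holds a single row of basic data, and the other holds multiple rows of status data.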

public static DataSet ParsePage(string pageContent, string xclpath, string xrpath, out string err)
{
    // body not preserved in the source
}
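To make the two-DataTable idea concrete, here is a minimal sketch that is not the author's ParsePage. It assumes the HTML has already been converted to an XmlDocument, and that one XPath selects the table row holding the basic data and another selects the status rows. The names TableExtractor, RowsToTable, Parse, basicXPath and statusXPath are all hypothetical.

```csharp
using System.Data;
using System.Xml;

// Hypothetical helper (not from the original post).
public static class TableExtractor
{
    // Copies the <td> cells of every row matched by rowXPath into a DataTable
    // whose columns are all strings, named col0, col1, ...
    public static DataTable RowsToTable(XmlDocument doc, string rowXPath, string tableName)
    {
        DataTable table = new DataTable(tableName);
        foreach (XmlNode row in doc.SelectNodes(rowXPath))
        {
            XmlNodeList cells = row.SelectNodes("td");
            if (cells.Count == 0)
            {
                continue; // skip rows without data cells, e.g. header rows
            }
            // Grow the column set so the widest row seen so far fits.
            while (table.Columns.Count < cells.Count)
            {
                table.Columns.Add("col" + table.Columns.Count, typeof(string));
            }
            DataRow dataRow = table.NewRow();
            for (int i = 0; i < cells.Count; i++)
            {
                dataRow[i] = cells[i].InnerText.Trim();
            }
            table.Rows.Add(dataRow);
        }
        return table;
    }

    // Builds a DataSet with a "basic" table and a "status" table.
    public static DataSet Parse(XmlDocument doc, string basicXPath, string statusXPath)
    {
        DataSet result = new DataSet();
        result.Tables.Add(RowsToTable(doc, basicXPath, "basic"));
        result.Tables.Add(RowsToTable(doc, statusXPath, "status"));
        return result;
    }
}
```

The XPath expressions depend entirely on the layout of the target page, so they have to be worked out by inspecting the converted XML.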

With the code above in place, it can be called as follows:

private void Button1_Click(object sender, System.EventArgs e)
{
    // body not preserved in the source
}

In fact, SgmlReader can fetch the data directly from a URL, which merges Part 2 and Part 3 into a single step.

string SgmlReaderTest(Uri baseUri, string url, TextWriter log, bool upper, bool formatted)
{
    // body not preserved in the source
}
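A minimal sketch of the merged approach, assuming the SgmlReader library is referenced and that its Href property behaves as I understand it, that is, the reader downloads the document itself once reading begins. The class name SgmlUrlLoader and its methods are hypothetical and are not a reconstruction of SgmlReaderTest.

```csharp
using System.Xml;
using Sgml; // third-party SgmlReader library, not part of the .NET class library

// Hypothetical helper (not from the original post).
public static class SgmlUrlLoader
{
    // Configures a reader that pulls its input from the given URL.
    public static SgmlReader CreateReader(string url)
    {
        SgmlReader reader = new SgmlReader();
        reader.DocType = "HTML";
        reader.CaseFolding = CaseFolding.ToLower;
        reader.Href = url; // SgmlReader fetches this location itself
        return reader;
    }

    // Downloads and parses the page in one step.
    public static XmlDocument LoadFromUrl(string url)
    {
        XmlDocument doc = new XmlDocument();
        doc.Load(CreateReader(url));
        return doc;
    }
}
```

This route gives up the control over headers, POST data and encoding that a separate fetch step provides, so it fits best when a plain GET is enough, as with the search URL in Part 1.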