【Posted】:2013-12-10 22:05:50
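【Question】:
I'm new to C#, so this may be either obvious or too advanced for me, but I'm trying to scrape a web page with HtmlAgilityPack. My code currently compiles, but when I write out the string I get only one result, and it happens to be the last li in the ul. I split the string so that I can eventually output the title and description to a .csv for further use. I'm not sure what to do next, so I'd appreciate any help, explanation, ideas, or suggestions. Thanks!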
private void button1_Click(object sender, EventArgs e)
{
    List<string> cities = new List<string>();
    //var xpath = "//h2[span/@id='Cities']";
    // Select every <li> of the first <ul> that follows the "Cities" heading
    var xpath = "//h2[span/@id='Cities']" + "/following-sibling::ul[1]" + "/li";
    WebClient web = new WebClient();
    String html = web.DownloadString("http://wikitravel.org/en/Vietnam");
    hap.HtmlDocument doc = new hap.HtmlDocument();
    doc.LoadHtml(html);
    foreach (hap.HtmlNode node in doc.DocumentNode.SelectNodes(xpath))
    {
        string all = node.InnerText;
        // splits the text on every '—', '-' or ' '
        string[] split = all.Split(new char[] { '—', ' ', '-' }, StringSplitOptions.None);
        string title;
        string description;
        int nodeCount;
        nodeCount = node.ChildNodes.Count;
        if (nodeCount == 2)
        {
            title = node.ChildNodes[0].InnerText;
            description = node.ChildNodes[1].InnerText;
        }
        else if (nodeCount == 4)
        {
            title = node.ChildNodes[0].InnerText;
            description = node.ChildNodes[1].InnerText + node.ChildNodes[2].InnerText;
        }
        else
        {
            title = "Error";
            description = "The node count was not 2 or 4. Check the div section.";
        }
        System.IO.StreamWriter write = new System.IO.StreamWriter(@"C:\Users\cbrannin\Desktop\textTest\testText.txt");
        write.WriteLine(all);
        write.Close();
    }
}
}
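One plausible reading of the symptom, offered as a sketch rather than a confirmed diagnosis: the StreamWriter is constructed inside the foreach, and the single-argument StreamWriter(string) constructor overwrites an existing file, so each iteration replaces what the previous one wrote and only the last li survives. The example below opens the file once and writes one row per item. The names CityExport, SplitTitle, Quote and WriteCsv are hypothetical, and the "Title — description" layout of each list item is an assumption about the page, not something the code above confirms.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class CityExport
{
    // Hypothetical helper: splits "Title — description" at the first
    // " — " or " - " separator (assumed layout of each list item).
    public static string[] SplitTitle(string itemText)
    {
        string[] parts = itemText.Split(new[] { " — ", " - " }, 2, StringSplitOptions.None);
        string title = parts[0].Trim();
        string description = parts.Length > 1 ? parts[1].Trim() : "";
        return new[] { title, description };
    }

    // Wraps a field in double quotes for CSV, doubling any embedded quotes.
    public static string Quote(string field)
    {
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Opens the file once, outside the loop, and writes one CSV row per item.
    public static void WriteCsv(string path, IEnumerable<string> itemTexts)
    {
        using (var writer = new StreamWriter(path))
        {
            foreach (string text in itemTexts)
            {
                string[] parts = SplitTitle(text);
                writer.WriteLine(Quote(parts[0]) + "," + Quote(parts[1]));
            }
        }
    }
}
```

In the click handler, the InnerText of each selected node could be added to the cities list inside the foreach, and CityExport.WriteCsv called once after the loop. If the per-iteration writer is kept instead, passing true as the second constructor argument, new StreamWriter(path, true), appends rather than overwrites.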
【Discussion】:
Tags: c# .net visual-studio-2012 web-scraping html-agility-pack