C# grab URLs using htmlagility: answers

【Question title】: C# grab urls using htmlagility
【Posted】: 2012-08-15 00:57:40
【Question】:

OK, this web page has a list of URLs, and I'd like to know how to grab those URLs and add them to an ArrayList.

http://www.animenewsnetwork.com/encyclopedia/anime.php?list=A

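I only want the URLs in the list; look at the page to see what I mean. I tried it myself, and for whatever reason it picks up every other URL on the page except the ones I need.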

   http://pastebin.com/a7hJnXPP

【Comments】:

Tags: c# html url html-agility-pack

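【Solution 1】:

If you only want the ones in the list, the following code should work (assuming you have already loaded the page into an HtmlDocument, here called animePage):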

// Needs: using System.Collections.Generic; using HtmlAgilityPack;
List<string> hrefList = new List<string>(); // Make a list, because lists are cool.

foreach (HtmlNode node in animePage.DocumentNode.SelectNodes("//a[contains(@href, 'id=')]"))
{
    // Prepend the site root to the relative href value and add it to the list.
    // "null" is the fallback string used if the href attribute is missing.
    hrefList.Add("http://www.animenewsnetwork.com" + node.GetAttributeValue("href", "null"));
}

The XPath //a[contains(@href, 'id=')] breaks down as follows:

//a selects all <a> nodes...
[contains(@href, 'id=')] ... that have an href attribute containing the text id=.

That should be enough to get you going.
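For completeness, here is a minimal sketch of the whole flow, from parsing to the finished list. It runs against a small inline HTML string rather than the live site, so the markup in SampleHtml is a made-up stand-in and the real page may be structured differently. ExtractLinks and XPathLinkDemo are illustrative names, not part of Html Agility Pack. The one detail worth noting is that SelectNodes returns null rather than an empty collection when nothing matches, so the sketch checks for that before looping.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class XPathLinkDemo
{
    // Hypothetical stand-in for the real page; the actual markup may differ.
    private const string SampleHtml =
        "<html><body>" +
        "<a href=\"/encyclopedia/anime.php?list=B\">B</a>" +
        "<div class=\"lst\">" +
        "<a href=\"/encyclopedia/anime.php?id=1\">First title</a>" +
        "<a href=\"/encyclopedia/anime.php?id=2\">Second title</a>" +
        "</div></body></html>";

    // Returns absolute URLs for every <a> whose href contains "id=".
    public static List<string> ExtractLinks(string html, string baseUrl)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes =
            doc.DocumentNode.SelectNodes("//a[contains(@href, 'id=')]");
        if (nodes == null)
        {
            return result;
        }

        foreach (HtmlNode node in nodes)
        {
            // Use an empty fallback so a missing href is skipped instead of
            // being appended to the base URL.
            string href = node.GetAttributeValue("href", string.Empty);
            if (href.Length > 0)
            {
                result.Add(baseUrl + href);
            }
        }
        return result;
    }

    public static void Main()
    {
        foreach (string url in ExtractLinks(SampleHtml, "http://www.animenewsnetwork.com"))
        {
            Console.WriteLine(url);
        }
    }
}
```

In the sample, the list=B navigation link is skipped because its href does not contain id=, which is the same filtering the XPath is meant to do on the real page.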

By the way, since there are roughly 500 links on that page, I'd advise against showing each link in its own message box. 500 links = 500 message boxes :(
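One way around the message box flood, sketched under the assumption that the links are already collected in a list like hrefList: join them into a single string and display or save that once. BuildReport and LinkReport are hypothetical names used only for this illustration, and the two URLs in Main are placeholder values.

```csharp
using System;
using System.Collections.Generic;

public static class LinkReport
{
    // Joins all links into one newline-separated string.
    public static string BuildReport(IEnumerable<string> links)
    {
        return string.Join(Environment.NewLine, links);
    }

    public static void Main()
    {
        // Placeholder data standing in for the scraped hrefList.
        var hrefList = new List<string>
        {
            "http://www.animenewsnetwork.com/encyclopedia/anime.php?id=1",
            "http://www.animenewsnetwork.com/encyclopedia/anime.php?id=2"
        };

        string report = BuildReport(hrefList);

        // Printed once here; in a Windows Forms app the same string could be
        // assigned to a multiline TextBox or passed to a single MessageBox.Show.
        Console.WriteLine(report);
    }
}
```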

【Comments】:

【Solution 2】:

Using Html Agility Pack:

// Needs: using System.Linq; using System.Net;
using (var wc = new WebClient())
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    // Download the page and parse it.
    doc.LoadHtml(wc.DownloadString("http://www.animenewsnetwork.com/encyclopedia/anime.php?list=A"));
    // Take only the anchors inside the <div class="lst"> container.
    var links = doc.DocumentNode.SelectSingleNode("//div[@class='lst']")
        .Descendants("a")
        .Select(x => x.Attributes["href"].Value)
        .ToArray();
}

【Comments】:

How do you know what to put in the SelectSingleNode or SelectNodes call? Why //div[@class='1st']? What led you to that?
I opened the page in Chrome and inspected it. PS: it's lst, not 1st.
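The lst versus 1st mix-up is easy to make, and it matters because SelectSingleNode returns null when the XPath matches nothing, so chaining .Descendants(...) straight onto it would throw a NullReferenceException. Below is a rough sketch of a guarded version. It uses an inline HTML string that is only assumed to resemble the page, and LinksInside and ContainerLinkDemo are illustrative names rather than library members.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class ContainerLinkDemo
{
    // Returns the href values of all <a> elements under the first node
    // matched by containerXPath, or an empty array if nothing matches.
    public static string[] LinksInside(string html, string containerXPath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when the XPath has no match.
        HtmlNode container = doc.DocumentNode.SelectSingleNode(containerXPath);
        if (container == null)
        {
            return new string[0];
        }

        return container.Descendants("a")
            // GetAttributeValue avoids a null dereference on anchors without href.
            .Select(a => a.GetAttributeValue("href", string.Empty))
            .Where(href => href.Length > 0)
            .ToArray();
    }

    public static void Main()
    {
        // Hypothetical markup; the real page may differ.
        string html =
            "<div class=\"lst\">" +
            "<a href=\"/encyclopedia/anime.php?id=1\">First</a>" +
            "<a name=\"anchor-without-href\">Skip me</a>" +
            "</div>";

        Console.WriteLine(LinksInside(html, "//div[@class='lst']").Length); // class spelled with the letter l
        Console.WriteLine(LinksInside(html, "//div[@class='1st']").Length); // typo with the digit 1: no match
    }
}
```

If the second call reports zero links while the page clearly has some, the class name in the XPath is the first thing to recheck in the browser's inspector.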