Can someone please explain this bit of HtmlAgilityPack code?

【Question title】: Can someone please explain this bit of HtmlAgilityPack code?
【Posted】: 2010-12-21 15:20:10
【Question】:

I've done my best to add comments throughout the code, but I'm a bit stuck on some parts.

// create a new instance of the HtmlDocument class called doc
HtmlDocument doc = new HtmlDocument();

// the Load method is called here to load the variable result, which is HTML
// formatted into a string in a previous code snippet
doc.Load(new StringReader(result));

// a new variable called root with datatype HtmlNode is created here.
// I'm not sure what doc.DocumentNode refers to?
HtmlNode root = doc.DocumentNode;

// a list is getting constructed here. I haven't had much experience
// with constructing lists yet
List<string> anchorTags = new List<string>();

// a foreach loop is used to loop through the html document to
// extract html with 'a' attributes I think..
foreach (HtmlNode link in root.SelectNodes("//a"))
{
    // don't really know what's going on here
    string att = link.OuterHtml;
    // don't really know what's going on here too
    anchorTags.Add(att);
}

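I took this code sample from here. Thanks to Farooq Kaiser.

【Question comments】:

I've never used this library and I'm just taking a shot in the dark here, but I'd assume doc.DocumentNode is the document's current node, which after loading would be the root node.

Tags: c# html-agility-pack web-scraping

【Solution 1】:

In HTML Agility Pack terms, "//a" means "find all tags named 'a' or 'A' anywhere in the document". For more general help with XPath, see the XPath documentation (it is independent of HTML Agility Pack). So, if your document looks like this: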

<div>
  <A href="xxx">anchor 1</a>
  <table ...>
    <a href="zzz">anchor 2</A>
  </table>
</div>

you will get two anchor HTML elements. OuterHtml is the node's HTML including the node itself, while InnerHtml is only the HTML content inside the node. So here, the two OuterHtml values are:

  <A href="xxx">anchor 1</a>

and

<a href="zzz">anchor 2</A>

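Note that I said 'a' or 'A' because the HAP implementation takes care of HTML's case insensitivity for tag names. That also means "//A" does not work by default: you need to write the tag name in lowercase in the XPath expression.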
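
As a minimal sketch of the OuterHtml/InnerHtml difference, assuming the HtmlAgilityPack NuGet package is referenced. The class name and the `html` string are made up for illustration, and the markup is a simplified version of the sample above:

```csharp
using System;
using HtmlAgilityPack;

class OuterVsInnerDemo
{
    static void Main()
    {
        // Hypothetical input, simplified from the sample document above.
        string html = "<div><A href=\"xxx\">anchor 1</a><a href=\"zzz\">anchor 2</A></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse directly from a string

        // Lowercase tag name in the XPath, as recommended above.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (links == null)
        {
            Console.WriteLine("No anchors found.");
            return;
        }

        foreach (HtmlNode link in links)
        {
            Console.WriteLine("Outer: " + link.OuterHtml); // the <a> element itself plus its content
            Console.WriteLine("Inner: " + link.InnerHtml); // only what sits between the tags
        }
    }
}
```

The null check matters for the loop in the question as well: if the page contains no anchors, iterating over the result of SelectNodes directly would throw a NullReferenceException.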

【Comments】:

Could you update your answer to show an example of InnerHtml?
InnerHtml is used the same way as OuterHtml, for example: string AttContent = link.InnerHtml;
What would InnerHtml refer to, if OuterHtml refers to abc?
In that case, InnerHtml would be "abc".

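【Solution 2】:

The key is the SelectNodes method. That part uses XPath to get a list of the nodes in the HTML that match your query.

This is where I learned XPath: http://www.w3schools.com/xpath/default.asp

Then it simply loops over those matching nodes, gets each one's OuterHtml (the full HTML, including the tag itself), and adds it to a list of strings. A List is basically just an array, but more flexible. If you only want the contents and not the enclosing tags, you can use HtmlNode.InnerHtml or HtmlNode.InnerText.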
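
To illustrate the List and InnerText points, here is a small sketch. It assumes the HtmlAgilityPack NuGet package is referenced; the class name and the `html` string are invented for the example:

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class CollectLinkTextDemo
{
    static void Main()
    {
        // Hypothetical input with nested markup inside the first link.
        string html = "<p><a href=\"/one\">first <b>link</b></a> <a href=\"/two\">second</a></p>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Unlike an array, a List<string> needs no size up front; it grows on Add.
        var texts = new List<string>();

        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a");
        if (links != null) // SelectNodes returns null when there are no matches
        {
            foreach (HtmlNode link in links)
            {
                // InnerText keeps only the text; InnerHtml would keep the nested <b> tag too.
                texts.Add(link.InnerText);
            }
        }

        Console.WriteLine(texts.Count + " link(s) collected");
        foreach (string text in texts)
        {
            Console.WriteLine(text);
        }
    }
}
```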

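【Comments】:

+1. For those who find XPath hard to follow, you can use Elements()/Descendants() and then query everything with the standard LINQ to XML XElement-style syntax.
@Kirk Woll I didn't know that. I'd rather use LINQ. Every time I use XPath I have to look at the cheat sheet again.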
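
A rough sketch of the LINQ route mentioned in these comments, again assuming the HtmlAgilityPack NuGet package is referenced. The class name and input string are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

class LinqInsteadOfXPathDemo
{
    static void Main()
    {
        // Hypothetical input: one anchor with an href, one without.
        string html = "<div><a href=\"xxx\">anchor 1</a><a>no href</a></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Descendants("a") walks every <a> element below the document node;
        // the rest is ordinary LINQ to Objects.
        List<string> hrefs = doc.DocumentNode
            .Descendants("a")
            .Select(a => a.GetAttributeValue("href", "")) // "" when the attribute is missing
            .Where(href => href.Length > 0)
            .ToList();

        foreach (string href in hrefs)
        {
            Console.WriteLine(href);
        }
    }
}
```

One practical difference from SelectNodes: Descendants is an enumerable, so a page with no anchors simply produces no items rather than a null result.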