Get links inside a DIV (answers)

【Question title】: Get links inside a DIV
【Posted】: 2014-04-29 12:36:43
【Question】:

I want to be able to get the first link from this div.

    <div id="first-tweet-wrapper">
      <blockquote class="tweet" lang="en">
        <a href="http://link.com">                          <--- This one
          text    </a>
      </blockquote>
      <a href="http://link2.net" class="click-tracking" target="_blank"
         data-tracking-category="discover" data-tracking-action="tweet-the-tweet">
        Tweet it!  </a>
    </div>

I tried this code, but it doesn't work:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(source);

var div = doc.DocumentNode.SelectSingleNode("//div[@id='first-tweet-wrapper']");
if (div != null)
{
      var links = div.Descendants("a")
          .Select(a => a.InnerText)
          .ToList();
}

【Comments on the question】:

Unlike forum sites, we don't use "Thanks", "Any help appreciated", or signatures on Stack Overflow. See "Should 'Hi', 'thanks,' taglines, and salutations be removed from posts?".

Tags: c# .net xpath windows-phone-8 html-agility-pack

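【Solution 1】:

You need to use HtmlAgilityPack's GetAttributeValue method to get the value of the anchor element's href attribute. You can reach the single anchor element by first selecting its parent blockquote element, like this: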

//div[@id='first-tweet-wrapper']/blockquote[@class='twitter-tweet']

Then take the single link inside it. A possible solution could look like this (here the input is the facebook account, but it works for microsoft as well):

// Requires: using System; using System.Linq; using System.Net; using HtmlAgilityPack;
try
{
    // download the html source
    var webClient = new WebClient();
    var source = webClient.DownloadString(@"https://discover.twitter.com/first-tweet?username=facebook#facebook");
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(source);

    var div = doc.DocumentNode.SelectSingleNode("//div[@id='first-tweet-wrapper']/blockquote[@class='twitter-tweet']");
    if (div != null)
    {
        // there is only one link
        var link = div.Descendants("a").FirstOrDefault();
        if (link != null)
        {
            // take the value of the attribute
            var href = link.GetAttributeValue("href", "");
            Console.WriteLine(href);
        }
    }
}
catch (Exception exception)
{
    Console.WriteLine(exception.Message);
}

The output in this case is:

https://twitter.com/facebook/statuses/936094700

Another possibility is to select the anchor element directly with XPath (as @har07 suggested):

    var xpath = @"//div[@id='first-tweet-wrapper']/blockquote[@class='twitter-tweet']/a";
    var link = doc.DocumentNode.SelectSingleNode(xpath);
    if (link != null)
    {
        // take the value of the href-attribute
        var href = link.GetAttributeValue("href", "");
        Console.WriteLine(href);
    }

The output is the same as above.
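For an offline check that does not depend on the live page, here is a minimal sketch that runs the same idea against an excerpt adapted from the question. It assumes the HtmlAgilityPack NuGet package is referenced. The names FirstLinkDemo, Excerpt, and GetFirstHref are hypothetical. The class filter is left out on purpose, because the excerpt in the question shows class="tweet" while the XPath above filters on 'twitter-tweet'.

```csharp
using System;
using HtmlAgilityPack;

public static class FirstLinkDemo
{
    // Excerpt adapted from the question (illustrative input, not the live page).
    public const string Excerpt = @"<div id=""first-tweet-wrapper"">
  <blockquote class=""tweet"" lang=""en"">
    <a href=""http://link.com"">text</a>
  </blockquote>
  <a href=""http://link2.net"" class=""click-tracking"" target=""_blank"">Tweet it!</a>
</div>";

    // Returns the href of the first <a> that is a direct child of the
    // wrapper's <blockquote>, or null when no such element exists.
    public static string GetFirstHref(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var link = doc.DocumentNode.SelectSingleNode(
            "//div[@id='first-tweet-wrapper']/blockquote/a");

        // Read the attribute, not InnerText: InnerText would give "text".
        return link == null ? null : link.GetAttributeValue("href", "");
    }

    public static void Main()
    {
        Console.WriteLine(GetFirstHref(Excerpt));
    }
}
```

Returning null for a missing element keeps "no match" distinct from "match with an empty href", which may help when diagnosing why a query finds nothing.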

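【Comments】:

It works for the HTML fragment in your example. Where does your input come from? Can you paste the URL or the full HTML page? It may be that your input has more than one div containing a link.
I added a check to avoid a null result. Can you give the complete page source in your question?
The HTML I posted is part of this page: discover.twitter.com/first-tweet#Microsoft. I need to get the link to the tweet.
I changed the answer so that it downloads the page and extracts the link.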

【Solution 2】:

Assuming your <div> id is "first-tweet-wrapper" rather than "firt", you can use this XPath query to get the <a> element inside the <blockquote>:

//div[@id='first-tweet-wrapper']/blockquote/a

So your code will look like this:

var a = doc.DocumentNode
             .SelectSingleNode("//div[@id='first-tweet-wrapper']/blockquote/a");
if (a != null)
{
      var text = a.InnerText;
      var link = a.GetAttributeValue("href", "");
}
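If the query comes back null, one thing worth checking is whether the anchor really is a direct child of the <blockquote> in the page being parsed. The sketch below is a looser variant and not part of the original answer: it uses the descendant axis to take the first <a> with an href anywhere under the wrapper div, in document order. It assumes the HtmlAgilityPack NuGet package; LooseLinkDemo and GetFirstHrefAnywhere are hypothetical names, and the sample markup is made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class LooseLinkDemo
{
    // Returns the href of the first <a href="..."> found at any depth
    // under the wrapper div, or null when nothing matches.
    public static string GetFirstHrefAnywhere(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // "//a" under the div matches descendants at any depth;
        // SelectSingleNode returns the first match in document order.
        var a = doc.DocumentNode.SelectSingleNode(
            "//div[@id='first-tweet-wrapper']//a[@href]");

        return a == null ? null : a.GetAttributeValue("href", "");
    }

    public static void Main()
    {
        // Made-up markup: the first anchor is nested one level deeper
        // than the "/blockquote/a" query expects.
        var html = "<div id=\"first-tweet-wrapper\">"
                 + "<blockquote><span><a href=\"http://a.example/1\">x</a></span></blockquote>"
                 + "<a href=\"http://b.example/2\">y</a></div>";

        Console.WriteLine(GetFirstHrefAnywhere(html));
    }
}
```

The trade-off is precision: the looser query tolerates extra nesting, but it will also pick up any other link that happens to appear earlier inside the div.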

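【Comments】:

Tried it, but "a" is null.
Given that the html/xml is exactly as posted, this should work (tested). If you are not testing against the markup in this question, please post the actual xml.
The HTML I posted is part of this page: discover.twitter.com/first-tweet#Microsoft. I need to get the link to the tweet.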