【Question title】: HtmlAgilityPack - Getting rid of ads between HTML comment tags
【Posted】: 2014-09-01 05:46:29
【Question description】:

I need to remove the part between <!-- custom ads --> and <!-- /custom ads --> in this code snippet.

<!-- custom ads -->
<div style="float:left">
  <!-- custom_Forum_Postbit_336x280 -->
  <div id='div-gpt-ad-1526374586789-2' style='width:336px; height:280px;'>
    <script type='text/javascript'>
       googletag.display('div-gpt-ad-1526374586789-2');
    </script>
  </div>
</div>
<div style="float:left; padding-left:20px">
  <!-- custom_Forum_Postbit_336x280_r -->
  <div id='div-gpt-ad-1526374586789-3' style='width:336px; height:280px;'>
    <script type='text/javascript'>
      googletag.display('div-gpt-ad-1526374586789-3');
    </script>
   </div>
</div>
<div class="clear"></div>

 <br>
<!-- /custom ads -->


<!-- google_ad_section_start -->Some Text,<br>
Some More Text...<br>
<!-- google_ad_section_end -->

I can already find the two comments with the XPath //comment()[contains(., 'custom')], but I don't know how to remove everything that sits between these "tags".

        foreach (var comment in htmlDoc.DocumentNode.SelectNodes("//comment()[contains(., 'custom')]"))
        {
            MessageBox.Show(comment.OuterHtml);
        }
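
A side note on the lookup above, as a minimal sketch rather than a definitive fix: the filter 'custom' also matches the nested comments such as custom_Forum_Postbit_336x280, and SelectNodes returns null (not an empty collection) when nothing matches, so the foreach can throw on a page without ads. The class and method names below are hypothetical and only serve the illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class AdCommentFinder
{
    // Hypothetical helper: prints the marker comments and returns how many were found.
    public static int CountMarkers(string html)
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        // 'custom ads' is narrower than 'custom', so comments such as
        // custom_Forum_Postbit_336x280 are no longer matched.
        var comments = htmlDoc.DocumentNode
                              .SelectNodes("//comment()[contains(., 'custom ads')]");

        // SelectNodes returns null when there is no match, so guard before iterating.
        if (comments == null)
            return 0;

        foreach (var comment in comments)
            Console.WriteLine(comment.OuterHtml);

        return comments.Count;
    }
}
```
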

Any suggestions?

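【Comments】:

  • Get all the child nodes of the parent that holds the two comment tags, then iterate over the children and remove every node from the first comment through the second.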
  • var newhtml = Regex.Replace(html, Regex.Escape(start) + ".+?" + Regex.Escape(end), "", RegexOptions.Singleline);
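
The regex suggestion in the comment above can be sketched as a small helper. This is an illustration under assumptions, not the commenter's exact code: the names AdStripper and RemoveBetween are hypothetical, and the pattern uses ".*?" instead of ".+?" so that an empty block (start marker immediately followed by the end marker) is removed as well. It works on the raw HTML string, so it needs no parser, but it also deletes the two marker comments themselves.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AdStripper
{
    // Hypothetical helper: removes every start...end block, markers included.
    public static string RemoveBetween(string html, string start, string end)
    {
        // Regex.Escape keeps the marker text literal.
        // ".*?" is lazy, so each block stops at its own end marker
        // instead of running to the last one on the page.
        // RegexOptions.Singleline lets "." match newlines.
        string pattern = Regex.Escape(start) + ".*?" + Regex.Escape(end);
        return Regex.Replace(html, pattern, string.Empty, RegexOptions.Singleline);
    }
}
```

Calling AdStripper.RemoveBetween(html, "<!-- custom ads -->", "<!-- /custom ads -->") returns the page with every such block cut out. Because the match is purely textual, it depends on the markers being written exactly the same way each time.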

Tags: c# xpath html-agility-pack


【Solution 1】:
//find all comment nodes that contain "custom ads"
var nodes = doc.DocumentNode
               .Descendants()
               .OfType<HtmlCommentNode>()
               .Where(c => c.Comment.Contains("custom ads"))
               .ToList();
//create a sequence of pairs of nodes
var nodePairs = nodes
    .Select((node, index) => new {node, index})
    .GroupBy(x => x.index / 2)
    .Select(g => g.ToArray())
    .Select(a => new { startComment = a[0].node, endComment = a[1].node});

foreach (var pair in nodePairs)
{
    var startNode = pair.startComment;
    var endNode = pair.endComment;
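    //check they share the same parent, otherwise the sibling walk below never reaches endNode
    if(startNode.ParentNode != endNode.ParentNode)
        throw new InvalidOperationException("Start and end comments must share the same parent node.");
    //iterate over all nodes in between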
    var currentNode = startNode.NextSibling;
    while(currentNode != endNode)
    {
        //currentNode won't have siblings when we trim it from the doc
        //so grab the nextSibling while it's still attached
        var n = currentNode.NextSibling;
        //and cut out currentNode
        currentNode.Remove();
        currentNode = n;
    }
}
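
For reuse, the same sibling walk can be wrapped in a helper. The following is a sketch under assumptions, not part of HtmlAgilityPack: CommentBlockRemover, RemoveBlocks and the removeMarkers flag are hypothetical names, and the code assumes that matching comments alternate start, end, start, end in document order. Compared with the snippet above it rejects an odd number of markers explicitly, stops if the walk runs out of siblings, and can optionally delete the marker comments too.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class CommentBlockRemover
{
    // Hypothetical helper: removes everything between paired marker comments.
    public static void RemoveBlocks(HtmlDocument doc, string marker, bool removeMarkers)
    {
        // ToList() snapshots the comments before the tree is modified.
        var comments = doc.DocumentNode
                          .Descendants()
                          .OfType<HtmlCommentNode>()
                          .Where(c => c.Comment.Contains(marker))
                          .ToList();

        if (comments.Count % 2 != 0)
            throw new InvalidOperationException("Found an unpaired '" + marker + "' comment.");

        for (int i = 0; i < comments.Count; i += 2)
        {
            HtmlNode start = comments[i];
            HtmlNode end = comments[i + 1];

            if (start.ParentNode != end.ParentNode)
                throw new InvalidOperationException("Start and end comments must be siblings.");

            // Grab the next sibling before removing the current node.
            var current = start.NextSibling;
            while (current != null && current != end)
            {
                var next = current.NextSibling;
                current.Remove();
                current = next;
            }

            if (removeMarkers)
            {
                start.Remove();
                end.Remove();
            }
        }
    }
}
```

After doc.LoadHtml(html), a call such as CommentBlockRemover.RemoveBlocks(doc, "custom ads", true) handles any number of ad blocks on the page, and doc.DocumentNode.OuterHtml then holds the cleaned markup.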

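【Comments】:

  • Thanks, looks good. if(nodes.Count != 2) throw new Exception() doesn't work for me, though: a page can contain more than one ad, but there will always be at least one.
  • Thanks a lot. I just wrapped your first version of the code in a for loop. This one is solid, though!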