【Question title】: Html Agility Pack question about parsing
【Posted】: 2010-06-24 20:13:51
【Question description】:

I have this simple string:

string testString = "6/21 <span style='font-size: x-small; font-family: Arial'><span style='font-size: 10pt; font-family: Arial'>Just got 78th street</span></span>";

How can I use the Html Agility Pack to extract only the text?

Note that one span is nested inside another.

Thanks, Rod.

【Question discussion】:

    Tags: c# asp.net html-agility-pack


    【Solution 1】:

    I think the InnerText property should give you just the text:

    var testString = "6/21 <span style='font-size: x-small; font-family: Arial'><span style='font-size: 10pt; font-family: Arial'>Just got 78th street</span></span>";
    var doc = new HtmlDocument();
    doc.LoadHtml(testString);
    var justTheText = doc.DocumentNode.InnerText;
    

    This code returns:

    6/21 Just got 78th street
    

    Is this what you want?
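
One caveat, offered as a hedged sketch rather than part of the original answer: to my knowledge InnerText returns the text as it appears in the markup, so entities such as `&amp;` may come back undecoded. Passing the result through `HtmlEntity.DeEntitize` should take care of that. The class name `InnerTextDemo`, the helper `ToPlainText`, and the sample input are assumptions made up for this illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class InnerTextDemo
{
    // Hypothetical helper: returns the text of an HTML fragment with entities decoded.
    public static string ToPlainText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // InnerText concatenates the text of all descendant nodes, nested spans included.
        string rawText = doc.DocumentNode.InnerText;

        // DeEntitize turns entities like &amp; back into the characters they stand for.
        return HtmlEntity.DeEntitize(rawText);
    }

    public static void Main()
    {
        // Hypothetical input, modelled on the string from the question.
        string html = "6/21 <span><span>Tom &amp; Jerry on 78th street</span></span>";
        Console.WriteLine(ToPlainText(html));
    }
}
```

If the input never contains entities, the plain `InnerText` call from the answer is enough on its own.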

    【Discussion】:

    • You got it. Thanks for your help, Rod.
    【Solution 2】:

    Another approach is to use a sanitizer class that removes the tags you don't want while keeping their text.

    using System.Collections.Generic;
    using System.Linq;
    using HtmlAgilityPack;

    public static class HtmlSanitizer
        {
            private static readonly IDictionary<string, string[]> Whitelist;
            private static List<string> DeletableNodesXpath = new List<string>();
    
            static HtmlSanitizer()
            {
                Whitelist = new Dictionary<string, string[]> {
                    { "a", new[] { "href" } },
                    { "strong", null },
                    { "em", null },
                    { "blockquote", null },
                    { "b", null},
                    { "p", null},
                    { "ul", null},
                    { "ol", null},
                    { "li", null},
                    { "div", new[] { "align" } },
                    { "strike", null},
                    { "u", null},                
                    { "sub", null},
                    { "sup", null},
                    { "table", null },
                    { "tr", null },
                    { "td", null },
                    { "th", null }
                    };
            }
    
            public static string Sanitize(string input)
            {
                // Treat null, empty, and whitespace-only input alike; input.Trim() would throw on null.
                if (string.IsNullOrWhiteSpace(input))
                    return string.Empty;
                var htmlDocument = new HtmlDocument();
    
                htmlDocument.LoadHtml(input);            
                SanitizeNode(htmlDocument.DocumentNode);
                string xPath = HtmlSanitizer.CreateXPath();
    
                return StripHtml(htmlDocument.DocumentNode.WriteTo().Trim(), xPath);
            }
    
            private static void SanitizeChildren(HtmlNode parentNode)
            {
                for (int i = parentNode.ChildNodes.Count - 1; i >= 0; i--)
                {
                    SanitizeNode(parentNode.ChildNodes[i]);
                }
            }
    
            private static void SanitizeNode(HtmlNode node)
            {
                if (node.NodeType == HtmlNodeType.Element)
                {
                    if (!Whitelist.ContainsKey(node.Name))
                    {
                        // Rename the disallowed element to a marker name so one XPath can find it later.
                        node.Name = "removeableNode";
                        // Record the marker name only once; node.Name reads back lower-cased.
                        if (!DeletableNodesXpath.Contains(node.Name))
                        {
                            DeletableNodesXpath.Add(node.Name);
                        }
                        if (node.HasChildNodes)
                        {
                            SanitizeChildren(node);
                        }                  
    
                        return;
                    }
    
                    if (node.HasAttributes)
                    {
                        for (int i = node.Attributes.Count - 1; i >= 0; i--)
                        {
                            HtmlAttribute currentAttribute = node.Attributes[i];
                            string[] allowedAttributes = Whitelist[node.Name];
                            if (allowedAttributes != null)
                            {
                                if (!allowedAttributes.Contains(currentAttribute.Name))
                                {
                                    node.Attributes.Remove(currentAttribute);
                                }
                            }
                            else
                            {
                                node.Attributes.Remove(currentAttribute);
                            }
                        }
                    }
                }
    
                if (node.HasChildNodes)
                {
                    SanitizeChildren(node);
                }
            }
    
            private static string StripHtml(string html, string xPath)
            {
                HtmlDocument htmlDoc = new HtmlDocument();
                htmlDoc.LoadHtml(html);
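                if (xPath.Length > 0)
                {
                    // SelectNodes returns null, not an empty collection, when nothing matches.
                    HtmlNodeCollection invalidNodes = htmlDoc.DocumentNode.SelectNodes(xPath);
                    if (invalidNodes != null)
                    {
                        foreach (HtmlNode node in invalidNodes)
                        {
                            // keepGrandChildren = true: drop the element but keep its contents.
                            node.ParentNode.RemoveChild(node, true);
                        }
                    }
                }
                return htmlDoc.DocumentNode.WriteContentTo();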
            }
    
            private static string CreateXPath()
            {
                // Build "//name1|//name2|..." from the recorded marker names.
                return string.Join("|", DeletableNodesXpath.ConvertAll(name => "//" + name).ToArray());
            }
        }
    

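    It is all quite simple. I took a whitelist approach: anything not on the whitelist is stripped, so trim the whitelist down to the tags and attributes you want to keep. Usage: HtmlSanitizer.Sanitize(html)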
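
To show the whitelist idea in isolation, here is a much smaller sketch. It is not the class above: it skips attribute filtering and the rename-then-XPath step, and simply removes every element whose name is not in an allowed set while keeping that element's children. `WhitelistDemo`, `AllowedTags`, `KeepAllowedTags`, and the sample input are hypothetical names chosen for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class WhitelistDemo
{
    // Hypothetical, deliberately tiny whitelist for this demo.
    private static readonly HashSet<string> AllowedTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "b", "em" };

    public static string KeepAllowedTags(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Snapshot the elements first so removing nodes does not disturb the enumeration.
        List<HtmlNode> elements = doc.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element)
            .ToList();

        foreach (HtmlNode element in elements)
        {
            if (!AllowedTags.Contains(element.Name))
            {
                // true = remove the element itself but keep its children in the document.
                element.ParentNode.RemoveChild(element, true);
            }
        }

        return doc.DocumentNode.WriteContentTo();
    }

    public static void Main()
    {
        // Hypothetical input: the span is not whitelisted, the b element is.
        Console.WriteLine(KeepAllowedTags("6/21 <span><b>Just got</b> 78th street</span>"));
    }
}
```

With this input the span wrapper is dropped while the b element and all of the text remain. An empty whitelist would leave only the text, which is what the question asks for.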

    【Discussion】:
