【Question title】: Efficient code: Remove tags from a string except the first one
【Posted】: 2015-11-19 04:32:01
【Question】:

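I wrote a function that searches a string for the given tags and removes every one of them, together with its content, except the first occurrence of each: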

Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module Module1
    Sub Main()
        Dim fileAsString = "<div>myFirstDiv</div>" +
                           "<Div></dIV>" +
                           "<city>NY</city>" +
                           "<city></city>" +
                           "<div></div>" +
                           "<span></span>"

        ' Remove these tags and their content from fileAsString,
        ' except for the first occurrence of each
        Dim forbiddenNodeslist As New List(Of String)
        forbiddenNodeslist.Add("div")
        forbiddenNodeslist.Add("city")

        ' Iterate over the forbidden tags
        For Each node In forbiddenNodeslist

            Dim re = New Regex("<" + node + "[^>]*>(.*?)</" + node + ">", RegexOptions.IgnoreCase)
            Dim matches = re.Matches(fileAsString)

            ' Running total of characters already removed, used to shift
            ' the start index of the remaining matches
            Dim removedCharacters = 0

            ' Iterate over the matches, skipping the first one
            For index = 1 To matches.Count - 1
                Dim match = matches(index)

                ' Start index (adjusted for earlier removals) and length
                ' of the span to remove
                Dim startIndex = match.Index - removedCharacters
                Dim matchCharactersCount = match.Length

                ' Accumulate the removed length; plain assignment here would
                ' give wrong offsets once more than two matches are removed
                removedCharacters += matchCharactersCount

                ' Remove the match from the string
                fileAsString = fileAsString.Remove(startIndex, matchCharactersCount)
            Next
        Next
    End Sub
End Module

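But it is inefficient: I first search for the matches (one pass over the string), and then loop again and again to replace each of them with an empty string.

How can I make this more efficient?

Any help is appreciated!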

【Comments】:

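  • Is there a reason you store the removed characters and the positions of the removed tags? If not, it is just extra overhead. Loop over your list of offending tags and remove/replace all occurrences with a single statement. stackoverflow.com/questions/6025560/…
  • Yes, I store it because when I remove part of the string, the start index of the next match has to be updated. For example, in "<div></div><div></div><div></div>" the first div is at index 0, the second at 11, and the third at 22. Once I remove the second div, the third div is at index 11 instead of 22.
  • You could reverse the whole string, remove every occurrence except the last one, and then reverse it again to get the same result.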
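
The index bookkeeping discussed in these comments can be avoided by letting Regex.Replace do the removal in one pass. The following is only a minimal sketch, not the approach of anyone in the thread: the module name TagFilterSketch and the helper KeepFirstOnly are hypothetical names introduced for illustration. It assumes the tags are not nested inside a tag of the same name and that a regular expression is acceptable for this markup.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Text.RegularExpressions

Module TagFilterSketch
    ' Hypothetical helper: keeps the first element of each listed tag name
    ' and removes the later ones in a single regex pass.
    Function KeepFirstOnly(input As String, tags As IEnumerable(Of String)) As String
        ' Tag names that have already been kept, compared case-insensitively
        Dim seen As New HashSet(Of String)(StringComparer.OrdinalIgnoreCase)

        ' One alternation for all tag names, e.g. <(div|city)\b[^>]*>.*?</\1>
        Dim names = String.Join("|", tags.Select(Function(t) Regex.Escape(t)))
        Dim pattern = "<(" & names & ")\b[^>]*>.*?</\1>"

        ' HashSet.Add returns True only the first time a name is added,
        ' so the first match per tag is kept and later ones become empty
        Dim keepFirst As MatchEvaluator =
            Function(m) If(seen.Add(m.Groups(1).Value), m.Value, String.Empty)

        Return Regex.Replace(input, pattern, keepFirst,
                             RegexOptions.IgnoreCase Or RegexOptions.Singleline)
    End Function

    Sub Main()
        Dim fileAsString = "<div>myFirstDiv</div><Div></dIV><city>NY</city>" &
                           "<city></city><div></div><span></span>"
        ' Prints: <div>myFirstDiv</div><city>NY</city><span></span>
        Console.WriteLine(KeepFirstOnly(fileAsString, {"div", "city"}))
    End Sub
End Module
```

Because the replacement happens inside the regex engine, no start index has to be corrected after each removal, and the string is rebuilt once rather than once per removed tag.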

Tags: .net vb.net


【Solution 1】:

I answered this in C#. You can find the fiddle I used here.

using System;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        var fileAsString = "<div>myFirstDiv</div><Div></dIV><city>NY</city><city></city><div></div><span></span>";

        // Pipe-delimited list; it is dropped straight into the second regex
        var delimetedForbiddenList = "div|city";

        // Use this regex to split off everything that isn't the first tag
        var allButFirstTagRegex = new Regex(@"^(<([a-z]+)>[^</]*</\2>)(.*)", RegexOptions.IgnoreCase);
        var matches = allButFirstTagRegex.Matches(fileAsString);

        // matches[0].Groups[1] = (<([a-z]+)>[^</]*</\2>) -- the complete first
        // tag (open, close, and inner), we'll use this later

        // matches[0].Groups[2] = ([a-z]+) -- the name of the first opening tag,
        // used by \2 to find the matching close tag

        // matches[0].Groups[3] = (.*) -- everything after the first tag

        var allButFirstTag = matches[0].Groups[3].ToString();
        // allButFirstTag == @"<Div></dIV><city>NY</city><city></city><div></div><span></span>"

        // The regex that removes our forbidden tags
        var removeForbiddenPattern = String.Format("(<({0})>[^</]*</\\2>)", delimetedForbiddenList);
        // removeForbiddenPattern == @"(<(div|city)>[^</]*</\2>)"

        var resultsWithForbiddenRemoved = Regex.Replace(allButFirstTag, removeForbiddenPattern, String.Empty, RegexOptions.IgnoreCase);
        // resultsWithForbiddenRemoved == @"<span></span>"

        var finalResults = matches[0].Groups[1].ToString() + resultsWithForbiddenRemoved;
        // finalResults == @"<div>myFirstDiv</div><span></span>"

        Console.WriteLine(finalResults);
    }
}
