(C#) Improving speed of custom getBetweenAll: answers

【Question title】: (C#) Improving speed of custom getBetweenAll
【Posted】: 2016-05-08 08:57:07
【Question】:

I wrote a custom extension method in C# that improves on the extension method string[] getBetweenAll(string source, string startstring, string endstring);

The original extension method finds all substrings between two strings, for example:

string source = "<1><2><3><4>";
source.getBetweenAll("<", ">");
//output: string[] {"1", "2", "3", "4"}

But if there is an extra occurrence of "<" at the start, the result is no longer what I want:

string source = "<<1><2><3><4>";
source.getBetweenAll("<", ">");
//output: string[] {"<1><2><3><4"}

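So I rewrote it to be more accurate: it now searches backwards from each ">" to find the nearest preceding "<".

I have it working now, but the problem is that it is too slow, because for every occurrence the search method steps through every character of the whole string. Do you know how I could improve the speed of this function? Or is that not possible?

This is the full code so far: http://pastebin.com/JEZmyfSG. I added comments where the code needs to be sped up.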

public static List<int> IndexOfAll(this string main, string searchString)
{
    List<int> ret = new List<int>();
    int len = searchString.Length;
    int start = -len;
    while (true)
    {
        start = main.IndexOf(searchString, start + len);
        if (start == -1)
        {
            break;
        }
        else
        {
            ret.Add(start);
        }
    }
    return ret;
}

public static string[] getBetweenAll(this string main, string strstart, string strend, bool preserve = false)
{
    List<string> results = new List<string>();
    List<int> ends = main.IndexOfAll(strend);
    foreach (int end in ends)
    {
        int start = main.previousIndexOf(strstart, end);  //This is where it has to search the whole source string every time
        results.Add(main.Substring(start, end - start) + (preserve ? strend : string.Empty));
    }
    return results.ToArray();
}

//This is the slow function (depends on main.Length)
public static int previousIndexOf(this string main, string find, int offset)
{
    int wtf = main.Length ;
    int x = main.LastIndexOf(find, wtf);
    while (x > offset)
    {
        x = main.LastIndexOf(find, wtf);
        wtf -= 1;
    }
    return x;
}

I think another method, PreviousIndexOf(string, int searchfrom), would improve the speed: like IndexOf(), except that it searches backwards from a supplied start offset.
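A minimal sketch of what such a helper could look like, assuming the built-in string.LastIndexOf(string, int, StringComparison) overload is acceptable here, since it already searches backwards from a given position. The class name StringSearchSketch and the convention that a match must end before offset are assumptions made for illustration; they are not part of the code above.

```csharp
using System;

public static class StringSearchSketch
{
    // Returns the index of the last occurrence of 'find' that lies entirely
    // before 'offset', or -1 if there is none.
    // Assumption: "before offset" means the match ends at or before offset - 1.
    public static int PreviousIndexOf(this string main, string find, int offset)
    {
        if (string.IsNullOrEmpty(find) || main.Length == 0 || offset <= 0)
        {
            return -1;
        }

        // LastIndexOf scans backwards starting at startIndex, and the whole
        // match has to fit inside the range [0, startIndex].
        int startIndex = Math.Min(offset, main.Length) - 1;

        // Ordinal comparison skips culture-sensitive matching, which is
        // usually the cheaper option for plain token searches.
        return main.LastIndexOf(find, startIndex, StringComparison.Ordinal);
    }
}
```

With this, each lookup only walks back from the end token to the nearest start token instead of rescanning from the end of the string, e.g. "<<1><2>".PreviousIndexOf("<", 3) returns 1.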

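【Comments】:

Yes, but it's fun!
Could a compiled regular expression (e.g. <([^>]*)>) be used to speed this up?
What counts as slow for you, and what result would you consider good?
Micro-optimizations here! You could try changing ends to an array and iterating over it with for (int i = 0 etc., and passing the expected maximum size of results to the List constructor for results.
Is there any reason not to use a regular expression for this?

Tags: c# string performance indexof

【Solution 1】:

As with the original GetBetweenAll, we can use a regular expression. To match only the shortest "inner" occurrence of the enclosed string, we have to use a negative lookahead for the start string and a non-greedy quantifier for the content.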

public static string[] getBetweenAll(this string main, 
    string strstart, string strend, bool preserve = false)
{
    List<string> results = new List<string>();

    string regularExpressionString = string.Format("{0}(((?!{0}).)+?){1}", 
        Regex.Escape(strstart), Regex.Escape(strend));
    Regex regularExpression = new Regex(regularExpressionString, RegexOptions.IgnoreCase);

    var matches = regularExpression.Matches(main);

    foreach (Match match in matches)
    {
        if (preserve)
        {
            results.Add(match.Value);
        }
        else
        {
            results.Add(match.Groups[1].Value);
        }
    }

    return results.ToArray();
}
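To illustrate why the negative lookahead matters, here is a small comparison sketch. The class LookaheadDemo and the helper Capture are hypothetical names introduced for this example only; the patterns are hard-coded for the "<" and ">" tokens from the question rather than built with Regex.Escape.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // Returns the first capture group of every match of 'pattern' in 'input'.
    public static string[] Capture(string input, string pattern)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value)
                    .ToArray();
    }

    // A lazy quantifier alone still starts at the first "<" it meets,
    // so an extra "<" leaks into the captured content.
    public static string[] LazyOnly(string input)
    {
        return Capture(input, "<(.+?)>");
    }

    // The lookahead (?!<) forbids a "<" inside the content, which forces
    // the match to begin at the "<" closest to the ">".
    public static string[] LazyWithLookahead(string input)
    {
        return Capture(input, "<((?:(?!<).)+?)>");
    }
}
```

For the input "<<1><2>", LazyOnly yields "<1" and "2", while LazyWithLookahead yields "1" and "2".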

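【Comments】:

You should post the text you are actually testing with and make your expectations clear. This works according to the explanation in your question.

【Solution 2】:

I wrote a simple method that is four times faster than yours (but it does not implement the preserve parameter yet):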

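public static string[] getBetweenAll2(this string main, string strstart, string strend, bool preserve = false)
{
    // Note: preserve is accepted for signature compatibility but not implemented yet.
    List<string> results = new List<string>();

    int lenStart = strstart.Length;
    int lenEnd = strend.Length;

    int indexStart = 0;
    while (true)
    {
        // Find the next start token at or after the current position.
        indexStart = main.IndexOf(strstart, indexStart);
        if (indexStart < 0)
            break;

        // Look for the end token only after the start token, so the two cannot overlap.
        int contentStart = indexStart + lenStart;
        int indexEnd = main.IndexOf(strend, contentStart);
        if (indexEnd < 0)
            break;

        results.Add(main.Substring(contentStart, indexEnd - contentStart));

        // Continue scanning after the end token.
        indexStart = indexEnd + lenEnd;
    }
    return results.ToArray();
}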

This gives you the numbers 1, 2, 3 and 4 from the string <1><2><3><4>.

Is this what you want?

[Edit]

To find nested things:

public static string[] getBetweenAll2(this string main, string strstart, string strend, bool preserve = false)
{
    // Note: preserve is not implemented in this version either.
    List<string> results = new List<string>();

    int lenStart = strstart.Length;
    int lenEnd = strend.Length;

    int index = 0;

    // Content start positions of the start tokens that are still open.
    Stack<int> startPos = new Stack<int>();

    while (true)
    {
        int indexStart = main.IndexOf(strstart, index);
        int indexEnd = main.IndexOf(strend, index);

        if (indexStart != -1 && indexStart < indexEnd)
        {
            // A start token comes first: remember where its content begins.
            index = indexStart + lenStart;
            startPos.Push(index);
        }
        else if (indexEnd != -1 && (indexEnd < indexStart || indexStart == -1))
        {
            if (startPos.Count == 1)
            {
                // Closing the outermost open token: emit everything inside it.
                int startOfInterest = startPos.Pop();
                results.Add(main.Substring(startOfInterest, indexEnd - startOfInterest));
            }
            else if (startPos.Count > 0)
            {
                // Closing a nested token: discard it, only the outermost span is reported.
                startPos.Pop();
            }
            index = indexEnd + lenEnd;
        }
        else
        {
            // No usable start/end pair remains.
            break;
        }
    }
    return results.ToArray();
}

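【Comments】:

Not really. It is faster, but when I try to grab things it prints strange results. It should search from the end, say ".mp4", back until it reaches the first occurrence of "http://", to avoid grabbing everything from the start of the whole string and everything in between up to "http://". It's really hard to explain, but I hope you understand.
@Ragnar Then show some actual sample input and output.

【Solution 3】:

I found that this does what I need, but in a different way! A function that does PreviousIndexOf(string source, string token, int offset) would still be much appreciated for other things!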

public static List<string> GetBetweenAll(this string main, string start, string finish, bool preserve = false,  int index = 0)
{
    List<string> matches = new List<string>();
    Match gbMatch = new Regex(Regex.Escape(start) + "(.+?)" + Regex.Escape(finish)).Match(main, index);
    while (gbMatch.Success)
    {
        matches.Add((preserve ? start : string.Empty) + gbMatch.Groups[1].Value + (preserve ? finish : string.Empty));
        gbMatch = gbMatch.NextMatch();
    }
    return matches;
}
public static string[] getBetweenAllBackwards(this string main, string strstart, string strend, bool preserve = false)
{
    List<string> all = Reverse(main).GetBetweenAll(Reverse(strend), Reverse(strstart), preserve);
    for (int i = 0; i < all.Count; i++)
    {
        all[i] = Reverse(all[i]);
    }
    return all.ToArray();
}
public static string Reverse(string s)
{
    char[] charArray = s.ToCharArray();
    Array.Reverse(charArray);
    return new string(charArray);
}

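【Comments】:

【Solution 4】:

Use a stack. As soon as you see an opening token, start adding characters to the stack. Once you see the closing token, pop everything off your stack; those are the characters you are interested in.

Once you have the basic case implemented, you can improve it using recursion. If you see another opening token before the closing token, start collecting characters into a new stack until you see a closing token.

This gives you O(N) complexity, because you only need to pass over the input once.

You also need to handle the case where you see a closing token before an opening token, but it is not clear from your question what the program should do then.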
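The idea above could be sketched roughly as follows. This is a variant under stated assumptions rather than the answer's exact design: it pushes the start positions of the content instead of individual characters, uses an explicit stack instead of recursion, and silently ignores a closing token that has no opening token before it. The names StackScanSketch, GetBetweenAllStack and MatchesAt are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class StackScanSketch
{
    // True if 'token' occurs in 's' exactly at position 'i' (ordinal comparison).
    private static bool MatchesAt(string s, string token, int i)
    {
        return i + token.Length <= s.Length
            && string.CompareOrdinal(s, i, token, 0, token.Length) == 0;
    }

    // Single left-to-right pass: every closing token is paired with the
    // most recent opening token that is still unmatched.
    public static List<string> GetBetweenAllStack(string main, string open, string close)
    {
        if (string.IsNullOrEmpty(open) || string.IsNullOrEmpty(close))
        {
            throw new ArgumentException("Tokens must be non-empty.");
        }

        var results = new List<string>();
        var starts = new Stack<int>(); // content start positions of open tokens
        int i = 0;

        while (i < main.Length)
        {
            if (MatchesAt(main, open, i))
            {
                i += open.Length;
                starts.Push(i);
            }
            else if (MatchesAt(main, close, i))
            {
                // Assumption: a closing token with nothing open is skipped.
                if (starts.Count > 0)
                {
                    int start = starts.Pop();
                    results.Add(main.Substring(start, i - start));
                }
                i += close.Length;
            }
            else
            {
                i++;
            }
        }
        return results;
    }
}
```

For "<<1><2><3><4>" with "<" and ">" this yields 1, 2, 3 and 4; the stray leading "<" simply stays unmatched on the stack. For nested input such as "<a<b>c>" it reports the inner span "b" first and then "a<b>c".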

【Comments】:

The problem is that it needs to start from each occurrence of ">"
That doesn't change anything: now your opening token is ">" and you go from the end of the string to the beginning.