String.comparison performance (with trim): answers

【Question title】: String.comparison performance (with trim)
【Posted】: 2010-12-24 03:43:29
【Question】:

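I need to do a lot of high-performance, case-insensitive string comparisons, and I realized that my way of doing it, .ToLower().Trim(), was really stupid because of all the new strings being allocated.

So I dug around a little, and this way seems preferable: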

String.Compare(txt1,txt2, StringComparison.OrdinalIgnoreCase)

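The only problem here is that I want to ignore leading or trailing spaces, i.e. Trim(), but if I use Trim I have the same problem with string allocations. I guess I could check each string to see whether it StartsWith(" ") or EndsWith(" ") and only then trim. Either that, or work out the index and length for each string and pass them to this String.Compare overload: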

public static int Compare
(
    string strA,
    int indexA,
    string strB,
    int indexB,
    int length,
    StringComparison comparisonType
)

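But that seems rather messy, and I would probably have to use some integers if I don't write a really big if-else statement for every combination of trailing and leading blanks on both strings... so any ideas for an elegant solution?

Here's my current proposal: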

public bool IsEqual(string a, string b)
{
    return (string.Compare(a, b, StringComparison.OrdinalIgnoreCase) == 0);
}

public bool IsTrimEqual(string a, string b)
{
    // Shortcut: if the lengths differ by more than 2, the strings can't be equal.
    // This only holds when each string has at most one white-space character
    // on each end; with more padding it rejects strings that match once trimmed.
    if (Math.Abs(a.Length - b.Length) > 2)
    {
        return false;
    }
    else if (IsEqual(a, b))
    {
        return true;
    }
    else
    {
        return (string.Compare(a.Trim(), b.Trim(), StringComparison.OrdinalIgnoreCase) == 0);
    }
}

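【Question discussion】:

What makes you think there's a problem? Premature optimization is a bad idea: you don't need to optimize until your application is "too slow". In the meantime, focus on clear code rather than fast code.
Are you sure the compiler isn't optimizing this case for you?
I'd also ask whether this micro-optimization is really needed. Do you actually have a performance problem here? I'd guess there are other areas where you can gain far more performance.
It's for a search engine over a very large set of strings, so I think optimizing is relevant in this case. Besides, having a good string comparison method in one's toolbox isn't a bad thing.
@Anon: I don't think this is premature optimization. With a large number of strings, creating new string instances for every comparison could take considerably longer. Just run some tests and see for yourself...

Tags: c# string string-comparison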

【Solution 1】:

This should do it:

public static int TrimCompareIgnoreCase(string a, string b) {
   int indexA = 0;
   int indexB = 0;
   while (indexA < a.Length && Char.IsWhiteSpace(a[indexA])) indexA++;
   while (indexB < b.Length && Char.IsWhiteSpace(b[indexB])) indexB++;
   int lenA = a.Length - indexA;
   int lenB = b.Length - indexB;
   while (lenA > 0 && Char.IsWhiteSpace(a[indexA + lenA - 1])) lenA--;
   while (lenB > 0 && Char.IsWhiteSpace(b[indexB + lenB - 1])) lenB--;
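   if (lenA == 0 && lenB == 0) return 0;
   if (lenA == 0) return -1; // a is empty after trimming, so it sorts before b
   if (lenB == 0) return 1;  // b is empty after trimming, so it sorts before a
   // Ordinal, case-insensitive comparison, as in the question;
   // the bool ignoreCase overload would use the current culture instead.
   int result = String.Compare(a, indexA, b, indexB, Math.Min(lenA, lenB), StringComparison.OrdinalIgnoreCase);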
   if (result == 0) {
      if (lenA < lenB) result--;
      if (lenA > lenB) result++;
   }
   return result;
}

Example:

string a = "  asdf ";
string b = " ASDF \t   ";

Console.WriteLine(TrimCompareIgnoreCase(a, b));

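Output:

0

You should profile it against a simple trim-and-compare on some real data, to see whether it actually makes any difference for what you are going to use it for.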
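
For what it's worth, here is a rough sketch of what such a profiling run could look like. The class name TrimCompareBenchmark and its Measure helper are made up for illustration, and a Stopwatch loop like this is only a coarse measurement, not a rigorous benchmark. It times any comparison delegate over every pair in a data set, with plain Trim-and-compare as the baseline:

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper, not part of the answer above.
public static class TrimCompareBenchmark
{
    // Baseline: may allocate up to two trimmed copies per call.
    public static bool TrimEquals(string a, string b) =>
        string.Equals(a.Trim(), b.Trim(), StringComparison.OrdinalIgnoreCase);

    // Runs `compare` on every ordered pair in `data`, `iterations` times.
    // The match count is returned so the work cannot be optimized away,
    // and so two candidates can be checked against each other.
    public static (TimeSpan Elapsed, int Matches) Measure(
        Func<string, string, bool> compare, string[] data, int iterations)
    {
        int matches = 0;
        var watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            foreach (string a in data)
                foreach (string b in data)
                    if (compare(a, b)) matches++;
        watch.Stop();
        return (watch.Elapsed, matches);
    }
}
```

To time the method above, you would pass something like (a, b) => TrimCompareIgnoreCase(a, b) == 0 as the delegate, run both candidates over the same real data, and check that the match counts agree before comparing the elapsed times.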

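【Discussion】:

Interesting, thanks! I'll run some comparisons against the different approaches and see which one comes out on top.
@konrad What were the results of comparing this solution against Trim?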

【Solution 2】:

I would use your code

String.Compare(txt1,txt2, StringComparison.OrdinalIgnoreCase)

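and add .Trim() calls only where you need them. Compared with your initial option (.ToLower().Trim() on both strings), this saves four string allocations most of the time, and always saves two (the .ToLower() calls).

If you still have performance problems after that, then your "messy" option is probably the best choice.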
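
As a minimal sketch of the "Trim only when needed" idea, something along these lines might work. The names LazyTrimCompare, EqualsIgnoreCaseAndPadding and NeedsTrim are invented for this example. It tries the allocation-free comparison first and only calls Trim() on a string whose first or last character is white space:

```csharp
using System;

// Hypothetical helper illustrating the fast-path-then-trim idea.
public static class LazyTrimCompare
{
    public static bool EqualsIgnoreCaseAndPadding(string a, string b)
    {
        // Fast path: no allocation, succeeds when the raw strings already match.
        if (string.Equals(a, b, StringComparison.OrdinalIgnoreCase))
            return true;
        if (a == null || b == null)
            return false;

        bool aPadded = NeedsTrim(a);
        bool bPadded = NeedsTrim(b);
        // Neither string has padding, so the mismatch above is final.
        if (!aPadded && !bPadded)
            return false;

        // Slow path: trim only the strings that actually have padding.
        return string.Equals(
            aPadded ? a.Trim() : a,
            bPadded ? b.Trim() : b,
            StringComparison.OrdinalIgnoreCase);
    }

    // True when the first or last character is white space.
    private static bool NeedsTrim(string s) =>
        s.Length > 0 && (char.IsWhiteSpace(s[0]) || char.IsWhiteSpace(s[s.Length - 1]));
}
```

Whether the extra checks beat an unconditional Trim() depends on how many of your strings are padded, so it is worth measuring on your own data.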

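【Discussion】:

This is interesting. Mattias: if most of your strings don't need the trim() call, you could generally do this, and if the strings don't match, fall back and retry with the trim() calls before "really" returning that they don't match.
Hmm, in that case I guess I should run some tests to see whether the required IsPrefix()/IsSuffix() calls (four of them) perform better or worse than simply doing a Trim.
Ah, of course! Do the plain comparison first, then the trimmed comparison (or the messy approach) only if the result isn't 0. Nice.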

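【Solution 3】:

First make sure that you really need to optimize this code. Maybe creating copies of the strings won't noticeably affect your program.

If you really do need to optimize, you can try to process the strings when you first store them instead of when you compare them (assuming that happens at different stages of the program). For example, store trimmed and lowercased versions of the strings, so that you can use a simple equality check when comparing them.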
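
A small sketch of that idea, assuming the strings end up in some kind of lookup set. The class NormalizedStringSet is invented for illustration, and it upper-cases with ToUpperInvariant rather than lower-casing, which is a common choice for culture-independent matching:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical container: normalizes once on the way in,
// so lookups are plain ordinal equality checks.
public sealed class NormalizedStringSet
{
    private readonly HashSet<string> _keys = new HashSet<string>(StringComparer.Ordinal);

    // Trim and case-fold once, at storage time.
    public static string Normalize(string s) => s.Trim().ToUpperInvariant();

    // Returns false if an equivalent string was already stored.
    public bool Add(string raw) => _keys.Add(Normalize(raw));

    // The query is normalized once; the stored keys are never touched again.
    public bool Contains(string raw) => _keys.Contains(Normalize(raw));
}
```

The allocation cost is paid once per stored string and once per query, instead of on every pairwise comparison.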

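【Discussion】:

Well, there's nothing wrong with using the more efficient method in this case. Using String.Compare isn't a "clever" trick; it's a built-in way to compare strings that is also more efficient than calling ToUpper().ToLower(). Its intent is clearer too, so I don't think you can make a valid "premature optimization" case here.
I think you mean Trim().ToLower()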

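【Solution 4】:

Can't you just trim (and possibly lowercase) each string exactly once, when you obtain it? Doing more sounds like premature optimization...

【Discussion】:

Sure, in some cases I could do that; I just wanted to see whether I could come up with an optimized, general-purpose way of doing it.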

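【Solution 5】:

The thing is, if it needs to be done, it needs to be done. I don't think any of your different solutions will make a difference. In each case, a number of comparisons is needed to find or remove the white space.

Apparently, removing white space is part of the problem, so you shouldn't worry about that.
Lowercasing a string before comparing is a bug if you are using Unicode characters, and it is possibly slower than copying the strings.

【Discussion】:

【Solution 6】:

The warnings about premature optimization are valid, but I'll assume you've tested this and found that a lot of time is being wasted copying strings. In that case, I would try the following: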

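int startIndex1, length1, startIndex2, length2;
FindStartAndLength(txt1, out startIndex1, out length1);
FindStartAndLength(txt2, out startIndex2, out length2);

// Compare only the characters both trimmed spans share, so trailing
// white space never takes part; Math.Max would read past the shorter span.
int compareLength = Math.Min(length1, length2);
int result = string.Compare(txt1, startIndex1, txt2, startIndex2, compareLength,
    StringComparison.OrdinalIgnoreCase);
// Equal prefixes: the shorter trimmed string sorts first.
if (result == 0)
    result = length1.CompareTo(length2);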

FindStartAndLength is a function that finds the start index and length of the "trimmed" string (untested, but it should give the general idea):

static void FindStartAndLength(string text, out int startIndex, out int length)
{
    // Check the bounds before indexing, so that empty or
    // all-white-space strings don't throw.
    startIndex = 0;
    while (startIndex < text.Length && char.IsWhiteSpace(text[startIndex]))
        startIndex++;

    length = text.Length - startIndex;
    while (length > 0 && char.IsWhiteSpace(text[startIndex + length - 1]))
        length--;
}

【Discussion】:

【Solution 7】:

You could implement your own StringComparer. Here's a basic implementation:

public class TrimmingStringComparer : StringComparer
{
    private StringComparison _comparisonType;

    public TrimmingStringComparer()
        : this(StringComparison.CurrentCulture)
    {
    }

    public TrimmingStringComparer(StringComparison comparisonType)
    {
        _comparisonType = comparisonType;
    }

    public override int Compare(string x, string y)
    {
        int indexX;
        int indexY;
        int lengthX = TrimString(x, out indexX);
        int lengthY = TrimString(y, out indexY);

        if (lengthX <= 0 && lengthY <= 0)
            return 0; // both strings contain only white space

        if (lengthX <= 0)
            return -1; // x contains only white space, y doesn't

        if (lengthY <= 0)
            return 1; // y contains only white space, x doesn't

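        // Compare the characters both trimmed spans share; if they match,
        // the shorter span sorts first. Returning on length alone would
        // order "b" before "aa".
        int result = String.Compare(x, indexX, y, indexY, Math.Min(lengthX, lengthY), _comparisonType);
        return result != 0 ? result : lengthX.CompareTo(lengthY);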
    }

    public override bool Equals(string x, string y)
    {
        return Compare(x, y) == 0;
    }

    public override int GetHashCode(string obj)
    {
        throw new NotImplementedException();
    }

    private int TrimString(string s, out int index)
    {
        index = 0;
        while (index < s.Length && Char.IsWhiteSpace(s, index)) index++;
        int last = s.Length - 1;
        while (last >= 0 && Char.IsWhiteSpace(s, last)) last--;
        return last - index + 1;
    }
}

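Remarks:

It hasn't been extensively tested and may contain bugs
Performance is yet to be evaluated (but it's probably better than calling Trim and ToLower anyway)
The GetHashCode method is not implemented, so don't use it as the comparer for a dictionary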
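
If you do need a dictionary-ready comparer and can target .NET Core 3.0 or later, one possible sketch is to lean on spans, which postdate this thread. This is not the implementation above: the class name TrimmedOrdinalIgnoreCaseComparer is made up for the example, and it only handles ordinal, case-insensitive equality, not ordering. ReadOnlySpan<char>.Trim() returns a trimmed view without copying, and the static string.GetHashCode(ReadOnlySpan<char>, StringComparison) overload hashes that view directly:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical comparer; assumes .NET Core 3.0 or later for the span APIs.
public sealed class TrimmedOrdinalIgnoreCaseComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        // Trim() on a span yields a view over the same characters, not a copy.
        return x.AsSpan().Trim().Equals(y.AsSpan().Trim(), StringComparison.OrdinalIgnoreCase);
    }

    public int GetHashCode(string s)
    {
        if (s == null) return 0;
        // Hash the trimmed view with the same comparison rules as Equals,
        // so equal keys always land in the same bucket.
        return string.GetHashCode(s.AsSpan().Trim(), StringComparison.OrdinalIgnoreCase);
    }
}
```

It can then be passed to a collection, for example new Dictionary<string, int>(new TrimmedOrdinalIgnoreCaseComparer()).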

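【Discussion】:

【Solution 8】:

I notice that your first suggestion only compares for equality rather than for ordering, which allows some further efficiency savings.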

public static bool TrimmedOrdinalIgnoreCaseEquals(string x, string y)
{
    //Always check for identity (same reference) first for
    //any comparison (equality or otherwise) that could take some time.
    //Identity always entails equality, and equality always entails
    //equivalence.
    if(ReferenceEquals(x, y))
        return true;
    //We already know they aren't both null as ReferenceEquals(null, null)
    //returns true.
    if(x == null || y == null)
        return false;
    int startX = 0;
    //note we keep this one further than the last char we care about.
    int endX = x.Length;
    int startY = 0;
    //likewise, one further than we care about.
    int endY = y.Length;
    while(startX != endX && char.IsWhiteSpace(x[startX]))
        ++startX;
    while(startY != endY && char.IsWhiteSpace(y[startY]))
        ++startY;
    if(startX == endX)      //Empty when trimmed.
        return startY == endY;
    if(startY == endY)
        return false;
    //lack of bounds checking is safe as we would have returned
    //already in cases where endX and endY can fall below zero.
    while(char.IsWhiteSpace(x[endX - 1]))
        --endX;
    while(char.IsWhiteSpace(y[endY - 1]))
        --endY;
    //From this point on I am assuming you do not care about
    //the complications of case-folding, based on your example
    //referencing the ordinal version of string comparison
    if(endX - startX != endY - startY)
        return false;
    while(startX != endX)
    {
        //trade-off: with some data a case-sensitive
        //comparison first
        //could be more efficient.
        if(
            char.ToLowerInvariant(x[startX++])
            != char.ToLowerInvariant(y[startY++])
        )
            return false;
    }
    return true;
}

And of course, what's an equality checker without a matching hash code producer:

public static int TrimmedOrdinalIgnoreCaseHashCode(string str)
{
    //Higher CMP_NUM (or get rid of it altogether) gives
    //better hash, at cost of taking longer to compute.
    const int CMP_NUM = 12;
    if(str == null)
        return 0;
    int start = 0;
    int end = str.Length;
    while(start != end && char.IsWhiteSpace(str[start]))
        ++start;
    if(start != end)
        while(char.IsWhiteSpace(str[end - 1]))
            --end;

    int skipOn = (end - start) / CMP_NUM + 1;
    int ret = 757602046; // no harm matching native .NET with empty string.
    while(start < end)
    {
        //prime numbers are our friends.
        ret = unchecked(ret * 251 + (int)(char.ToLowerInvariant(str[start])));
        start += skipOn;
    }
    return ret;
}

【Discussion】: