Case-insensitive std::set of strings: answers

【Question title】: Case insensitive std::set of strings
【Posted】: 2023-03-15 05:50:01
【Question】:

How can I insert into or search a std::set of strings case-insensitively?

For example:

std::set<std::string> s;
s.insert("Hello");
s.insert("HELLO"); //not allowed, string already exists.

【Discussion】:

Can you clarify what you mean by "case-insensitive insertion"?

Tags: c++ stl

【Solution 1】:

You need to define a custom comparator:

#include <set>
#include <string>
#include <strings.h>  // strcasecmp (POSIX, not standard C++)

struct InsensitiveCompare {
    bool operator() (const std::string& a, const std::string& b) const {
        return strcasecmp(a.c_str(), b.c_str()) < 0;
    }
};

std::set<std::string, InsensitiveCompare> s;

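If strcasecmp is not available, you can try stricmp or strcoll. Note that strcoll compares by the current locale's collation rules, so it ignores case only if the locale's collation does.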

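【Discussion】:

When I read InsensitiveCompare, I couldn't help thinking of my mother-in-law. +1.
If the strings contain NUL characters, this approach will not work correctly. See this question.
stricmp is not standard C or C++, nor is it POSIX or ANSI. It is not available in GCC, but strcasecmp is at least standard POSIX and has the same signature. That said, it is easy to write your own implementation using std::tolower().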
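
A minimal sketch of the std::tolower() approach suggested in the last comment. The names CaseInsensitiveLess and contains_ignoring_case are hypothetical and do not come from the answers. Because it compares the full range [begin, end) rather than a C string, it should also cope with embedded NUL characters. It assumes single-byte characters in the current C locale.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <set>
#include <string>

// Hypothetical name: a sketch, not the comparator from the answer above.
struct CaseInsensitiveLess {
    bool operator()(const std::string& a, const std::string& b) const {
        // Compares the whole strings, so an embedded '\0' does not end the comparison.
        return std::lexicographical_compare(
            a.begin(), a.end(), b.begin(), b.end(),
            [](unsigned char x, unsigned char y) {
                // std::tolower needs a value representable as unsigned char.
                return std::tolower(x) < std::tolower(y);
            });
    }
};

// Hypothetical helper: true if `key` is present, ignoring case.
inline bool contains_ignoring_case(
    const std::set<std::string, CaseInsensitiveLess>& s,
    const std::string& key) {
    return s.find(key) != s.end();
}
```

With this comparator, inserting "HELLO" after "Hello" leaves a single element in the set, and a lookup for "hello" finds it.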

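【Solution 2】:

std::set lets you supply your own comparator (as do most standard containers). You can then perform whatever kind of comparison you like. A full example is available here.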
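
As a rough illustration of "supply your own comparator", here is a sketch that passes a plain function to the set at construction time. The names iless, IStringSet and make_iset are made up for this example. Any strict weak ordering could be substituted for iless.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <set>
#include <string>

// Hypothetical ordering function: case-insensitive "less than" for narrow strings.
inline bool iless(const std::string& a, const std::string& b) {
    return std::lexicographical_compare(
        a.begin(), a.end(), b.begin(), b.end(),
        [](unsigned char x, unsigned char y) {
            return std::tolower(x) < std::tolower(y);
        });
}

// The comparator *type* is the second template argument ...
using IStringSet =
    std::set<std::string, bool (*)(const std::string&, const std::string&)>;

// ... and the comparator *object* is handed to the constructor.
inline IStringSet make_iset() {
    return IStringSet(&iless);
}
```

The set treats two keys as duplicates when neither orders before the other, so after inserting "Hello" a later insert of "HELLO" is rejected and find("hello") returns the stored "Hello".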

【Discussion】:

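【Solution 3】:

Here is a generic solution that also works with string types other than std::string (tested with std::wstring, std::string_view, char const*). Basically, anything that defines a range of characters should work.

The key is boost::as_literal, which lets the comparator treat null-terminated character arrays, character pointers, and ranges uniformly.

Generic code ("iset.h"):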

#pragma once
#include <set>
#include <algorithm>
#include <boost/algorithm/string.hpp>
#include <boost/range/as_literal.hpp>

// Case-insensitive generic string comparator.
struct range_iless
{
    template< typename InputRange1, typename InputRange2 >
    bool operator()( InputRange1 const& r1, InputRange2 const& r2 ) const 
    {
        // include the standard begin() and end() as well as any custom overloads for ADL
        using std::begin; using std::end;  

        // Treat null-terminated character arrays, character pointers and ranges uniformly.
        // This just creates cheap iterator ranges (it doesn't copy container arguments)!
        auto ir1 = boost::as_literal( r1 );
        auto ir2 = boost::as_literal( r2 );

        // Compare case-insensitively.
        return std::lexicographical_compare( 
            begin( ir1 ), end( ir1 ), 
            begin( ir2 ), end( ir2 ), 
            boost::is_iless{} );
    }
};

// Case-insensitive set for any Key that consists of a range of characters.
template< class Key, class Allocator = std::allocator<Key> >
using iset = std::set< Key, range_iless, Allocator >;

Usage example ("main.cpp"):

#include "iset.h"  // above header file
#include <iostream>
#include <string>
#include <string_view>

// Output range to stream.
template< typename InputRange, typename Stream, typename CharT >
void write_to( Stream& s, InputRange const& r, CharT const* sep )
{
    for( auto const& elem : r )
        s << elem << sep;
    s << std::endl;
}

int main()
{
    iset< std::string  >     s1{  "Hello",  "HELLO",  "world" };
    iset< std::wstring >     s2{ L"Hello", L"HELLO", L"world" };
    iset< char const*  >     s3{  "Hello",  "HELLO",  "world" };
    iset< std::string_view > s4{  "Hello",  "HELLO",  "world" };

    write_to( std::cout,  s1,  " " );    
    write_to( std::wcout, s2, L" " );    
    write_to( std::cout,  s3,  " " );    
    write_to( std::cout,  s4,  " " );    
}

Live Demo at Coliru
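
If Boost is not an option, a similar effect for narrow (char) strings can probably be had with the standard library alone. The sketch below is not the answer's code: sv_iless and narrow_iset are hypothetical names, it swaps boost::as_literal for std::string_view (C++17), and it does not cover std::wstring.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <set>
#include <string>
#include <string_view>

// Hypothetical stand-in for range_iless, limited to narrow character strings.
struct sv_iless {
    // Opts in to heterogeneous lookup, e.g. find() with a string_view or literal.
    using is_transparent = void;

    // std::string, std::string_view and char const* all convert to string_view
    // without copying the characters.
    bool operator()(std::string_view a, std::string_view b) const {
        return std::lexicographical_compare(
            a.begin(), a.end(), b.begin(), b.end(),
            [](unsigned char x, unsigned char y) {
                return std::tolower(x) < std::tolower(y);
            });
    }
};

// Case-insensitive set for keys convertible to std::string_view.
template <class Key>
using narrow_iset = std::set<Key, sv_iless>;
```

One design difference from the Boost version: a char const* key is read up to its terminating NUL, whereas std::string and std::string_view keys are compared over their full length.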

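【Discussion】:

【Solution 4】:

From what I have read, this is more portable than stricmp(), because stricmp() is not actually part of the standard library; it is merely implemented by most compiler vendors. So below is a solution I rolled myself.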

#include <string>
#include <cctype>
#include <iostream>
#include <set>

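struct caseInsensitiveLess
{
  // const so the comparator can be invoked through a const std::set.
  bool operator()(const std::string& x, const std::string& y) const
  {
    const std::size_t xs = x.size();
    const std::size_t ys = y.size();
    const std::size_t bound = (xs < ys) ? xs : ys;

    for (std::size_t i = 0; i < bound; ++i)
    {
      // Cast to unsigned char: passing a negative char to tolower is undefined.
      const int cx = std::tolower(static_cast<unsigned char>(x[i]));
      const int cy = std::tolower(static_cast<unsigned char>(y[i]));

      if (cx < cy)
        return true;

      if (cy < cx)
        return false;
    }

    // The common prefix is equal, so the shorter string orders first.
    // Returning false here would treat "abc" and "abcd" as the same key.
    return xs < ys;
  }
};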

int main()
{
  std::set<std::string, caseInsensitiveLess> ss1;
  std::set<std::string> ss2;

  ss1.insert("This is the first string");
  ss1.insert("THIS IS THE FIRST STRING");
  ss1.insert("THIS IS THE SECOND STRING");
  ss1.insert("This IS THE SECOND STRING");
  ss1.insert("This IS THE Third");

  ss2.insert("this is the first string");
  ss2.insert("this is the first string");
  ss2.insert("this is the second string");
  ss2.insert("this is the second string");
  ss2.insert("this is the third");

  for ( auto& i: ss1 )
   std::cout << i << std::endl;

  std::cout << std::endl;

  for ( auto& i: ss2 )
   std::cout << i << std::endl;

}

Output for the case-insensitive set and the regular set, showing the same ordering:

This is the first string
THIS IS THE SECOND STRING
This IS THE Third

this is the first string
this is the second string
this is the third

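【Discussion】:

A small comment: if you are working with, say, Greek text, this will not work, because in general a case-insensitive comparison cannot be done with tolower alone. The textbook example of this impossibility is ΌΣΟΣ, which is case-insensitively the same as όσος (Greek for "as many as").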