String functions in ClickHouse

Introduction

This article walks through the functions ClickHouse provides for working with strings.

empty: checks whether a string is empty; returns 1 if it is and 0 if it is not

notEmpty: checks whether a string is non-empty; returns 1 if it is and 0 if it is not

SELECT empty(''), empty('satori');
/*
┌─empty('')─┬─empty('satori')─┐
│         1 │               0 │
└───────────┴─────────────────┘
*/

SELECT notEmpty(''), notEmpty('satori');
/*
┌─notEmpty('')─┬─notEmpty('satori')─┐
│            0 │                  1 │
└──────────────┴────────────────────┘
*/

length: returns the length of a string in bytes

char_length: returns the length of a string in characters

WITH 'satori' AS s1, '古明地觉' AS s2
SELECT length(s1), length(s2), char_length(s1), char_length(s2);
/*
┌─length(s1)─┬─length(s2)─┬─CHAR_LENGTH(s1)─┬─CHAR_LENGTH(s2)─┐
│          6 │         12 │               6 │               4 │
└────────────┴────────────┴─────────────────┴─────────────────┘
*/

toString: converts integers and dates to strings

SELECT toString(3), cast(3 AS String);
/*
┌─toString(3)─┬─CAST(3, 'String')─┐
│ 3           │ 3                 │
└─────────────┴───────────────────┘
*/

Besides cast, every data type has a matching built-in conversion function named to + type, for example toInt8, toUInt32, toFloat64, toDecimal64, and so on.
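As a small illustration of the to + type naming, the query below converts a few string literals and uses toTypeName to confirm the resulting types. The literal values and column aliases are made up for this sketch.

```sql
-- Convert string literals with the to<Type> family, then inspect the result types.
SELECT
    toInt8('12')           AS i8,
    toUInt32('4000000000') AS u32,
    toFloat64('3.14')      AS f64,
    toTypeName(i8)         AS i8_type,   -- expected: Int8
    toTypeName(u32)        AS u32_type,  -- expected: UInt32
    toTypeName(f64)        AS f64_type;  -- expected: Float64
```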

lower, lcase: convert a string to lowercase

upper, ucase: convert a string to uppercase

SELECT lower('SAtoRI'), upper('SAtoRI');
/*
┌─lower('SAtoRI')─┬─upper('SAtoRI')─┐
│ satori          │ SATORI          │
└─────────────────┴─────────────────┘
*/

repeat: repeats a string n times

SELECT repeat('abc', 3);
/*
┌─repeat('abc', 3)─┐
│ abcabcabc        │
└──────────────────┘
*/

reverse: reverses a string

SELECT reverse('satori');
/*
┌─reverse('satori')─┐
│ irotas            │
└───────────────────┘
*/

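Note: reverse works byte by byte, so it cannot be used on Chinese text (or any other multi-byte UTF-8 text). To reverse such strings, use reverseUTF8. Try it yourself.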
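A minimal sketch of the difference, assuming the strings are valid UTF-8 (the column aliases are chosen for illustration only):

```sql
-- reverse() flips bytes, which breaks multi-byte characters apart;
-- reverseUTF8() flips whole code points instead.
SELECT
    reverseUTF8('古明地觉')      AS by_chars,        -- expected: 觉地明古
    length(reverse('古明地觉')) AS reversed_bytes;  -- still 12 bytes, but no longer valid UTF-8
```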

format: formats a string

SELECT format('{}--{}', 'hello', 'world');
/*
┌─format('{}--{}', 'hello', 'world')─┐
│ hello--world                       │
└────────────────────────────────────┘
*/

-- The number of {} placeholders must match the number of arguments; numbered placeholders, as below, are the exception
SELECT format('{0}--{1}--{0}', 'hello', 'world');
/*
┌─format('{0}--{1}--{0}', 'hello', 'world')─┐
│ hello--world--hello                       │
└───────────────────────────────────────────┘
*/

concat: concatenates strings

SELECT concat('a', 'b', 'c');
/*
┌─concat('a', 'b', 'c')─┐
│ abc                   │
└───────────────────────┘
*/

Strings can also be concatenated with the double-pipe operator:

SELECT 'a' || 'b' || 'c';
/*
┌─concat('a', 'b', 'c')─┐
│ abc                   │
└───────────────────────┘
*/

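substring: extracts part of a string; it can also be written as mid or substr. It works like substring in standard SQL, with one difference

-- Start at position 2 and take 3 bytes. Here is the difference: positions and lengths are counted in bytes
SELECT substring('abcdefg', 2, 3);
/*
┌─substring('abcdefg', 2, 3)─┐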
│ bcd                        │
└────────────────────────────┘
*/

-- To slice by characters instead, use substringUTF8
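To make the byte versus character distinction concrete, here is a small sketch that pulls the same two characters out of a Chinese string both ways. It assumes UTF-8 input in which each Chinese character occupies 3 bytes; the aliases are illustrative.

```sql
SELECT
    substring('古明地觉', 4, 6)     AS by_bytes,  -- bytes 4..9, expected: 明地
    substringUTF8('古明地觉', 2, 2) AS by_chars;  -- characters 2..3, expected: 明地
```

With substring, an offset or length that falls inside a multi-byte character would cut that character in half, which is why substringUTF8 is the safer choice for non-ASCII text.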

appendTrailingCharIfAbsent: if the non-empty string s does not end with the character c, appends c to the end of s

SELECT appendTrailingCharIfAbsent('satori', 'i'),
       appendTrailingCharIfAbsent('sator', 'i');
/*
┌─appendTrailingCharIfAbsent('satori', 'i')─┬─appendTrailingCharIfAbsent('sator', 'i')─┐
│ satori                                    │ satori                                   │
└───────────────────────────────────────────┴──────────────────────────────────────────┘
*/

convertCharset: converts a string from one character set to another

SELECT convertCharset('satori', 'ascii', 'utf8');
/*
┌─convertCharset('satori', 'ascii', 'utf8')─┐
│ satori                                    │
└───────────────────────────────────────────┘
*/

base64Encode: encodes a string as base64

base64Decode: decodes a base64-encoded string

SELECT base64Encode('satori') s1, base64Decode(s1);
/*
┌─s1───────┬─base64Decode(base64Encode('satori'))─┐
│ c2F0b3Jp │ satori                               │
└──────────┴──────────────────────────────────────┘
*/

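There is also tryBase64Decode, which works like base64Decode but returns an empty string when decoding fails. With base64Decode, a string that is not valid base64 either fails with an exception (the behavior the ClickHouse documentation describes) or yields garbage, so prefer tryBase64Decode for input you do not control.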
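A short sketch of tryBase64Decode on valid and invalid input. The invalid literal is an arbitrary example, and the expected values follow the documented "empty string on failure" behavior rather than a run on a specific server version.

```sql
SELECT
    tryBase64Decode('c2F0b3Jp')    AS ok,            -- expected: satori
    tryBase64Decode('not base64!') AS bad,           -- invalid input, expected: empty string
    empty(bad)                     AS bad_is_empty;  -- expected: 1
```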

startsWith, endsWith: check whether a string starts or ends with a given substring; return 1 if it does and 0 otherwise

SELECT startsWith('古明地觉', '古明') v1, endsWith('古明地觉', '古明') v2;
/*
┌─v1─┬─v2─┐
│  1 │  0 │
└────┴────┘
*/

trim: removes characters from both ends of a string

SELECT trim('   satori    ') s, length(s);
/*
┌─s──────┬─length(trimBoth('   satori    '))─┐
│ satori │                                 6 │
└────────┴───────────────────────────────────┘
*/

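-- Spaces are removed by default, but other characters can be removed as well.
-- In that case you must say whether to strip from the left, the right, or both ends:
-- LEADING is the left, TRAILING is the right, BOTH is both ends.
-- As the result shows, 'ab' is treated as a set of characters, not as a substring.
SELECT trim(BOTH 'ab' FROM 'abxxxxxxbaaa') s1,
       trim(LEADING 'ab' FROM 'abxxxxxxbaaa') s2,
       trim(TRAILING 'ab' FROM 'abxxxxxxbaaa') s3;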
/*
┌─s1─────┬─s2─────────┬─s3───────┐
│ xxxxxx │ xxxxxxbaaa │ abxxxxxx │
└────────┴────────────┴──────────┘
*/

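When trim receives just a plain string, it removes the spaces at both ends. Accordingly there are also trimLeft and trimRight, which take a plain string and remove the spaces on the left or on the right only. trimLeft can also be written as ltrim, and trimRight as rtrim.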
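The one-sided variants can be sketched as follows. The results are wrapped in brackets purely so that the remaining whitespace is visible; the input literal and aliases are illustrative.

```sql
SELECT
    concat('[', trimLeft('  satori  '), ']')  AS l,  -- expected: [satori  ]
    concat('[', trimRight('  satori  '), ']') AS r,  -- expected: [  satori]
    concat('[', trimBoth('  satori  '), ']')  AS b;  -- expected: [satori]
```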

CRC32: returns the CRC32 checksum of a string, using the CRC-32-IEEE 802.3 polynomial with an initial value of 0xFFFFFFFF

CRC32IEEE: returns the CRC32 checksum of a string, using the CRC-32-IEEE 802.3 polynomial

CRC64: returns the CRC64 checksum of a string, using the CRC-64-ECMA polynomial

SELECT CRC32('satori'), CRC32IEEE('satori'), CRC64('satori');
/*
┌─CRC32('satori')─┬─CRC32IEEE('satori')─┬─────CRC64('satori')─┐
│       379058543 │          2807388364 │ 1445885890712067336 │
└─────────────────┴─────────────────────┴─────────────────────┘
*/

encodeXMLComponent: escapes the five characters <, &, >, " and ' in a string

decodeXMLComponent: reverses that escaping for the same five characters <, &, >, " and '

SELECT encodeXMLComponent('<name>');
/*
┌─encodeXMLComponent('<name>')─┐
│ &lt;name&gt;                 │
└──────────────────────────────┘
*/

SELECT decodeXMLComponent('&lt;name&gt;');
/*
┌─decodeXMLComponent('&lt;name&gt;')─┐
│ <name>                             │
└────────────────────────────────────┘
*/

position: finds the position of a substring within a string

SELECT position('abcdefg', 'de');
/*
┌─position('abcdefg', 'de')─┐
│                         4 │
└───────────────────────────┘
*/

-- The search can also start from a given position
SELECT position('hello world', 'o', 1), position('hello world', 'o', 7);
/*
┌─position('hello world', 'o', 1)─┬─position('hello world', 'o', 7)─┐
│                               5 │                               8 │
└─────────────────────────────────┴─────────────────────────────────┘
*/

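This function is case-sensitive; for a case-insensitive search, use positionCaseInsensitive. Also note that positions are counted in bytes.

position('古明地觉A', 'A') returns 13, because each Chinese character occupies 3 bytes

To count in characters when the string contains Chinese text, use positionUTF8.

positionUTF8('古明地觉A', 'A') returns 5

If the substring is not found, the result is 0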
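The variants described above can be compared in one query. This is only a sketch: the 'Hello' and 'xyz' literals and the aliases are added for illustration.

```sql
SELECT
    position('古明地觉A', 'A')             AS byte_pos,  -- expected: 13
    positionUTF8('古明地觉A', 'A')         AS char_pos,  -- expected: 5
    positionCaseInsensitive('Hello', 'L') AS ci_pos,    -- expected: 3
    position('abcdefg', 'xyz')            AS missing;   -- expected: 0
```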

multiSearchAllPositions: finds the positions of several substrings within a string; the substrings are passed as an array

SELECT multiSearchAllPositions('satori', ['sa', 'to', 'ri', 'xxx']);
/*
┌─multiSearchAllPositions('satori', ['sa', 'to', 'ri', 'xxx'])─┐
│ [1,3,5,0]                                                    │
└──────────────────────────────────────────────────────────────┘
*/

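For a case-insensitive search, use multiSearchAllPositionsCaseInsensitive. Like position, this function searches the byte sequence without regard to character encoding; to handle non-ASCII characters, use multiSearchAllPositionsUTF8.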
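A brief sketch of the byte-based and UTF-8 variants side by side, reusing the Chinese sample string from the position examples (3 bytes per Chinese character is assumed; the aliases are illustrative):

```sql
SELECT
    multiSearchAllPositions('古明地觉A', ['明', 'A'])     AS byte_positions,  -- expected: [4,13]
    multiSearchAllPositionsUTF8('古明地觉A', ['明', 'A']) AS char_positions;  -- expected: [2,5]
```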

match: regular expression matching; returns 1 if the string matches the pattern and 0 if it does not

-- The string goes on the left, the pattern on the right
SELECT match('123', '\\d{1,3}'), match('abcd', '\\d{1,3}');
/*
┌─match('123', '\\d{1,3}')─┬─match('abcd', '\\d{1,3}')─┐
│                        1 │                         0 │
└──────────────────────────┴───────────────────────────┘
*/

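A backslash is itself the escape character in a string literal, so to express \d you have to write \\d. By the same logic, to test whether a string contains a backslash, write:

SELECT match(s, '\\\\');

The string literal's escaping turns the four backslashes into two literal backslashes. The backslash is also special in regular expressions, so the regex engine then reads those two backslashes as one literal backslash.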
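To try the four-backslash pattern without a table, the column s can be swapped for literals. The two path-like strings below are made-up test values; note that the first literal also needs its own backslash doubled.

```sql
SELECT
    match('C:\\temp', '\\\\') AS has_backslash,  -- the literal is C:\temp, expected: 1
    match('C:/temp', '\\\\')  AS no_backslash;   -- expected: 0
```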

multiMatchAny: regular expression matching against several patterns at once; returns 1 if any pattern matches and 0 if none does

SELECT match('satori', 'xx'), match('satori', 'satori');
/*
┌─match('satori', 'xx')─┬─match('satori', 'satori')─┐
│                     0 │                         1 │
└───────────────────────┴───────────────────────────┘
*/

SELECT multiMatchAny('satori', ['xx', 'satori']);
/*
┌─multiMatchAny('satori', ['xx', 'satori'])─┐
│                                         1 │
└───────────────────────────────────────────┘
*/

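multiMatchAnyIndex: regular expression matching against several patterns; returns the index of a pattern that matches (the first matching one in the examples here, although the ClickHouse documentation only promises some matching index)

-- Clearly 'satori' matches, and its index is 3
SELECT multiMatchAnyIndex('satori', ['yy', 'xx', 'satori']);
/*
┌─multiMatchAnyIndex('satori', ['yy', 'xx', 'satori'])─┐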
│                                                    3 │
└──────────────────────────────────────────────────────┘
*/

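If no pattern matches, the result is 0. Indexes start at 1, so 0 signals that nothing matched. In most programming languages, where indexes start at 0, the equivalent result would be -1.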

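multiMatchAllIndices: regular expression matching against several patterns; returns the indexes of all patterns that match

-- The patterns at indexes 2 and 3 both match, but multiMatchAnyIndex returns only one of them
SELECT multiMatchAnyIndex('satori', ['yy', 'sa', 'satori']);
/*
┌─multiMatchAnyIndex('satori', ['yy', 'sa', 'satori'])─┐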
│                                                    2 │
└──────────────────────────────────────────────────────┘
*/


-- Return every index that matches
SELECT multiMatchAllIndices('satori', ['yy', 'sa', 'satori']);
/*
┌─multiMatchAllIndices('satori', ['yy', 'sa', 'satori'])─┐
│ [2,3]                                                  │
└────────────────────────────────────────────────────────┘
*/

extract: returns the part of a string matched by a regular expression

-- As we can see, matching is greedy
SELECT extract('satori', '\\w{1,3}');
/*
┌─extract('satori', '\\w{1,3}')─┐
│ sat                           │
└───────────────────────────────┘
*/

-- Use non-greedy matching
SELECT extract('satori', '\\w{1,3}?');
/*
┌─extract('satori', '\\w{1,3}?')─┐
│ s                              │
└────────────────────────────────┘
*/

If nothing matches, an empty string is returned.

extractAll: extract returns only the first matching substring, whereas extractAll returns all of them

SELECT extract('abc abd abe', 'ab.'), extractAll('abc abd abe', 'ab.');
/*
┌─extract('abc abd abe', 'ab.')─┬─extractAll('abc abd abe', 'ab.')─┐
│ abc                           │ ['abc','abd','abe']              │
└───────────────────────────────┴──────────────────────────────────┘
*/

extractAllGroupsHorizontal, extractAllGroupsVertical: match capture groups; an example explains them best

SELECT extractAllGroupsHorizontal('2020-01-05 2020-02-21 2020-11-13',
                                  '(\\d{4})-(\\d{2})-(\\d{2})');
/*
┌─extractAllGroupsHorizontal('2020-01-05 2020-02-21 2020-11-13', '(\\d{4})-(\\d{2})-(\\d{2})')─┐
│ [['2020','2020','2020'],['01','02','11'],['05','21','13']]                                   │
└──────────────────────────────────────────────────────────────────────────────────────────────┘
*/

SELECT extractAllGroupsVertical('2020-01-05 2020-02-21 2020-11-13',
                                '(\\d{4})-(\\d{2})-(\\d{2})');
/*
┌─extractAllGroupsVertical('2020-01-05 2020-02-21 2020-11-13', '(\\d{4})-(\\d{2})-(\\d{2})')─┐
│ [['2020','01','05'],['2020','02','21'],['2020','11','13']]                                 │
└────────────────────────────────────────────────────────────────────────────────────────────┘
*/

ClickHouse offers two layouts for group matches. Regex libraries in most programming languages return the second one. In fact, extractAllGroupsVertical is also somewhat faster than extractAllGroupsHorizontal.

When nothing matches, the result is an empty list.

SELECT extractAllGroupsHorizontal('2020-01-05 2020-02-21 2020-11-13',
                                  '(\\d{10})-(\\d{20})-(\\d{20})');
/*
┌─extractAllGroupsHorizontal('2020-01-05 2020-02-21 2020-11-13', '(\\d{10})-(\\d{20})-(\\d{20})')─┐
│ [[],[],[]]                                                                                      │
└─────────────────────────────────────────────────────────────────────────────────────────────────┘
*/

SELECT extractAllGroupsVertical('2020-01-05 2020-02-21 2020-11-13',
                                '(\\d{10})-(\\d{20})-(\\d{20})');
/*
┌─extractAllGroupsVertical('2020-01-05 2020-02-21 2020-11-13', '(\\d{10})-(\\d{20})-(\\d{20})')─┐
│ []                                                                                            │
└───────────────────────────────────────────────────────────────────────────────────────────────┘
*/

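extractAllGroupsHorizontal effectively merges the matches group by group, in order, so the result here is a list of 3 empty lists, one for each of the three groups in the pattern.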

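like: the WHERE clause has the LIKE operator, but like is also a function, and both follow the same rules

-- % matches any number of any characters; _ matches exactly one character
-- \ is the escape character
SELECT like('satori', 'sa%'), like('satori', 'sa_');

Besides like there is also notLike, as well as the case-insensitive ilike.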
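The query below sketches how the function forms and the operator form behave on a few sample inputs; the uppercase literal and the aliases are added for illustration.

```sql
SELECT
    like('satori', 'sa%')    AS prefix_match,   -- expected: 1
    like('satori', 'sa_')    AS one_char_only,  -- expected: 0, '_' matches exactly one character
    notLike('satori', 'sa%') AS negated,        -- expected: 0
    ilike('SATORI', 'sa%')   AS ignore_case,    -- expected: 1
    'satori' LIKE 'sa%'      AS operator_form;  -- expected: 1
```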

ngramDistance: measures how similar two strings are; the result lies between 0 and 1, and the more similar the strings, the closer it is to 0

SELECT ngramDistance('satori', 'satori');
/*
┌─ngramDistance('satori', 'satori')─┐
│                                 0 │
└───────────────────────────────────┘
*/

Note: if either string is longer than 32 KB, the similarity is not computed and the result is simply 1. The function is case-sensitive; to ignore case, use ngramDistanceCaseInsensitive. Likewise, for Chinese text use ngramDistanceUTF8 or ngramDistanceCaseInsensitiveUTF8.

countSubstrings: counts how many times a substring occurs in a string

SELECT countSubstrings('aaaa', 'aa'), countSubstrings('abc_abc', 'abc');
/*
┌─countSubstrings('aaaa', 'aa')─┬─countSubstrings('abc_abc', 'abc')─┐
│                             2 │                                 2 │
└───────────────────────────────┴───────────────────────────────────┘
*/

-- Start counting from a given position
SELECT countSubstrings('aabbaa', 'aa'), countSubstrings('aabbaa', 'aa', 3);
/*
┌─countSubstrings('aabbaa', 'aa')─┬─countSubstrings('aabbaa', 'aa', 3)─┐
│                               2 │                                  1 │
└─────────────────────────────────┴────────────────────────────────────┘
*/

For a case-insensitive count, use countSubstringsCaseInsensitive; for Chinese text, use countSubstringsCaseInsensitiveUTF8.
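A minimal sketch contrasting the case-sensitive and case-insensitive counts, using a made-up mixed-case input:

```sql
SELECT
    countSubstrings('AaBbAa', 'aa')                AS case_sensitive,    -- no exact 'aa', expected: 0
    countSubstringsCaseInsensitive('AaBbAa', 'aa') AS case_insensitive;  -- 'Aa' occurs twice, expected: 2
```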

countMatches: counts how many times a pattern matches in a string

SELECT countSubstrings('aaabbaa', 'aa'), countMatches('aaabbaa', 'a.');
/*
┌─countSubstrings('aaabbaa', 'aa')─┬─countMatches('aaabbaa', 'a.')─┐
│                                2 │                             3 │
└──────────────────────────────────┴───────────────────────────────┘
*/

replaceOne: replaces a given substring in a string, but only its first occurrence

SELECT replaceOne('hello cruel world, cruel', 'cruel', 'beautiful');
/*
┌─replaceOne('hello cruel world, cruel', 'cruel', 'beautiful')─┐
│ hello beautiful world, cruel                                 │
└──────────────────────────────────────────────────────────────┘
*/

To replace every occurrence, use replaceAll:

SELECT replaceAll('hello cruel world, cruel', 'cruel', 'beautiful');
/*
┌─replaceAll('hello cruel world, cruel', 'cruel', 'beautiful')─┐
│ hello beautiful world, beautiful                             │
└──────────────────────────────────────────────────────────────┘
*/

replaceRegexpOne: replaces the first part of a string that matches a regular expression

SELECT replaceRegexpOne('hello cruel world, cruel', 'cru..', 'beautiful');
/*
┌─replaceRegexpOne('hello cruel world, cruel', 'cru..', 'beautiful')─┐
│ hello beautiful world, cruel                                       │
└────────────────────────────────────────────────────────────────────┘
*/

To replace every match, use replaceRegexpAll:

SELECT replaceRegexpAll('hello cruel world, cruel', 'cru..', 'beautiful');
/*
┌─replaceRegexpAll('hello cruel world, cruel', 'cru..', 'beautiful')─┐
│ hello beautiful world, beautiful                                   │
└────────────────────────────────────────────────────────────────────┘
*/

splitByChar: splits a string on a given character and returns an array

-- The separator must be a single character
SELECT splitByChar('_', 'ABC_def_fgh');
/*
┌─splitByChar('_', 'ABC_def_fgh')─┐
│ ['ABC','def','fgh']             │
└─────────────────────────────────┘
*/

splitByString: splits a string on a given separator, which may be a single character or a longer string, and returns an array

-- The separator may consist of more than one character
SELECT splitByString('_', 'ABC_def_fgh'), splitByString('__', 'ABC__def__fgh');
/*
┌─splitByString('_', 'ABC_def_fgh')─┬─splitByString('__', 'ABC__def__fgh')─┐
│ ['ABC','def','fgh']               │ ['ABC','def','fgh']                  │
└───────────────────────────────────┴──────────────────────────────────────┘
*/

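This shows that splitByString can fully replace splitByChar, since it splits on single characters as well as on longer strings; a single character is just a string in ClickHouse anyway. Still, since ClickHouse provides both functions, my recommendation is to use splitByChar when splitting on a single character.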

splitByRegexp: splits a string wherever a regular expression matches and returns an array

SELECT splitByRegexp('\\d+', 'a12bc23de345f');
/*
┌─splitByRegexp('\\d+', 'a12bc23de345f')─┐
│ ['a','bc','de','f']                    │
└────────────────────────────────────────┘
*/

arrayStringConcat: joins the strings in an array into a single string

SELECT arrayStringConcat(['a', 'b', 'c', 'd'], '--');
/*
┌─arrayStringConcat(['a', 'b', 'c', 'd'], '--')─┐
│ a--b--c--d                                    │
└───────────────────────────────────────────────┘
*/
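Since the split functions return arrays and arrayStringConcat consumes one, the two combine naturally. The following sketch swaps one separator for another; the input string comes from the splitByChar example and the alias is illustrative.

```sql
-- Split on '_' and join the pieces again with '-'.
SELECT arrayStringConcat(splitByChar('_', 'ABC_def_fgh'), '-') AS rejoined;  -- expected: ABC-def-fgh
```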

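Summary

Strings are one of the most commonly used data types, so there are naturally many functions for working with them, but none of them is hard to use.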