【Posted】: 2015-08-03 22:23:52
【Question】:
I ran into a case where SQL Server stores “ｓｏｆｉａ” and “sofia” as two different strings, yet a T-SQL comparison treats them as equal regardless of the COLLATE clause, even with a binary collation:
CREATE TABLE #r (NAME NVARCHAR(255) COLLATE SQL_Latin1_General_CP1_CI_AS)
INSERT INTO #r VALUES (N'ｓｏｆｉａ')
INSERT INTO #r VALUES (N'sofia')
SELECT * FROM #r WHERE NAME = N'sofia'
ｓｏｆｉａ
sofia
(2 row(s) affected)
IF 'ｓｏｆｉａ' = 'sofia' COLLATE SQL_Latin1_General_CP1_CI_AS
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'
-------------------
Values are the same
(1 row(s) affected)
IF 'ｓｏｆｉａ' = 'sofia' COLLATE SQL_Latin1_General_CP437_BIN
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'
-------------------
Values are the same
(1 row(s) affected)
I tried to determine the encoding of "ｓｏｆｉａ" using the approach from:
http://stackoverflow.com/questions/1025332/determine-a-strings-encoding-in-c-sharp
The code there says:
// If all else fails, the encoding is probably (though certainly not
// definitely) the user's local codepage! One might present to the user a
// list of alternative encodings as shown here: http://stackoverflow.com/questions/8509339/what-is-the-most-common-encoding-of-each-language
// A full list can be found using Encoding.GetEncodings();
I iterated through all the encodings returned by Encoding.GetEncodings(); none of them matched.
Looking at the bytes, I found something interesting: “ｓｏｆｉａ” is itself encoded as UTF-16, and it can be generated from the UTF-16 encoding of "SOFIA" by setting the extra byte that accompanies each ASCII code to 255 (all bits 1) instead of 0 (for example, 'S' is 83 255 instead of 83 0). It is displayed as lower case. In C#:
“ｓｏｆｉａ”
[0] 83 byte
[1] 255 byte
[2] 79 byte
[3] 255 byte
[4] 70 byte
[5] 255 byte
[6] 73 byte
[7] 255 byte
[8] 65 byte
[9] 255 byte
"SOFIA"
[0] 83 byte
[1] 0 byte
[2] 79 byte
[3] 0 byte
[4] 70 byte
[5] 0 byte
[6] 73 byte
[7] 0 byte
[8] 65 byte
[9] 0 byte
"sofia"
[0] 115 byte
[1] 0 byte
[2] 111 byte
[3] 0 byte
[4] 102 byte
[5] 0 byte
[6] 105 byte
[7] 0 byte
[8] 97 byte
[9] 0 byte
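For reference, the UTF-16 byte layout described above can be reproduced with a minimal Python sketch. It is not part of the original C# test. It assumes the first string consists of the fullwidth code points U+FF53, U+FF4F, U+FF46, U+FF49, U+FF41 (inferred from the 83 255, 79 255, ... pattern), and the helper name utf16le_bytes is hypothetical.

```python
def utf16le_bytes(text):
    """Return the UTF-16LE encoding of text as a list of byte values."""
    return list(text.encode("utf-16-le"))


# Assumption: the first string is made of fullwidth forms
# (U+FF53 U+FF4F U+FF46 U+FF49 U+FF41), inferred from the byte dump.
fullwidth = "\uff53\uff4f\uff46\uff49\uff41"
upper = "SOFIA"
lower = "sofia"

# Print the byte values of each string, low byte first (little-endian).
for label, text in (("fullwidth", fullwidth), ("upper", upper), ("lower", lower)):
    print(label, utf16le_bytes(text))

# Distance between each assumed fullwidth code point and its ASCII lower-case letter.
offsets = {ord(f) - ord(a) for f, a in zip(fullwidth, lower)}
print("code point offsets:", [hex(o) for o in sorted(offsets)])
```

Under that assumption, the low byte of each fullwidth letter coincides with the ASCII code of the matching upper-case letter, which would explain why the dump looks like "SOFIA" with the high byte changed.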
One can create two different directories or files with these names, for example C:\ｓｏｆｉａ\ and C:\sofia\, or ｓｏｆｉａ.txt and sofia.txt.
Why does the SQL engine treat them as equal while storing them with their original byte streams?
To get exactly the row I want, I had to convert to binary first:
SELECT * FROM #r WHERE CONVERT(VARBINARY(100), Name) = CONVERT(VARBINARY(100), N'ｓｏｆｉａ')
ｓｏｆｉａ
(1 row(s) affected)
SELECT * FROM #r WHERE CONVERT(VARBINARY(100), Name) = CONVERT(VARBINARY(100), N'sofia')
sofia
(1 row(s) affected)
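But this has many side effects, for example around culture and case. How can I teach the T-SQL engine that they are different without too much cost?
Is there an official name for this kind of string encoding?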
【Comments】:
- I'm curious whether my answer helped you solve your problem.
Tags: c# sql-server unicode encoding collation