Decoding bytes as a unicode string: answers

【Question title】: Decoding bytes as unicode string
【Posted】: 2013-11-26 06:41:20
【Question】:

How do I extract a string that is stored as escaped bytes (warning!) inside a string? What I really mean is:

>>> s1 = '\\xd0\\xb1'  #  But this is NOT bytes of s1! s1 should be 'б'!
>>> s1
'\\xd0\\xb1'
>>> s1[0]
'\\'
>>> len(s1)            #  The problem is here: I thought I would see (2), but:
8
>>> type(s1)
<class 'str'>
>>> type(s1[0])
<class 'str'>
>>> s1[0] == '\\'
True

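So how can I convert s1 to 'б' (the Cyrillic letter that '\xd0\xb1' really represents)? I have already asked a similar question here, but there I misunderstood what s1 really contains (I took the '\\' in its repr for the '\' that introduces an escape, whereas it is a literal backslash character).

【Discussion】:

Tags: python unicode python-3.x encoding utf-8

【Solution 1】: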

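Try the following code. Warning: it is only a proof of concept. When the text also contains characters that are not written as escape sequences, the replacement has to be done in a more complex way (I will show that later if needed). See the comments below.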

import binascii

s1 = '\\xd0\\xb1'
print('s1 =', repr(s1), '=', list(s1))            # list() to emphasize what are the characters

s2 = s1.replace('\\x', '')
print('s2 =', repr(s2))

b = binascii.unhexlify(s2)
print('b =', repr(b), '=', list(b))

s3 = b.decode('utf8')
print('s3 =', ascii(s3))

with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(s3)

It prints the following on the console:

c:\__Python\user\so20210201>py a.py
s1 = '\\xd0\\xb1' = ['\\', 'x', 'd', '0', '\\', 'x', 'b', '1']
s2 = 'd0b1'
b = b'\xd0\xb1' = [208, 177]
s3 = '\u0431'

It also writes the character to the output.txt file.

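The trouble is that the question mixes unicode escaping with escaped binary values. In other words, a unicode string can contain sequences that represent binary values in some way; however, you cannot cast binary values directly to a unicode string, because every unicode character is really an abstract integer (a code point), and that integer can be represented as a sequence of bytes in many ways.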
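
As a rough illustration of that last point (not part of the original answer), the snippet below encodes the single code point of 'б' with three encodings and gets three different byte sequences. The choice of encodings is arbitrary.

```python
ch = '\u0431'                # 'б', CYRILLIC SMALL LETTER BE
code_point = ord(ch)         # the abstract integer behind the character

# One code point, several possible byte representations.
encoded = {enc: ch.encode(enc) for enc in ('utf-8', 'utf-16-le', 'cp1251')}

print(hex(code_point))
for enc, raw in encoded.items():
    print(enc, list(raw))
```

Only the UTF-8 form matches the bytes [208, 177] from the question, which is why the decoding step has to name the encoding explicitly.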

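If the unicode string contained escape sequences like \\n, it could be done differently, using the 'unicode_escape' recipe of bytes.decode(). In this case, however, you need to decode from the ASCII escape sequences first, and only then from UTF-8.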
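
A minimal sketch of that two-step decoding, assuming the input consists of ASCII characters only. The function name decode_escaped_utf8 is made up for this example; the codecs used ('unicode_escape', 'latin-1', 'utf-8') are standard Python codecs.

```python
def decode_escaped_utf8(s):
    # Step 1: interpret the escape sequences. Each \xNN becomes one
    # character with a code point in the range 0..255.
    # Note: 'unicode_escape' interprets every backslash escape
    # (\n, \t, \uXXXX, ...), not only \xNN.
    unescaped = s.encode('ascii').decode('unicode_escape')

    # Step 2: latin-1 maps code points 0..255 one-to-one onto bytes,
    # so this recovers the raw bytes, which are then decoded as UTF-8.
    return unescaped.encode('latin-1').decode('utf-8')


s1 = '\\xd0\\xb1'                      # 8 characters
print(ascii(decode_escaped_utf8(s1)))
```

Because step 1 honours all escapes, this is only appropriate when every backslash in the input is meant as an escape introducer.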

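Update: here is a function that converts your kind of string when it also contains other ASCII characters (i.e. not only escape sequences). The function uses a finite automaton, which may look too complex at first (in fact it is just verbose).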

def userDecode(s):
    status = 0
    lst = []                       # result as list of bytes as ints
    xx = None                      # variable for one byte escape conversion
    for c in s:                    # unicode character
        print(status, ' c ==', c)  ## just for debugging
        if status == 0:
            if c == '\\':
                status = 1         # escape sequence for a byte starts
            else:
                lst.append(ord(c)) # convert to integer

        elif status == 1:          # x expected
            assert(c == 'x')
            status = 2

        elif status == 2:          # first nibble expected
            xx = c
            status = 3

        elif status == 3:          # second nibble expected
            xx += c
            lst.append(int(xx, 16)) # this is a hex representation of int
            status = 0

    # Construct the bytes from the ordinal values in the list, and decode
    # it as UTF-8 string.
    return bytes(lst).decode('utf-8')


if __name__ == '__main__':

    s = userDecode('\\xd0\\xb1whatever')
    print(ascii(s))    # cannot be displayed on console that does not support unicode

    with open('output.txt', 'w', encoding='utf-8') as f:
        f.write(s)

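Have a look at the generated file as well, and remove the debugging print when you no longer need it. With the print in place, the console shows: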

c:\__Python\user\so20210201>b.py
0  c == \
1  c == x
2  c == d
3  c == 0
0  c == \
1  c == x
2  c == b
3  c == 1
0  c == w
0  c == h
0  c == a
0  c == t
0  c == e
0  c == v
0  c == e
0  c == r
'\u0431whatever'
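
For comparison, here is a shorter sketch that does the same job with a regular expression instead of the automaton. It is an alternative technique, not the answer's method. The name user_decode_re is hypothetical, and the sketch assumes that every character outside an escape sequence fits in one byte (the automaton makes the same assumption through ord(c) and bytes(lst)).

```python
import binascii
import re

# A backslash, the letter x, and exactly two hex digits, e.g. \xd0
_HEX_ESCAPE = re.compile(rb'\\x([0-9a-fA-F]{2})')


def user_decode_re(s):
    # Assumption: non-escaped characters are in the range 0..255.
    raw = s.encode('latin-1')
    # Replace each \xNN escape with the single byte it stands for.
    raw = _HEX_ESCAPE.sub(lambda m: binascii.unhexlify(m.group(1)), raw)
    return raw.decode('utf-8')


print(ascii(user_decode_re('\\xd0\\xb1whatever')))
```

One behavioural difference: the automaton fails its assert on a backslash that is not followed by x, whereas this version leaves such a backslash in the output unchanged.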

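【Comments】:

Thank you very much, this solution works for me!
You are welcome :) By the way, how did you end up with a string full of escape sequences?
There is a Flask server. The message (a string) is encrypted with an RSA key on the server side and returned as binary data... inside a string (like s1 in the example). On the client side it is fetched with the Requests package. Bad news: I have no access to the server resources, so I cannot change the format used to send the encrypted message. Update, a few things were missing: 1. the message is encrypted with an RSA key on the server; 2. it is sent to the client as binary data in string form (like s1); 3. the client receives and decrypts it; 4. the result looks like s1.
I see. Still, isn't this some "well known" (though not to me) way of escaping transmitted content? If it is, there may be a module for exactly that purpose.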

【Solution 2】:

>>> s1 = b'\xd0\xb1' 
>>> s1.decode("utf8")
'б'
>>> len(s1)
2

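【Comments】:

Why did you put a b in there, and why not use r for a raw string?
@GamesBrainiac Because it is not a raw string: the backslashes are meaningful. The b makes it a byte string, and \xd0 is a single byte with the value 0xD0. You could combine the two prefixes (making it a raw byte string), but that would trigger the same mistake the OP made.
I see. Thanks, I did not know these were byte strings. Much appreciated :) Drop by the python chat room some time, I am sure we could all learn a lot from you :)
This may solve the problem as posed, but in practice s1 can come from outside the code (another source, the internet, etc.). The question is not how to convert '\xd0\xb1' with len == 2 to 'б', but how to convert '\\xd0\\xb1' with len == 8 to 'б'.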
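
To make the difference between the literal prefixes discussed in these comments concrete, here is a small sketch; the variable names are chosen for illustration only.

```python
as_bytes = b'\xd0\xb1'        # bytes literal: 2 bytes, 0xD0 and 0xB1
as_raw = r'\xd0\xb1'          # raw str literal: 8 characters
as_raw_bytes = br'\xd0\xb1'   # raw bytes literal: 8 ASCII bytes

print(len(as_bytes), len(as_raw), len(as_raw_bytes))

# The raw string is the same value as the question's s1.
print(as_raw == '\\xd0\\xb1')

# Only the plain bytes literal decodes to the Cyrillic letter.
print(as_bytes.decode('utf8') == '\u0431')

# Decoding the raw bytes just gives back the 8-character text.
print(as_raw_bytes.decode('utf8') == as_raw)
```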