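Try the code below. Warning: this is only a proof of concept. If the text also contains characters that are not written as escape sequences, the replacement has to be done in a more complex way (I will show that later if needed). See the comments below.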
import binascii
s1 = '\\xd0\\xb1'
print('s1 =', repr(s1), '=', list(s1))  # list() shows the individual characters
s2 = s1.replace('\\x', '')
print('s2 =', repr(s2))
b = binascii.unhexlify(s2)
print('b =', repr(b), '=', list(b))
s3 = b.decode('utf8')
print('s3 =', ascii(s3))
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(s3)
It prints the following on the console:
c:\__Python\user\so20210201>py a.py
s1 = '\\xd0\\xb1' = ['\\', 'x', 'd', '0', '\\', 'x', 'b', '1']
s2 = 'd0b1'
b = b'\xd0\xb1' = [208, 177]
s3 = '\u0431'
and it writes the character to the output.txt file.
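The problem is that the question mixes Unicode escapes with escaped binary values. In other words, a Unicode string can contain sequences that represent binary values in some way; however, you cannot coerce binary values directly into a Unicode string, because every Unicode character is really an abstract integer, and that integer can be represented in many ways (as a sequence of bytes).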
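To make the "abstract integer" point concrete, here is a small sketch that encodes the same character with three different encodings. The choice of UTF-16-LE and cp1251 as the comparison encodings is mine, purely for illustration.

```python
ch = '\u0431'                     # CYRILLIC SMALL LETTER BE
code_point = ord(ch)              # the abstract integer, 0x431

# The same code point has a different byte representation in each encoding.
utf8 = ch.encode('utf-8')
utf16 = ch.encode('utf-16-le')
cp1251 = ch.encode('cp1251')
print(hex(code_point), utf8, utf16, cp1251)

# Bytes become text again only once the encoding is named explicitly.
same = (utf8.decode('utf-8')
        == utf16.decode('utf-16-le')
        == cp1251.decode('cp1251')
        == ch)
print(same)
```

This is why the bytes d0 b1 from the question have to be decoded explicitly as UTF-8: nothing in the bytes themselves says which encoding they use.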
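If the Unicode string also contains escape sequences like \\n, it can be done in a different way, using the 'unicode_escape' codec with bytes.decode(). In that case, however, you need to decode the ASCII escape sequences first, and only then decode from UTF-8.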
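A minimal sketch of that two-step decoding might look like the following. The function name decode_escaped is hypothetical, and the sketch assumes the input consists of ASCII characters only and that every escape stands for a single byte (no \u escapes above 0xFF); otherwise one of the encode() calls raises an error.

```python
def decode_escaped(s):
    """Hypothetical helper: two-step decoding of a string such as '\\xd0\\xb1'."""
    # Step 1: interpret the escape sequences. Each \xNN becomes one
    # character with code point 0xNN, and \n becomes a real newline.
    raw = s.encode('ascii').decode('unicode_escape')
    # Step 2: latin-1 maps code points 0..255 one-to-one onto bytes, so this
    # recovers the intended bytes, which are then decoded as UTF-8.
    return raw.encode('latin-1').decode('utf-8')


if __name__ == '__main__':
    print(ascii(decode_escaped('\\xd0\\xb1whatever')))
    print(ascii(decode_escaped('line1\\nline2')))
```

The latin-1 round trip is only a way to turn the intermediate characters back into the bytes they stand for; it does not mean the data is Latin-1 text.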
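Update: here is a function that converts your kind of string when it is mixed with other ASCII characters (that is, not only escape sequences). The function uses a finite automaton, which may look too complex at first (actually, it is just verbose).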
def userDecode(s):
    status = 0
    lst = []     # result as list of bytes as ints
    xx = None    # variable for one byte escape conversion
    for c in s:  # unicode character
        print(status, ' c ==', c)  ## just for debugging
        if status == 0:
            if c == '\\':
                status = 1  # escape sequence for a byte starts
            else:
                lst.append(ord(c))  # convert to integer
        elif status == 1:  # x expected
            assert c == 'x'
            status = 2
        elif status == 2:  # first nibble expected
            xx = c
            status = 3
        elif status == 3:  # second nibble expected
            xx += c
            lst.append(int(xx, 16))  # this is a hex representation of int
            status = 0

    # Construct the bytes from the ordinal values in the list, and decode
    # it as UTF-8 string.
    return bytes(lst).decode('utf-8')


if __name__ == '__main__':
    s = userDecode('\\xd0\\xb1whatever')
    print(ascii(s))  # cannot be displayed on console that does not support unicode
    with open('output.txt', 'w', encoding='utf-8') as f:
        f.write(s)
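Have a look at the generated file as well, and remove the debugging print once you no longer need it. With the debugging print in place, it displays the following on the console: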
c:\__Python\user\so20210201>b.py
0 c == \
1 c == x
2 c == d
3 c == 0
0 c == \
1 c == x
2 c == b
3 c == 1
0 c == w
0 c == h
0 c == a
0 c == t
0 c == e
0 c == v
0 c == e
0 c == r
'\u0431whatever'
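For comparison, the same conversion can be sketched more compactly with a regular expression instead of an explicit automaton. This is an alternative, not the function above: the name decode_hex_escapes is hypothetical, and it differs in one respect, which is my own assumption about the desired behaviour: literal characters between the escapes are encoded as UTF-8 rather than passed through ord(), so literal non-ASCII characters also survive.

```python
import re

# Matches a literal backslash, 'x', and exactly two hex digits.
_HEX_ESCAPE = re.compile(r'\\x([0-9a-fA-F]{2})')


def decode_hex_escapes(s):
    """Hypothetical helper: turn \\xNN escapes into bytes, then decode as UTF-8."""
    out = bytearray()
    pos = 0
    for m in _HEX_ESCAPE.finditer(s):
        out += s[pos:m.start()].encode('utf-8')  # literal text before the escape
        out.append(int(m.group(1), 16))          # the escaped byte itself
        pos = m.end()
    out += s[pos:].encode('utf-8')               # literal text after the last escape
    return out.decode('utf-8')


if __name__ == '__main__':
    print(ascii(decode_hex_escapes('\\xd0\\xb1whatever')))
```

Unlike the automaton, this version leaves a malformed escape such as a lone backslash in the text instead of failing an assertion.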