Show non-printable characters in a string

【Question title】: Show non printable characters in a string
【Posted】: 2012-12-18 07:00:13
【Question】:

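Is it possible to visualize the non-printable characters in a Python string as hex values?

For example, if I have a string containing a newline, I would like to replace it with \x0a.

I know that repr() will give me ...\n, but I am looking for a hex version.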

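【Comments】:

The built-in codec string_escape (s.encode('string_escape')) does almost what you want: it gives you hex for everything except \t, \r, and \n. Unfortunately, as far as I know, nothing built in avoids special-casing those three...
Hmm... that doesn't work for me. I get "LookupError: unknown encoding: string_escape".
Sorry, string_escape exists only in 2.x; in 3.x you want unicode_escape. But besides using \t, \r, and \n, it will also escape every character > \u00ff (or maybe > \u007f? I forget...), which means it is even less likely to be something you would be happy with out of the box... (The reason I made this a comment rather than an answer is that I didn't expect you to be happy with a built-in codec, given that your whole point is wanting \x0a instead of \n.)
As a side note, since you are on 3.x and using str rather than bytes: what do you want to do with non-printable non-ASCII characters? Replace them with Unicode escapes (e.g. \u1234), or something else?
No, the bytes I'm processing are shown as a string of one-byte characters, plus the non-printables, which should be shown as hex.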
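
A minimal sketch of what the comments above describe for Python 3, assuming the built-in unicode_escape codec; the sample string is made up for illustration. It shows that tab and newline come out as \t and \n rather than hex, which is why the codec alone does not answer the question.

```python
# Sample text (hypothetical): a tab, a newline, a bell, and two non-ASCII characters.
s = "tab\there\nbell\x07 \u00e9 \u2022"

# unicode_escape produces ASCII bytes; decode them to get a str back.
escaped = s.encode("unicode_escape").decode("ascii")

print(escaped)  # tab\there\nbell\x07 \xe9 \u2022
```

Note that the codec also escapes the printable non-ASCII characters, which may or may not be what you want.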

Tags: python python-3.x escaping

【Solution 1】:

I don't know of any built-in method, but it is fairly easy to do with a comprehension:

import string
printable = string.ascii_letters + string.digits + string.punctuation + ' '
def hex_escape(s):
    return ''.join(c if c in printable else r'\x{0:02x}'.format(ord(c)) for c in s)

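【Discussion】:

This works for ASCII strings, but in 3.x you can't count on strings being ASCII. And handling Unicode is not quite as simple (although it isn't that hard).
For example, hex_escape('a•') returns 'a\\x2022', which is incorrect (when you unescape it, it becomes 'a 22').
Why not just use string.printable?
@GreenAsJade string.printable includes the newline character.
Wow! :) When I came here looking for an answer to my own problem I wasn't interested in newlines, so I missed that... makes sense now.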
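
The two points raised in this discussion can be checked with a short sketch. It repeats the answer's hex_escape so that it runs on its own; the sample inputs are illustrative only.

```python
import string

# Same definition as in the answer: printable ASCII plus the space character.
printable = string.ascii_letters + string.digits + string.punctuation + ' '

def hex_escape(s):
    return ''.join(c if c in printable else r'\x{0:02x}'.format(ord(c)) for c in s)

# Works as intended for ASCII control characters.
print(hex_escape('line\n'))       # line\x0a

# A code point above 0xff yields four hex digits after \x, which is ambiguous:
# read back as an escape sequence it means '\x20' (a space) followed by '22'.
print(hex_escape('a\u2022'))      # a\x2022

# string.printable contains whitespace such as the newline, so using it
# directly would leave newlines unescaped.
print('\n' in string.printable)   # True
```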

【Solution 2】:

I'm a bit late to the party, but if you need this for simple debugging, I found that this works:

string = "\n\t\nHELLO\n\t\n\a\17"

procd = [c for c in string]

print(procd)

# Prints ['\n', '\t', '\n', 'H', 'E', 'L', 'L', 'O', '\n', '\t', '\n', '\x07', '\x0f']

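Ugly, but it helped me find the non-printable characters in a string.

【Discussion】:

procd = list(string) would be more concise than the trivial list comprehension.
For control characters (non-printable) it is nice to be able to see them: chr(ord(c) + 9216). See en.wikipedia.org/wiki/Control_Pictures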
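
The Control Pictures suggestion in the last comment could look roughly like this. The function name show_control_pictures is made up for this sketch, and it only handles the control characters below 0x20; whether the glyphs render depends on your font.

```python
def show_control_pictures(s):
    """Replace control characters below 0x20 with their Control Pictures glyphs."""
    # 0x2400 == 9216 is the first code point of the Unicode Control Pictures block,
    # so adding it to a control character's code point gives the matching glyph.
    return ''.join(chr(ord(c) + 0x2400) if ord(c) < 0x20 else c for c in s)

print(show_control_pictures("\n\t\nHELLO\a"))
```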

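【Solution 3】:

You'll have to do the translation manually; for example, go through the string with a regular expression and replace each occurrence with its hex equivalent.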

import re

replchars = re.compile(r'[\n\r]')
def replchars_to_hex(match):
    return r'\x{0:02x}'.format(ord(match.group()))

replchars.sub(replchars_to_hex, inputtext)

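The example above only matches newlines and carriage returns, but you can expand which characters are matched, including by using \x escape codes and ranges.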

>>> inputtext = 'Some example containing a newline.\nRight there.\n'
>>> replchars.sub(replchars_to_hex, inputtext)
'Some example containing a newline.\\x0aRight there.\\x0a'
>>> print(replchars.sub(replchars_to_hex, inputtext))
Some example containing a newline.\x0aRight there.\x0a

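【Discussion】:

This version is 4 times faster than iterating over the string with a comprehension, as suggested in the other answers. To replace all non-printables, use the following regex: re.compile('([^' + re.escape(string.printable) + '])'), or some other character set (depending on whether you want newlines etc.).
Use re.compile(r'[\x00-\x1f]') to match only control characters.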
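
Combining the answer with the last comment, a sketch that escapes every control character below 0x20 might look like this. The names control_chars, to_hex, and escape_controls are chosen here for illustration.

```python
import re

# Character class from the comment above: the control characters 0x00 through 0x1f.
control_chars = re.compile(r'[\x00-\x1f]')

def to_hex(match):
    # Format the matched character's code point as a two-digit \x escape.
    return r'\x{0:02x}'.format(ord(match.group()))

def escape_controls(text):
    return control_chars.sub(to_hex, text)

print(escape_controls('tab\there\r\nnul\x00'))  # tab\x09here\x0d\x0anul\x00
```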

【Solution 4】:

Modifying ecatmur's solution to handle non-printable non-ASCII characters makes it less trivial and more obnoxious:

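def escape(c):
    # Printable characters (by str.isprintable's definition) pass through unchanged.
    if c.isprintable():
        return c
    c = ord(c)
    # Pick the shortest escape form that can hold the code point.
    if c <= 0xff:
        return r'\x{0:02x}'.format(c)
    elif c <= 0xffff:
        return r'\u{0:04x}'.format(c)
    else:
        return r'\U{0:08x}'.format(c)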

def hex_escape(s):
    return ''.join(escape(c) for c in s)

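Of course, if str.isprintable is not exactly the definition you want, you can write a different function. (Note that it is a very different set from what is in string.printable: besides handling non-ASCII printable and non-printable characters, it also considers \n, \r, \t, \x0b, and \x0c to be non-printable.)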
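
The difference between the two definitions is easy to see with a short check. This is only an illustration; the bullet character is an arbitrary non-ASCII example.

```python
import string

# The five ASCII whitespace controls that string.printable includes
# but str.isprintable() rejects.
whitespace_controls = '\n\r\t\x0b\x0c'

for ch in whitespace_controls:
    # Columns: character, in string.printable?, str.isprintable()?
    print(repr(ch), ch in string.printable, ch.isprintable())

# A printable non-ASCII character goes the other way round.
bullet = '\u2022'
print(repr(bullet), bullet in string.printable, bullet.isprintable())
```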

You can make it more compact; the version above is explicit just to show all the steps involved in handling Unicode strings. For example:

def escape(c):
    if c.isprintable():
        return c
    elif c <= '\xff':
        return r'\x{0:02x}'.format(ord(c))
    else:
        return c.encode('unicode_escape').decode('ascii')

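Really, no matter what you do, you are going to have to handle \r, \n, and \t explicitly, because all of the built-in and stdlib functions I know of will escape them via those special sequences instead of their hex versions.

【Discussion】:

【Solution 5】:

I once did something similar by deriving a str subclass with a custom __repr__() method that did what I needed. It is not exactly what you are looking for, but it may give you some ideas.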

# -*- coding: iso-8859-1 -*-

# special string subclass to override the default
# representation method. main purpose is to
# prefer using double quotes and avoid hex
# representation on chars with an ord > 128
class MsgStr(str):
    def __repr__(self):
        # use double quotes unless there are more of them within the string than
        # single quotes
        if self.count("'") >= self.count('"'):
            quotechar = '"'
        else:
            quotechar = "'"

        rep = [quotechar]
        for ch in self:
            # control char?
            if ord(ch) < ord(' '):
                # remove the single quotes around the escaped representation
                rep += repr(str(ch)).strip("'")
            # embedded quote matching quotechar being used?
            elif ch == quotechar:
                rep += "\\"
                rep += ch
            # else just use others as they are
            else:
                rep += ch
        rep += quotechar

        return "".join(rep)

if __name__ == "__main__":
    s1 = '\tWürttemberg'
    s2 = MsgStr(s1)
    print "str    s1:", s1
    print "MsgStr s2:", s2
    print "--only the next two should differ--"
    print "repr(s1):", repr(s1), "# uses built-in string 'repr'"
    print "repr(s2):", repr(s2), "# uses custom MsgStr 'repr'"
    print "str(s1):", str(s1)
    print "str(s2):", str(s2)
    print "repr(str(s1)):", repr(str(s1))
    print "repr(str(s2)):", repr(str(s2))
    print "MsgStr(repr(MsgStr('\tWürttemberg'))):", MsgStr(repr(MsgStr('\tWürttemberg')))
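
The same subclassing idea could be pointed at the original question in Python 3. The class below is a hypothetical sketch, not part of the answer above: HexStr is a made-up name, it relies on str.isprintable to decide what to escape, and it does not escape embedded quotes or backslashes, so its output is meant for display rather than for round-tripping.

```python
class HexStr(str):
    """Hypothetical str subclass whose repr() shows non-printables as hex escapes."""

    def __repr__(self):
        parts = []
        for ch in self:
            if ch.isprintable():
                parts.append(ch)
            elif ord(ch) <= 0xff:
                # Two-digit hex escape, e.g. newline -> \x0a
                parts.append('\\x{0:02x}'.format(ord(ch)))
            else:
                # Let the codec produce \uXXXX or \UXXXXXXXX for larger code points.
                parts.append(ch.encode('unicode_escape').decode('ascii'))
        return '"' + ''.join(parts) + '"'

print(repr(HexStr('\tW\u00fcrttemberg\n')))
```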

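【Discussion】:

【Solution 6】:

There is another way to print non-printable characters, in the sense that they get executed as commands inside the string even though they are not visible (transparent) in it. Their presence can be observed by measuring the length of the string with len, or by simply placing the cursor at the start of the string and counting how many times you have to press the arrow key to get from start to end: oddly, some single characters can appear to have a length of 3, which seems confusing. (Not sure whether this has already been demonstrated in earlier answers.)

In the example screenshot below, I pasted in a 135-bit string with a specific structure and format (which I had to create by hand beforehand for certain bit positions and its overall length) so that the particular program I was running would interpret it as ASCII. The resulting printed string contains non-printable characters such as a form feed (a new page, not a line break), which causes an extra entire blank line between the printed results in the output (see below):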

Example of printing non-printable characters that appear in printed string

Input a string:100100001010000000111000101000101000111011001110001000100001100010111010010101101011100001011000111011001000101001000010011101001000000
HPQGg]+\,vE!:@
>>> len('HPQGg]+\,vE!:@')
17
>>>

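In the code excerpt above, try copy-pasting the string HPQGg]+\,vE!:@ directly from this site and see what happens when you paste it into Python IDLE.

Hint: you have to press the arrow/cursor key three times to get across the two letters from P to Q, even though they appear next to each other, because there is actually a File Separator ASCII command between them.

However, even though we get the same starting value when decoding it as a byte array to hex, if we convert that hex back to bytes it looks different (perhaps a missing encoding, not sure). Either way, the program that produced the output above prints non-printable characters (I came across this by chance while trying to develop a compression method/experiment).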

>>> bytes(b'HPQGg]+\,vE!:@').hex()
'48501c514767110c5d2b5c2c7645213a40'
>>> bytes.fromhex('48501c514767110c5d2b5c2c7645213a40')
b'HP\x1cQGg\x11\x0c]+\\,vE!:@'

>>> (0x48501c514767110c5d2b5c2c7645213a40 == 0b100100001010000000111000101000101000111011001110001000100001100010111010010101101011100001011000111011001000101001000010011101001000000)
True
>>>

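In the 135-bit string above, the first 16 groups of 8 bits each encode one character (including the non-printable ones), starting from the big end, and the final group of 7 bits yields the @ sign, as shown below: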

Technical breakdown of the format of the above 135-bit string

Here is the breakdown of the 135-bit string as text:

10010000 = H (72)
10100000 = P (80)
00111000 = \x1c (28, File Separator) *
10100010 = Q (81)
10001110 = G (71)
11001110 = g (103)
00100010 = \x11 (17, Device Control 1) *
00011000 = \x0c (12, form feed / new page) *
10111010 = ] (93, right bracket)
01010110 = + (43, plus sign)
10111000 = \ (92, backslash)
01011000 = , (44, comma)
11101100 = v (118)
10001010 = E (69)
01000010 = ! (33, exclamation mark)
01110100 = : (58, colon)
1000000  = @ (64, at sign)

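So, to conclude, as an answer to the sub-question about showing non-printables as hex: in the byte array above the sequence \x1c appears, which denotes the File Separator command that was also noted in the hint. Without the b prefix on the left, the byte array can be treated as a string, and the value again shows up in the printed string even though it is invisible (its presence can still be observed through the hint and with len, as shown above).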
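
To make the hidden characters in this example visible without relying on copy and paste, one option is to rebuild the string from the hex value given above and escape whatever str.isprintable rejects. This is a sketch; the helper name reveal is made up for illustration.

```python
# Rebuild the 17-byte string from the hex value shown in the answer.
data = bytes.fromhex('48501c514767110c5d2b5c2c7645213a40')
text = data.decode('ascii')

def reveal(s):
    """Show characters that str.isprintable() rejects as two-digit hex escapes."""
    return ''.join(c if c.isprintable() else r'\x{0:02x}'.format(ord(c)) for c in s)

# Positions and code points of the invisible characters.
hidden = [(i, hex(ord(c))) for i, c in enumerate(text) if not c.isprintable()]

print(len(text))      # 17
print(reveal(text))   # HP\x1cQGg\x11\x0c]+\,vE!:@
print(hidden)         # [(2, '0x1c'), (6, '0x11'), (7, '0xc')]
```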

【Discussion】: