How to find out internal string encoding? Answers

【Question title】: How to find out internal string encoding?
【Posted】: 2017-09-18 20:42:27
【Question】:

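From PEP 393 I understand that Python can use several encodings internally when storing strings: latin1, UCS-2, UCS-4. Is it possible to find out which encoding is used to store a particular string, e.g. in the interactive interpreter?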

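【Comments】:

Can you detail why you'd want to do this? Unless you're trying to deconstruct the inner workings of the interpreter (in which case I'd say just look at the source), it seems like an odd thing to do.
Maybe there's a ctypes hack?
@David: I think it might be useful for estimating the space needed, or for debugging.
I guess... well, I think the main thing I'm asking is: are you trying to do this dynamically from within Python? Or is looking at the interpreter source a valid solution? In the former case, I'm not sure it's possible; in the latter, it probably depends on the implementation.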

Tags: python string python-3.x encoding python-internals

【Answer 1】:

There is a CPython C API function that reports the kind of a unicode object: PyUnicode_KIND.

If you have Cython and IPython¹, you can easily access that function:

In [1]: %load_ext cython
   ...:

In [2]: %%cython
   ...:
   ...: cdef extern from "Python.h":
   ...:     int PyUnicode_KIND(object o)
   ...:
   ...: cpdef unicode_kind(astring):
   ...:     if type(astring) is not str:
   ...:         raise TypeError('astring must be a string')
   ...:     return PyUnicode_KIND(astring)

In [3]: a = 'a'
   ...: b = 'Ǧ'
   ...: c = '\U00010348'  # a character outside the BMP

In [4]: unicode_kind(a), unicode_kind(b), unicode_kind(c)
Out[4]: (1, 2, 4)

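Here 1 stands for latin-1, and 2 and 4 stand for UCS-2 and UCS-4 respectively.

You could then use a dictionary to map these numbers to strings that name the encoding.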
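A minimal sketch of such a mapping, assuming the kind values 1, 2 and 4 returned above. The names `KIND_TO_ENCODING` and `encoding_name` are hypothetical, chosen only for illustration:

```python
# Hypothetical helper: map PyUnicode_KIND results to readable names.
KIND_TO_ENCODING = {1: 'latin-1', 2: 'UCS-2', 4: 'UCS-4'}


def encoding_name(kind):
    """Return the storage name for a kind value, or 'unknown' otherwise."""
    return KIND_TO_ENCODING.get(kind, 'unknown')


print([encoding_name(k) for k in (1, 2, 4)])
```

Falling back to 'unknown' rather than raising keeps the helper usable for kind values not covered here.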

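¹ It is also possible without Cython and/or IPython; the combination is just very convenient. Otherwise it would take more code (without IPython) and/or require a manual installation (without Cython).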

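【Comments】:

【Answer 2】:

The only way to test this from the Python layer (without manually poking at object internals via ctypes or a Python extension module) is to check the ordinal value of the largest character in the string, which determines whether the string is stored as ASCII/latin-1, UCS-2 or UCS-4. A solution would look something like: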

def get_bpc(s):
    maxordinal = ord(max(s, default='\0'))
    if maxordinal < 256:
        return 1
    elif maxordinal < 65536:
        return 2
    else:
        return 4
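As a usage sketch, the same largest-ordinal idea can give a rough estimate of the space taken by the character data, which was the motivation mentioned in the comments. The name `estimated_payload_bytes` is hypothetical, and the figure deliberately ignores the object header, the trailing terminator and any cached UTF-8 copy:

```python
def estimated_payload_bytes(s):
    """Rough size of the character data alone (header not included)."""
    if not s:
        return 0
    max_ordinal = ord(max(s))
    # Same thresholds as get_bpc: latin-1, UCS-2, otherwise UCS-4.
    if max_ordinal < 256:
        width = 1
    elif max_ordinal < 65536:
        width = 2
    else:
        width = 4
    return len(s) * width


print(estimated_payload_bytes('test'))
```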

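You can't actually rely on sys.getsizeof. For non-ASCII strings (even one-byte-per-character strings that fit in the latin-1 range), the string may or may not have a cached UTF-8 representation attached to it, so tricks like adding an extra character and comparing sizes can actually show the size decreasing. Worse, the caching can happen "at a distance", so you are not directly responsible for whether the cached UTF-8 form exists on the string you are checking. For example: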

>>> import sys
>>> e = 'é'
>>> sys.getsizeof(e)
74
>>> sys.getsizeof(e + 'a')
75
>>> class é: pass  # One of several ways to trigger creation/caching of UTF-8 form
>>> sys.getsizeof(e)
77  # !!! Grew three bytes even though it's the same variable
>>> sys.getsizeof(e + 'a')
75  # !!! Adding a character shrunk the string!

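【Comments】:

【Answer 3】:

One way to find out the exact internal encoding CPython uses for a particular unicode string is to look at the actual (CPython) object.

According to PEP 393 (the Specification section), all unicode string objects start with PyASCIIObject: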

typedef struct {
  PyObject_HEAD
  Py_ssize_t length;
  Py_hash_t hash;
  struct {
      unsigned int interned:2;
      unsigned int kind:2;
      unsigned int compact:1;
      unsigned int ascii:1;
      unsigned int ready:1;
  } state;
  wchar_t *wstr;
} PyASCIIObject;

The character size is stored in the kind bit-field, as described in the PEP and in the code comments in unicodeobject:

00 => str is not initialized (data are in wstr)
01 => 1 byte (Latin-1)
10 => 2 byte (UCS-2)
11 => 4 byte (UCS-4);
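To make the bit layout concrete, here is a small sketch that unpacks a state byte using plain bit arithmetic. It assumes the field order and widths of the PEP 393 struct quoted above and that the compiler allocates bit-fields starting from the least significant bit (an assumption about the platform ABI, not something the PEP guarantees). `State` and `decode_state` are hypothetical names:

```python
from collections import namedtuple

# Field order follows the struct quoted above.
State = namedtuple('State', 'interned kind compact ascii ready')


def decode_state(byte):
    """Split one state byte into its bit-fields (LSB-first allocation assumed)."""
    return State(
        interned=byte & 0b11,        # bits 0-1
        kind=(byte >> 2) & 0b11,     # bits 2-3
        compact=(byte >> 4) & 1,     # bit 4
        ascii=(byte >> 5) & 1,       # bit 5
        ready=(byte >> 6) & 1,       # bit 6
    )


# A made-up byte with kind=01 and the compact, ascii and ready flags set.
example = decode_state(0b1110100)
print(example)
```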

Once we have the address of the string from id(string), we can use the ctypes module to read the object's bytes (and the kind field):

import ctypes
mystr = "x"
first_byte = ctypes.c_uint8.from_address(id(mystr)).value

The offset from the start of the object to kind is PyObject_HEAD + Py_ssize_t length + Py_hash_t hash, which in turn is Py_ssize_t ob_refcnt + the pointer to ob_type + Py_ssize_t length + the size of another pointer for the hash type:

offset = 2 * ctypes.sizeof(ctypes.c_ssize_t) + 2 * ctypes.sizeof(ctypes.c_void_p)

(which is 32 on x64)
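The same offset can be cross-checked by mirroring the leading fields in a ctypes.Structure and asking ctypes where the state field lands. This is only a sketch under the layout described above (a default CPython build, with Py_hash_t the same size as Py_ssize_t); the class name `PyASCIIHead` is hypothetical:

```python
import ctypes


class PyASCIIHead(ctypes.Structure):
    """Hypothetical mirror of the leading fields of PyASCIIObject."""
    _fields_ = [
        ('ob_refcnt', ctypes.c_ssize_t),  # PyObject_HEAD, part 1
        ('ob_type', ctypes.c_void_p),     # PyObject_HEAD, part 2
        ('length', ctypes.c_ssize_t),
        ('hash', ctypes.c_ssize_t),       # Py_hash_t assumed pointer-sized
        ('state', ctypes.c_uint32),       # the bit-field word holding kind
    ]


# ctypes computes the field offset, including any alignment padding.
state_offset = PyASCIIHead.state.offset
print(state_offset)  # 32 on x64, matching the hand-computed value
```

Letting ctypes compute the offset avoids hand-counting field sizes, but it is only as accurate as the mirrored field list.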

Putting it all together:

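import ctypes

def bytes_per_char(s):
    # Skip ob_refcnt, ob_type, length and hash to reach the state bit-field.
    offset = 2 * ctypes.sizeof(ctypes.c_ssize_t) + 2 * ctypes.sizeof(ctypes.c_void_p)
    # Drop the 2-bit interned field, then mask the 2-bit kind field
    # (assumes the PEP 393 layout shown above).
    kind = (ctypes.c_uint8.from_address(id(s) + offset).value >> 2) & 3
    # kind 0 means the data lives in wstr, so the width is that of wchar_t.
    size = {0: ctypes.sizeof(ctypes.c_wchar), 1: 1, 2: 2, 3: 4}
    return size[kind]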

This gives:

>>> bytes_per_char('test')
1
>>> bytes_per_char('đžš')
2
>>> bytes_per_char('\U00010348')
4

Note that we have to handle the special case kind == 0, because then the character type is exactly wchar_t (16 or 32 bits, depending on the platform).

【Comments】: