Converting character offsets into byte offsets (in Python): answers

【Question title】: Converting character offsets into byte offsets (in Python)
【Posted】: 2014-06-02 17:00:16
【Question description】:

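Suppose I have a bunch of UTF-8 encoded files whose contents I send to an external API as unicode strings. The API operates on each unicode string and returns a list of (character_offset, substr) tuples.

The output I need is the begin and end byte offset of each substring found. If I'm lucky, the input text contains only ASCII characters (which makes character offsets and byte offsets identical), but that is not always the case. Given a known begin character offset and a substring, how can I find the begin and end byte offsets?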
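To make the mismatch concrete, here is a minimal sketch. It assumes Python 3 (where `str` is a unicode string), and the sample text is made up for illustration; it is not taken from the question.

```python
# Hypothetical sample text: "é" and "ü" each occupy two bytes in UTF-8.
text = "café über"
encoded = text.encode("utf-8")

# Offset of the substring counted in characters (what the API would report).
char_offset = text.index("über")
# Offset of the same substring counted in bytes (what the question asks for).
byte_offset = encoded.index("über".encode("utf-8"))

print(len(text), len(encoded))    # character length vs. byte length
print(char_offset, byte_offset)   # the offsets diverge after the first non-ASCII character
```

For pure ASCII text both pairs of numbers would be identical, which is the lucky case described above.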

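I have already answered this question myself, but I am looking forward to other solutions that are more robust, more efficient, and/or more readable.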

【Question discussion】:

Tags: python offset unicode-string

【Solution 1】:

I would solve this with a dictionary that maps character offsets to byte offsets, and then look the offsets up in it.

def get_char_to_byte_map(unicode_string):
    """
    Generates a dictionary mapping character offsets to byte offsets for unicode_string.
    """
    response = {}
    byte_offset = 0
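    for char_offset, character in enumerate(unicode_string):
        response[char_offset] = byte_offset
        byte_offset += len(character.encode('utf-8'))
    # Also map the end-of-string offset, so that a substring ending at the
    # very end of the text can be looked up without a KeyError.
    response[len(unicode_string)] = byte_offset
    return response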

char_to_byte_map = get_char_to_byte_map(text)

for character_offset, substring in api_response:
    begin_offset = char_to_byte_map[character_offset]
    end_offset = char_to_byte_map[character_offset + len(substring)]
    # do something with begin_offset and end_offset

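How this solution performs compared to yours depends heavily on the size of the input and the number of substrings involved. A local micro-benchmark suggests that encoding every individual character of the text takes roughly 1000 times as long as encoding the whole text in one go.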
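Since encoding the whole text once is reported to be so much cheaper than encoding it character by character, one possible variation is to encode once and derive the offsets from the bytes themselves. The sketch below is not part of the original answer. It assumes Python 3 (iterating over `bytes` yields integers), relies on the fact that UTF-8 continuation bytes have the bit pattern `10xxxxxx`, and the name `build_byte_offsets` is hypothetical.

```python
def build_byte_offsets(text):
    """Return a list where entry i is the UTF-8 byte offset of character i.

    The list has len(text) + 1 entries; the last one is the total byte
    length, so end offsets at the very end of the text can be looked up too.
    """
    encoded = text.encode("utf-8")  # a single encode call for the whole text
    # Continuation bytes look like 0b10xxxxxx; any other byte starts a character.
    offsets = [i for i, byte in enumerate(encoded) if byte & 0xC0 != 0x80]
    offsets.append(len(encoded))
    return offsets


# Hypothetical sample input.
sample = "naïve café"
offsets = build_byte_offsets(sample)

character_offset, substring = 6, "café"
begin_offset = offsets[character_offset]
end_offset = offsets[character_offset + len(substring)]
```

A list indexed by character offset plays the same role as the dictionary, because the keys are just consecutive integers starting at 0.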

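【Discussion】:

Nice solution! So this will be faster when there are a lot of substrings. The difference you found in your micro-benchmark is surprisingly large, though. I'll see whether I can check the running time on some of the strings I am working with.
I expect the difference comes from function call overhead. I tested a string roughly 1,000,000 (one million) characters long, which makes it a million function calls versus a single one.

【Solution 2】:

To convert character offsets into byte offsets when needed, that is, when the input text contains any non-ASCII characters, I encode('utf8') the part of the text that leads up to the found substring and take its length as the begin offset.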

# Check if text contains non-ASCII characters
needs_offset_conversion = len(text) != len(text.encode('utf8'))

def get_byte_offsets(text, character_offset, substr, needs_conversion):
    if needs_conversion:
        begin_offset = len(text[:character_offset].encode('utf8'))
        end_offset = begin_offset + len(substr.encode('utf8'))
    else:
        begin_offset = character_offset
        end_offset = character_offset + len(substr)
    return begin_offset, end_offset

This implementation works, but it encodes a (large) part of the text for every substring found.
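One way to reduce the repeated encoding, sketched here under assumptions and not taken from the original answer, is to process the matches in order of character offset and encode only the stretch of text between two consecutive offsets. The function name `get_byte_spans` and the sample data are hypothetical, and the code assumes Python 3.

```python
def get_byte_spans(text, matches):
    """Convert (character_offset, substr) pairs into (begin, end) byte offsets.

    Matches are visited in ascending character offset, so each stretch of
    text between two offsets is encoded only once. The substrings themselves
    are still encoded separately. Results are returned in input order.
    """
    order = sorted(range(len(matches)), key=lambda i: matches[i][0])
    result = [None] * len(matches)
    prev_char = 0  # character offset handled so far
    prev_byte = 0  # byte offset corresponding to prev_char
    for i in order:
        character_offset, substr = matches[i]
        # Encode only the text between the previous offset and this one.
        prev_byte += len(text[prev_char:character_offset].encode("utf-8"))
        prev_char = character_offset
        result[i] = (prev_byte, prev_byte + len(substr.encode("utf-8")))
    return result


# Hypothetical API response, deliberately not sorted by offset.
text = "café über naïve"
matches = [(10, "naïve"), (5, "über")]
spans = get_byte_spans(text, matches)
```

Whether this pays off compared to a precomputed offset table depends on how many substrings there are and how they are distributed over the text.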

【Discussion】: