Python 3.x replacements with Unicode characters

【Question title】: Python 3.x replacements with Unicode characters
【Posted】: 2014-10-15 18:24:06
【Question description】:

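I want to replace a newline with a space when the next line starts with a lowercase character.

My Python 3.3 code works when the next line starts with [a-z], but fails when it starts with a (lowercase) Unicode character.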

Test file (saved as UTF-8): Test to check<CRLF>whether it's working.<CRLF>Aquela é<CRLF>árvore pequena.<CRLF>

import os
input_file = 'test.txt'
input_file_path = os.path.join("c:", "\\", "Users", "Paulo", "workspace", "pdf_to_text", input_file)
input_string = open(input_file_path).read()

print(input_string)
import re

pattern = r'\n([a-zàáâãäåæçčèéêëěìíîïłðñńòóôõöøőřśŝšùúûüůýÿżžÞ]+)'
pattern_obj = re.compile(pattern)
replacement_string = " \\1"
output_string = pattern_obj.sub(replacement_string, input_string)
print(output_string)

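【Question comments】:

What is the encoding of the test file?
What is the "" on the pattern = line supposed to do?
Also, why hardcode this particular subset of lowercase letters instead of using the Unicode character category Ll or an appropriate locale concept?
Most importantly: can you give us some sample input data? The easiest way is to give us an MCVE that has input_string = "<some constant that demonstrates the problem>" in place of the code that loads it from a file we don't have.
Also, are you sure your text editor saves this source code as UTF-8 and not some other encoding? You're on Windows, and many Windows editors default to something else.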

Tags: python regex python-3.x unicode

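【Solution 1】:

...when I read() the file, the unicode characters é and á in the original file are changed to Ã© and Ã¡ respectively.

Your actual problem has nothing to do with regular expressions. You are reading UTF-8 text using the wrong encoding, latin-1. In Python 3.3, open() without an encoding argument uses the locale's preferred encoding, which on Windows is typically a single-byte code page rather than UTF-8.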

>>> print("é".encode('utf-8').decode('latin-1'))
Ã©
>>> print("á".encode('utf-8').decode('latin-1'))
Ã¡

To read a UTF-8 file:

with open(filename, encoding='utf-8') as file:
    text = file.read()
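
As a rough end-to-end check of the diagnosis above, the sketch below writes the question's sample text to a temporary file as UTF-8, reads it back once as latin-1 and once as UTF-8, and applies the question's own pattern to both. The temporary file and the variable names (sample, wrong, right) are illustrative assumptions, not part of the original post.

```python
import os
import re
import tempfile

# Sample text from the question; the temporary file stands in for test.txt.
sample = ("Test to check\n"
          "whether it's working.\n"
          "Aquela \xe9\n"
          "\xe1rvore pequena.\n")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.txt")
    # newline="" writes the "\n" characters unchanged on every platform.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(sample)

    # Wrong: decoding UTF-8 bytes as latin-1 turns each accented letter
    # into two unrelated characters (mojibake).
    with open(path, encoding="latin-1") as f:
        wrong = f.read()

    # Right: decode with the encoding the file was actually saved in.
    with open(path, encoding="utf-8") as f:
        right = f.read()

# The pattern from the question, unchanged.
pattern = re.compile(r'\n([a-zàáâãäåæçčèéêëěìíîïłðñńòóôõöøőřśŝšùúûüůýÿżžÞ]+)')

fixed_wrong = pattern.sub(r" \1", wrong)
fixed_right = pattern.sub(r" \1", right)

print(repr(fixed_wrong))
print(repr(fixed_right))
```

With the latin-1 read, the line that should start with á starts with Ã instead, which the character class does not contain, so that newline is left alone. With the UTF-8 read, the same pattern joins both lines.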

Old answer about regular expressions (unrelated to the OP's issue):

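In general, a single user-perceived character (e.g. ç, é) may span multiple Unicode code points, so [çé] may match those code points individually instead of matching the whole character. (?:ç|é) would fix that. There are other issues too, such as Unicode normalization (NFC, NFKD).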
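
The point about code points can be illustrated with a small sketch. It assumes é written in decomposed form (e followed by U+0301 COMBINING ACUTE ACCENT); the variable names are made up for the example.

```python
import re
import unicodedata

composed = "\xe9"        # é as a single code point (U+00E9)
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT (U+0301)

# The two strings render alike but are different code point sequences.
same_before = composed == decomposed
# NFC normalization composes the pair into the single code point.
same_after = unicodedata.normalize("NFC", decomposed) == composed

# Inside a character class, each code point is a separate alternative,
# so the base letter and the accent are matched one at a time.
class_hits = re.findall("[" + decomposed + "]", decomposed)

# A non-capturing group keeps the two code points together as one unit.
group_hits = re.findall("(?:" + decomposed + ")", decomposed)

print(same_before, same_after)
print(class_hits)
print(group_hits)
```

Normalizing both the pattern and the input to the same form (for example NFC) before matching is one way to sidestep the mismatch.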

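I want to replace a newline with a space when the next line starts with a lowercase character.

The regex module supports the POSIX character class [:lower:]: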

import regex # $ pip install regex

text = ("Test to check\n"
        "whether it's working.\n"
        "Aquela \xe9\n"
        "\xe1rvore pequena.\n")
print(text)
# -> Test to check
# -> whether it's working.
# -> Aquela é
# -> árvore pequena.
print(regex.sub(r'\n(?=[[:lower:]])', ' ', text))
# -> Test to check whether it's working.
# -> Aquela é árvore pequena.

Emulating the [:lower:] class with the re module:

import re
import sys
import unicodedata

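# Collect every code point in general category Ll (\p{Ll}, lowercase letters).
lower_chars = [u for u in map(chr, range(sys.maxunicode + 1))
               if unicodedata.category(u) == 'Ll']
# Build one large alternation; slow to compile, but it needs only the stdlib.
lower = "|".join(map(re.escape, lower_chars))
# `text` is the sample string defined in the regex example above.
print(re.sub(r"\n(?={lower})".format(lower=lower), ' ', text))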

The result is the same.
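
A lighter-weight variation, not from the original answer, is to let re find each newline and decide in a replacement function whether the following character is in category Ll. This is only a sketch; join_if_lower is a hypothetical helper name.

```python
import re
import unicodedata

text = ("Test to check\n"
        "whether it's working.\n"
        "Aquela \xe9\n"
        "\xe1rvore pequena.\n")


def join_if_lower(match):
    """Return a space if the character after the newline is in category Ll."""
    next_char = match.group(1)
    if unicodedata.category(next_char) == "Ll":
        return " "
    return match.group(0)  # leave the newline as it was


# (?=(.)) captures the next character without consuming it; "." does not
# match "\n", so blank lines and the final newline are never touched.
result = re.sub(r"\n(?=(.))", join_if_lower, text)
print(result)
```

This avoids building an alternation over every lowercase code point, at the cost of one Python function call per newline.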

【Comments】: