Strip special characters and punctuation from a unicode string: answers

【Question title】: Strip special characters and punctuation from a unicode string
【Posted】: 2016-02-20 14:51:53
【Question】:

I am trying to remove punctuation from a unicode string that may contain non-ASCII letters. I tried the regex module:

import regex
text = u"<Üäik>"
regex.sub(ur"\p{P}+", "", text)

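However, I noticed that the characters < and > are not removed. Does anyone know why, and is there another way to strip punctuation from a unicode string?

Edit: another approach I tried is: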

import string
text = text.encode("utf8").translate(None, string.punctuation).decode("utf8")

But I would like to avoid converting the text from unicode to a byte string and back again.

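【Question discussion】:

You should define what counts as punctuation. In Unicode especially, that can be a very large number of characters and character combinations, depending on your language.
There is no need to encode to UTF-8 when you use unicode.translate(). Use text.translate(dict.fromkeys(ord(c) for c in string.punctuation)).
And \p{P} does not include < or >; these are not punctuation. They are Math Symbol (Sm) codepoints.
@MartijnPieters Thanks for the clarification!
@ivanab: string.punctuation is defined by a different standard than Unicode. The two disagree.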
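
To make the disagreement mentioned in these comments concrete, here is a minimal Python 3 sketch (standard library only) that groups the characters of string.punctuation by their Unicode general category. The variable names are illustrative and do not come from the thread.

```python
import string
import unicodedata
from collections import defaultdict

# Group each ASCII "punctuation" character by its Unicode general category.
by_category = defaultdict(list)
for ch in string.punctuation:
    by_category[unicodedata.category(ch)].append(ch)

for category in sorted(by_category):
    print(category, " ".join(by_category[category]))

# Characters that string.punctuation lists but whose category is not P*,
# so a \p{P} pattern would leave them in place.
not_unicode_punct = [
    ch for ch in string.punctuation
    if not unicodedata.category(ch).startswith("P")
]
print("".join(not_unicode_punct))  # $+<=>^`|~
```

The last line shows the nine ASCII characters that are symbols (Sc, Sk, Sm) rather than punctuation in Unicode terms, which is why < and > survive the \p{P}+ substitution.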

Tags: python regex string python-2.7 unicode

【Solution 1】:

Try the string module:

import string,re
text = u"<Üäik>"
out = re.sub('[%s]' % re.escape(string.punctuation), '', text)
print out
print type(out)

This prints:

Üäik
<type 'unicode'>

【Discussion】:

【Solution 2】:

< and > are classified as Math Symbols (Sm), not Punctuation (P). You can match both:

regex.sub(r'[\p{P}\p{Sm}]+', '', text)

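Alternatively, there is the unicode.translate() method, which takes a dictionary mapping integers (code points) to other integer code points, unicode strings, or None; None deletes that code point. Map string.punctuation to code points with ord():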

text.translate(dict.fromkeys(ord(c) for c in string.punctuation))

This removes only the limited set of ASCII punctuation characters.
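
The limitation is easy to see with a string that mixes ASCII and non-ASCII punctuation. A minimal Python 3 sketch, where str.translate() plays the role of unicode.translate(); the sample text is made up for illustration:

```python
import string

# Map every ASCII punctuation code point to None, which deletes it.
ascii_punct = dict.fromkeys(ord(c) for c in string.punctuation)

sample = "„Üäik“, <ok>!"
stripped = sample.translate(ascii_punct)
print(stripped)  # „Üäik“ ok
```

The comma, angle brackets, and exclamation mark are removed, but the German-style quotation marks (U+201E and U+201C) stay, because they are not in string.punctuation.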

Demo:

>>> import regex
>>> text = u"<Üäik>"
>>> print regex.sub(r'[\p{P}\p{Sm}]+', '', text)
Üäik
>>> import string
>>> print text.translate(dict.fromkeys(ord(c) for c in string.punctuation))
Üäik

If string.punctuation is not enough, you can generate a complete translate() mapping for all P and Sm code points by iterating from 0 to sys.maxunicode and testing each value with unicodedata.category():

>>> import sys, unicodedata
>>> toremove = dict.fromkeys(i for i in range(0, sys.maxunicode + 1) if unicodedata.category(unichr(i)).startswith(('P', 'Sm')))
>>> print text.translate(toremove)
Üäik

(For Python 3, replace unicode with str, unichr with chr, and print ... with print(...).)
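
Put together for Python 3, the same idea might look like the sketch below. The helper names build_removal_table and strip_punctuation are hypothetical, chosen here for illustration; only sys.maxunicode, unicodedata.category(), and str.translate() come from the answer.

```python
import sys
import unicodedata


def build_removal_table(prefixes=("P", "Sm")):
    """Return a str.translate() table that deletes every code point whose
    general category starts with one of the given prefixes."""
    return dict.fromkeys(
        i for i in range(sys.maxunicode + 1)
        if unicodedata.category(chr(i)).startswith(prefixes)
    )


# Building the table scans about 1.1 million code points, so do it once
# and reuse it for every string.
TOREMOVE = build_removal_table()


def strip_punctuation(text):
    """Delete all punctuation (P*) and math symbol (Sm) characters."""
    return text.translate(TOREMOVE)


print(strip_punctuation("<Üäik> „ok“ 1+1=2"))  # Üäik ok 112
```

Note that + and = are math symbols too, so they are removed along with < and >. Characters in other symbol categories, such as $ (Sc), are kept unless you add their prefix to the tuple.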

【Discussion】:

Running regex.sub('[\p{P}\p{Ms}]+', '', text) gives the error _regex_core.error: unknown property at position 12
@SIslam: My mistake, I got the category abbreviation wrong. I have corrected it in my answer.

【Solution 3】:

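\p{P} matches punctuation characters.

Within the ASCII range, those punctuation characters are

! " # % & ' ( ) * , - . / : ; ? @ [ \ ] _ { }

< and > are not punctuation (Unicode classifies them as math symbols), so they are not removed.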

Try this, using the regex module, since the standard re module does not support \p{...}:

regex.sub(r'[\p{P}<>]+', "", text)

【Discussion】: