Dealing with commas and full stops as per convention

【Question title】: Dealing with comma and fullstops as per convention
【Posted】: 2021-04-21 04:42:07
【Question】:

I have various string instances, for example:

- hello world,i am 2000to -> hello world, i am 2000 to
- the state was 56,869,12th -> the state was 56,869, 12th
- covering.2% -> covering. 2%
- fiji,295,000 -> fiji, 295,000

To handle the first case, I came up with a two-step regex:

re.sub(r"(?<=[,])(?=[^\s])(?=[^0-9])", r" ", text) # hello world, i am 20,000to
re.sub(r"(?<=[0-9])(?=[.^[a-z])", r" ", text) # hello world, i am 20,000 to

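But this breaks the text in several other ways, and the remaining cases are not covered either. Can anyone suggest a more general regex that handles all of these cases correctly? I tried using replace, but it made some unintended substitutions, which in turn caused other problems. I am not an expert in regex and would appreciate pointers.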

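【Comments】:

The example covering.2% seems ambiguous to me, because the decimal value could be .2%
Every decimal value starts with some digit followed by a period. So 0.2, not .2.

Tags: python python-3.x regex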

【Solution 1】:

This approach covers the cases above by breaking the text into tokens and joining them with single spaces:

import re

in_list = [
    'hello world,i am 2000to',
    'the state was 56,869,12th',
    'covering.2%',
    'fiji,295,000',
    'and another example with a decimal 12.3not4,5 is right out',
    'parrot,, is100.00% dead',
    'Holy grail runs for this portion of 100 minutes,!, 91%. Fascinating'
]
# A token is either a word or a number, each with an optional trailing period or comma.
tokenizer = re.compile(r'[a-zA-Z]+[\.,]?|(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?(?:%|st|nd|rd|th)?[\.,]?')
for s in in_list:
    # Anything that is not part of a token (stray punctuation, whitespace) is dropped.
    print(' '.join(tokenizer.findall(s)))

#    hello world, i am 2000 to
#    the state was 56,869, 12th
#    covering. 2%
#    fiji, 295,000
#    and another example with a decimal 12.3 not 4, 5 is right out
#    parrot, is 100.00% dead
#    Holy grail runs for this portion of 100 minutes, 91%. Fascinating

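Breaking down the regex, each token is the longest available substring that is either:

letters only, with or without a trailing period or comma, [a-zA-Z]+[\.,]?
or |
a numeric expression, which is
- 1 to 3 digits \d{1,3} followed by one or more groups of a comma plus 3 digits (?:,\d{3})+
- or | any number of comma-free digits \d+
- an optional decimal point followed by at least one digit (?:\.\d+)?
- an optional suffix (percent, 'st', 'nd', 'rd', 'th') (?:%|st|nd|rd|th)?
- an optional trailing period or comma [\.,]?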
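
The breakdown above can also be written directly into the pattern. The following is a minimal sketch, assuming the same pattern as the answer, rewritten with re.VERBOSE so each piece carries its own comment. The names TOKENIZER and respace are made up for illustration and are not part of the original answer.

```python
import re

# Same pattern as the answer's tokenizer, spelled out with re.VERBOSE.
# Whitespace and "#" comments inside the pattern are ignored by the regex engine.
TOKENIZER = re.compile(
    r"""
    [a-zA-Z]+ [.,]?               # a word, optionally followed by one period or comma
    |                             # or
    (?: \d{1,3} (?: ,\d{3} )+     # digits grouped in thousands, such as 295,000
      | \d+ )                     # or a plain run of digits
    (?: \.\d+ )?                  # optional decimal part
    (?: % | st | nd | rd | th )?  # optional suffix
    [.,]?                         # optional trailing period or comma
    """,
    re.VERBOSE,
)


def respace(text):
    """Hypothetical helper: join the tokens found in text with single spaces."""
    return " ".join(TOKENIZER.findall(text))


print(respace("fiji,295,000"))               # fiji, 295,000
print(respace("the state was 56,869,12th"))  # the state was 56,869, 12th
```

Inside a character class a period does not need escaping, so [.,] here is equivalent to the [\.,] used in the answer.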

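Note that (?:blah) is used to suppress re.findall's natural desire to tell you how each parenthesized group matched individually. In this case we just want it to walk forward through the string and return whole tokens, and ?: accomplishes this.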
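
To see the difference the ?: makes, here is a small sketch that runs only the thousands-grouping part of the pattern through re.findall, once with a capturing group and once with a non-capturing one. The variable names are illustrative.

```python
import re

text = "fiji,295,000"

# With a capturing group, findall returns what the group captured
# (its last repetition), not the whole match.
capturing = re.findall(r"\d{1,3}(,\d{3})+", text)

# With a non-capturing group, findall returns the whole match.
non_capturing = re.findall(r"\d{1,3}(?:,\d{3})+", text)

print(capturing)      # [',000']
print(non_capturing)  # ['295,000']
```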

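【Discussion】:

Thanks. But it creates a problem: it splits instances such as 1.9% into 1. 9% (and all similar examples).
I extended the regex to include this (new) case. By the way, regex101.com is an excellent resource for testing these things.
I found another test case that is affected by the regex. If a decimal value sits at the end of the string, e.g. world 12.9%., it drops the final period, giving world 12.9%. Can you provide a general fix? Thanks.
Sure - moving the period and comma [\.,]? to the end allows cases like %. or 12th, to stay intact.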
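
As a quick check of the cases raised in this discussion, the sketch below runs the tokenizer from the answer on a decimal percentage followed by a period and on an ordinal followed by a comma. The helper name respace and the second sample sentence are made up for illustration.

```python
import re

# Tokenizer copied from the answer; the trailing [\.,]? comes after the suffix group.
tokenizer = re.compile(
    r'[a-zA-Z]+[\.,]?|(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?(?:%|st|nd|rd|th)?[\.,]?'
)


def respace(text):
    """Hypothetical helper: join the tokens found in text with single spaces."""
    return ' '.join(tokenizer.findall(text))


follow_up_cases = ['world 12.9%.', 'rose 1.9% in the 12th,then fell']
results = [respace(s) for s in follow_up_cases]
for line in results:
    print(line)

# world 12.9%.
# rose 1.9% in the 12th, then fell
```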