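Sentence segmentation using Regex: answers

【Question title】: Sentence segmentation using Regex
【Posted】: 2014-09-08 11:35:43
【Question】:

I have a few text (SMS) messages and I want to segment them using the period ('.') as a delimiter. I am unable to handle the following types of messages. How can I segment these messages using a regex in Python?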

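Before segmentation:

'hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u'
'no of beds 8.please inform person in-charge.tq'

After segmentation:

'hyper count 16.8mmol/l' 'plz review b4 5pm' 'just to inform u' 'thank u'
'no of beds 8' 'please inform person in-charge' 'tq'

Each line is a separate message.

Update:

I am doing natural language processing, and I feel it is okay to treat '16.8mmol/l' and 'no of beds 8.2 cups of tea.' the same way. 80% accuracy is enough for me, but I want to reduce false positives as much as possible.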

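【Comments】:

I think your sentences are irregular, so a regex is not a suitable solution unless you provide all the splitting rules.
How do you distinguish a decimal value (16.8) from a sentence that happens to end with a digit, followed by one that starts with a digit (no of beds 8.2 cups of tea)?
I am doing natural language processing, and I feel it is okay to treat 16.8mmol/l and no of beds 8.2 cups of tea. the same way. 80% accuracy is enough for me, but I want to reduce false positives as much as possible.
I suppose it is not possible to tell your test subjects how to write properly? Just a few spaces would go a long way...
@polishchuk But numbers are regular, and a regex can be used to avoid splitting on the dots inside numbers; see my answer.

Tags: python regex text-segmentation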

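【Solution 1】:

A few weeks ago I looked for a regex that would catch every string representing a number within a text, whatever form the number is written in: scientific notation, or even Indian numbers with commas.

I use that regex in the following code to solve your problem.

Contrary to the other answers, in my solution the dot in '8.' is not considered a dot on which to split, because '8.' can be read as a float with no digits after the dot.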

import re

regx = re.compile(r'(?<![\d.])(?!\.\.)'
                  r'(?<![\d.][eE][+-])(?<![\d.][eE])(?<!\d[.,])'
                  '' #--- sign and mantissa -----------
                  r'([+-]?)'
                  r'(?![\d,]*?\.[\d,]*?\.[\d,]*?)'
                  r'(?:0|,(?=0)|(?<!\d),)*'
                  r'(?:'
                  r'((?:\d(?!\.[1-9])|,(?=\d))+)[.,]?'
                  r'|\.(0)'
                  r'|((?<!\.)\.\d+?)'
                  r'|([\d,]+\.\d+?))'
                  r'0*'
                  '' #--- optional exponent -----------
                  r'(?:'
                  r'([eE][+-]?)(?:0|,(?=0))*'
                  r'(?:'
                  r'(?!0+(?=\D|\Z))((?:\d(?!\.[1-9])|,(?=\d))+)[.,]?'
                  r'|((?<!\.)\.(?!0+(?=\D|\Z))\d+?)'
                  r'|([\d,]+\.(?!0+(?=\D|\Z))\d+?))'
                  r'0*'
                  r')?'
                  '' #--- no digit may follow ---------
                  r'(?![.,]?\d)')



simpler_regex = re.compile(r'(?<![\d.])0*(?:'
                           r'(\d+)\.?|\.(0)'
                           r'|(\.\d+?)|(\d+\.\d+?)'
                           r')0*(?![\d.])')


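def split_outnumb(string, regx=regx, a=0):
    # Positions of every dot that lies inside a number matched by regx;
    # these dots must not be used as split points.
    excluded_pos = set(x for mat in regx.finditer(string)
                       for x in range(*mat.span()) if string[x] == '.')
    li = []
    # Cut the string at each remaining dot; a is the start of the current piece.
    for xdot in (x for x, c in enumerate(string) if c == '.' and x not in excluded_pos):
        li.append(string[a:xdot])
        a = xdot + 1
    li.append(string[a:])
    return li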





for sentence in ('hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u',
                 'no of beds 8.please inform person in-charge.tq',
                 'no of beds 8.2 cups of tea.tarabada',
                 'this number .977 is a float',
                 'numbers 214.21E+45 , 478945.E-201 and .12478E+02 are in scientific.notation',
                 'an indian number 12,45,782.258 in this.sentence and 45,78,325. is another',
                 'no dot in this sentence',
                 ''):
    print('sentence         =', sentence)
    print('splitted eyquem  =', split_outnumb(sentence))
    print('splitted eyqu 2  =', split_outnumb(sentence, regx=simpler_regex))
    print('splitted gurney  =', re.split(r"\.(?!\d)", sentence))
    print('splitted stema   =', re.split(r'(?<!\d)\.|\.(?!\d)', sentence))
    print()

Result

sentence         = hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u
splitted eyquem  = ['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
splitted eyqu 2  = ['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
splitted gurney  = ['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
splitted stema   = ['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']

sentence         = no of beds 8.please inform person in-charge.tq
splitted eyquem  = ['no of beds 8.please inform person in-charge', 'tq']
splitted eyqu 2  = ['no of beds 8.please inform person in-charge', 'tq']
splitted gurney  = ['no of beds 8', 'please inform person in-charge', 'tq']
splitted stema   = ['no of beds 8', 'please inform person in-charge', 'tq']

sentence         = no of beds 8.2 cups of tea.tarabada
splitted eyquem  = ['no of beds 8.2 cups of tea', 'tarabada']
splitted eyqu 2  = ['no of beds 8.2 cups of tea', 'tarabada']
splitted gurney  = ['no of beds 8.2 cups of tea', 'tarabada']
splitted stema   = ['no of beds 8.2 cups of tea', 'tarabada']

sentence         = this number .977 is a float
splitted eyquem  = ['this number .977 is a float']
splitted eyqu 2  = ['this number .977 is a float']
splitted gurney  = ['this number .977 is a float']
splitted stema   = ['this number ', '977 is a float']

sentence         = numbers 214.21E+45 , 478945.E-201 and .12478E+02 are in scientific.notation
splitted eyquem  = ['numbers 214.21E+45 , 478945.E-201 and .12478E+02 are in scientific', 'notation']
splitted eyqu 2  = ['numbers 214.21E+45 , 478945.E-201 and .12478E+02 are in scientific', 'notation']
splitted gurney  = ['numbers 214.21E+45 , 478945', 'E-201 and .12478E+02 are in scientific', 'notation']
splitted stema   = ['numbers 214.21E+45 , 478945', 'E-201 and ', '12478E+02 are in scientific', 'notation']

sentence         = an indian number 12,45,782.258 in this.sentence and 45,78,325. is another
splitted eyquem  = ['an indian number 12,45,782.258 in this', 'sentence and 45,78,325. is another']
splitted eyqu 2  = ['an indian number 12,45,782.258 in this', 'sentence and 45,78,325. is another']
splitted gurney  = ['an indian number 12,45,782.258 in this', 'sentence and 45,78,325', ' is another']
splitted stema   = ['an indian number 12,45,782.258 in this', 'sentence and 45,78,325', ' is another']

sentence         = no dot in this sentence
splitted eyquem  = ['no dot in this sentence']
splitted eyqu 2  = ['no dot in this sentence']
splitted gurney  = ['no dot in this sentence']
splitted stema   = ['no dot in this sentence']

sentence         = 
splitted eyquem  = ['']
splitted eyqu 2  = ['']
splitted gurney  = ['']
splitted stema   = ['']

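Edit 1

I added simpler_regex, a simpler number-detecting regex taken from a post of mine in another thread.

It does not detect Indian numbers or numbers in scientific notation, but it actually gives the same results here.

【Comments】:

+1 for comparing the different solutions. Very nice.

【Solution 2】:

You can use a negative lookahead assertion to match a "." that is not followed by a digit, and call re.split on it: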

>>> import re
>>> splitter = r"\.(?!\d)"
>>> s = 'hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u'
>>> re.split(splitter, s)
['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
>>> s = 'no of beds 8.please inform person in-charge.tq'
>>> re.split(splitter, s)
['no of beds 8', 'please inform person in-charge', 'tq']
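
If the messages may end with a period or have spaces after one, the split above leaves empty or padded pieces behind. A minimal sketch of a wrapper that cleans those up, assuming that is the behaviour you want (the name split_message is hypothetical, not part of the answer):

```python
import re

# Same pattern as above: a period that is not followed by a digit.
SPLITTER = re.compile(r"\.(?!\d)")

def split_message(message):
    """Hypothetical helper: split one message, trim whitespace, drop empty pieces."""
    parts = SPLITTER.split(message)
    return [part.strip() for part in parts if part.strip()]

print(split_message('no of beds 8.2 cups of tea. thank u.'))
# ['no of beds 8.2 cups of tea', 'thank u']
```

Note that '8.2' stays in one piece here for the same reason as '16.8': the dot is followed by a digit.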

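【Comments】:

【Solution 3】:

How about:

import re
re.split(r'(?<!\d)\.|\.(?!\d)', 'hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u')

The lookarounds ensure that on one side or the other there is no digit, so this also covers the 16.8 case. If there is a digit on both sides of the dot, this expression does not split.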

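【Comments】:

'(?<!\d)\.|\.(?!\d)' splits on the dots in beds 8.please and in number .977 is. Is that really what you want? If not, the regex pattern must be '(?<!\d)\.(?!\d)'
@eyquem Yes, that is what I want, and it is what I wrote in my explanation. The question is what the OP wants.
Doesn't the sentence "either on one or the other side is not a digit" mean that there must be no digit to the left of the dot and no digit to the right of it? I am not a native English speaker, and sometimes I misunderstand English.
@eyquem No, it means OR. I accept a digit on one side of the dot, but not on both sides.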
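You are right. I don't know why I read "and" where you wrote "or"! I was reading "either on one or the other side is not a digit" as "there must be no digit on the left and on the right of the dot". Still, I think it would be clearer and more precise if you wrote: on one side or the other there is a non-digit.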
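
The difference between the two patterns discussed in these comments can be checked directly. This is only a small sketch on two of the test sentences from Solution 1; the variable names are made up for the illustration:

```python
import re

# Pattern from this answer: split unless there is a digit on BOTH sides of the dot.
either_side = re.compile(r'(?<!\d)\.|\.(?!\d)')
# Stricter pattern from the first comment: split only if NEITHER side is a digit.
neither_side = re.compile(r'(?<!\d)\.(?!\d)')

for text in ('no of beds 8.please inform person in-charge.tq',
             'this number .977 is a float'):
    print('either side  =', either_side.split(text))
    print('neither side =', neither_side.split(text))

# either side  = ['no of beds 8', 'please inform person in-charge', 'tq']
# neither side = ['no of beds 8.please inform person in-charge', 'tq']
# either side  = ['this number ', '977 is a float']
# neither side = ['this number .977 is a float']
```

Which one is right depends on whether '8.' and '.977' should count as numbers in your messages.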

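【Solution 4】:

It depends on your exact sentences, but you could try:

.*?[a-zA-Z0-9]\.(?!\d)

See if that works. This will keep the quotes in the result, but you can remove them afterwards if needed.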
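
The answer does not say how to apply the pattern, so the following is only a guess at the intended usage: a sketch with re.findall, which returns each piece together with its closing period. The variable names are hypothetical.

```python
import re

# Lazily match up to a letter or digit followed by a period
# that is not itself followed by a digit.
pattern = re.compile(r'.*?[a-zA-Z0-9]\.(?!\d)')

s = 'hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u'
matches = pattern.findall(s)
print(matches)
# ['hyper count 16.8mmol/l.', 'plz review b4 5pm.', 'just to inform u.']

# Drop the trailing period from each match.
sentences = [m[:-1] for m in matches]
print(sentences)
# ['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u']
```

Used this way, the final 'thank u' is lost because it has no closing period, so the tail of each message would need separate handling.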

【Comments】:

【Solution 5】:

"...".split(".")

split is a built-in Python string method that separates a string at a given character.

【Comments】:

This will also split on the . in 16.8mmol/l.
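
To make that objection concrete, here is a quick check on part of the first sample message (the variable names are just for the illustration):

```python
s = 'hyper count 16.8mmol/l.plz review b4 5pm'
parts = s.split('.')
print(parts)
# ['hyper count 16', '8mmol/l', 'plz review b4 5pm']
```

The value 16.8 is cut in two, which is exactly the false positive the question wants to avoid.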