【Question title】: Split text with a space that is preceded with a non-letter char
【Posted】: 2020-04-02 11:30:25
【Question】:

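I couldn't find a solution anywhere online, so I'm asking here.

I want to split a given text at every punctuation mark, so not only after each sentence but also after, for example, a comma. So far I've come across the Natural Language Toolkit (nltk) and regular expressions, but without success.

This works well, but doesn't quite do what I want: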

import nltk  # sent_tokenize also needs NLTK's Punkt tokenizer data to be installed

sample_text = """With this example I wanna make the point clear... I hope you get it! There are many coding
languages out there, but which is the best? I would say there's no best. Change my mind - if you can!"""

split_text = nltk.tokenize.sent_tokenize(sample_text)
print(split_text)

#Output: ['With this example I wanna make the point clear...', 'I hope you get it!', 'There are many coding languages out there, but which is the best?', "I would say there's no best.", 'Change my mind - if you can!']

That's already quite good, but ideally I'd like output that also splits the text at commas or hyphens. So the output would look like this:

[
 'With this example I wanna make the point clear...',
 'I hope you get it!',
 'There are many coding languages out there,',
 'but which is the best?',
 "I would say there's no best.",
 'Change my mind -',
 'if you can!'
]

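Would a regular expression be the better tool here? Somehow I couldn't get one to work. Thanks in advance, any help is appreciated!

【Comments】:

  • Try re.findall(r"\w['\w\s]*[^'\w\s]*", sample_text). I think you need to exclude ' and _ from the punctuation.

Tags: python regex split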
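The pattern from the comment above can be tried as follows. This is a minimal sketch: the sample text is the one from the question, joined onto a single line (an assumption made here so the line break does not end up inside a chunk), and the names `pattern` and `chunks` are only illustrative.

```python
import re

# Sample text from the question, written without the embedded line break.
sample_text = (
    "With this example I wanna make the point clear... I hope you get it! "
    "There are many coding languages out there, but which is the best? "
    "I would say there's no best. Change my mind - if you can!"
)

# \w            start each chunk at a word character
# ['\w\s]*      then take word characters, whitespace and apostrophes
# [^'\w\s]*     then take the run of punctuation that ends the chunk
pattern = r"\w['\w\s]*[^'\w\s]*"

# strip() only matters when the text ends without punctuation.
chunks = [chunk.strip() for chunk in re.findall(pattern, sample_text)]
print(chunks)
```

Because the apostrophe is listed with the word characters and `_` is already part of `\w`, "there's" stays in one piece. On this sample the result should be the seven chunks asked for in the question.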


【Solution 1】:

A regular expression works well here. Try this expression with re.split():

[!"\#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]
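A rough sketch of how this character class behaves with re.split(). It assumes the class is meant to cover ASCII punctuation and builds an equivalent one from `string.punctuation`; the sample sentence and variable names are made up for illustration.

```python
import re
import string

text = ("There are many coding languages out there, but which is the best? "
        "Change my mind - if you can!")

# Assumption: an equivalent class built from string.punctuation,
# with re.escape() taking care of the characters that need a backslash.
punct_class = "[" + re.escape(string.punctuation) + "]"

# re.split() cuts at every punctuation character and discards it.
parts = re.split(punct_class, text)

# Drop surrounding whitespace and the empty string left after the final "!".
stripped = [part.strip() for part in parts if part.strip()]
print(stripped)
```

Two things to keep in mind with this approach: the punctuation itself is removed from the pieces, which differs from the output the question asks for, and because the apostrophe is part of the class, a word such as "there's" would be split in two.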

【Comments】:

    【Solution 2】:

    You can split the string on every space that is not preceded by a letter:

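    import re

    # Split on a space whose preceding character is not a letter (re.I makes a-z cover A-Z too)
    split_text = re.split(r'(?<=[^a-z]) ', sample_text, flags=re.I)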
    print(split_text)
    

    Output:

    [
     'With this example I wanna make the point clear...',
     'I hope you get it!',
     'There are many coding languages out there,',
     'but which is the best?',
     "I would say there's no best.",
     'Change my mind -',
     'if you can!'
    ]
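For reference, a self-contained sketch of the lookbehind split. The whitespace normalization step is an addition, not part of the answer: the triple-quoted sample contains a line break between "coding" and "languages", which would otherwise stay inside one of the chunks.

```python
import re

sample_text = """With this example I wanna make the point clear... I hope you get it! There are many coding
languages out there, but which is the best? I would say there's no best. Change my mind - if you can!"""

# Added step (assumption): collapse all whitespace runs, including the
# line break, into single spaces before splitting.
normalized = " ".join(sample_text.split())

# (?<=[^a-z]) is a lookbehind: the space is a split point only when the
# character before it is not a letter. The space itself is consumed,
# while the punctuation stays attached to the preceding chunk.
split_text = re.split(r"(?<=[^a-z]) ", normalized, flags=re.I)
print(split_text)
```

Note that "not a letter" also covers digits, so a space that follows a number is treated as a split point as well.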
    

    【Comments】:

    • Works great, thanks for your help
    • @Keanu No worries - I'm glad I could help.