【Posted】: 2016-11-05 21:24:39
【Question】:
I am trying to split sentences correctly in Python, following normal grammar rules.
The sentence I want to split is:
s = """Mr. Smith bought cheapsite.com for 1.5 million dollars,
i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a
probability of .9 it isn't."""
The expected output is:
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.
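To achieve this I am using regular expressions. After a lot of searching I came across the following regex, which does the trick. new_str is just s with some of the \n characters removed.
import re
# Split on whitespace that follows '.' or '?', unless an abbreviation precedes it
m = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', new_str)
for i in m:
    print(i)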
Mr. Smith bought cheapsite.com for 1.5 million dollars,i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with aprobability of .9 it isn't.
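For reference, here is a minimal self-contained sketch of the run above. The question does not show how new_str is built, so the `s.replace("\n", "")` line is an assumption; it was chosen because deleting the newlines outright reproduces the joined "dollars,i.e." and "aprobability" seen in the output.

```python
import re

s = """Mr. Smith bought cheapsite.com for 1.5 million dollars,
i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a
probability of .9 it isn't."""

# Assumption: the question does not show this step. Removing the newlines
# without inserting a space matches the printed output in the question.
new_str = s.replace("\n", "")

# Split on a whitespace character that follows '.' or '?', unless it is
# preceded by something shaped like "i.e." or like "Mr." / "Jr."
pattern = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s'
m = re.split(pattern, new_str)
for sentence in m:
    print(sentence)
```

Replacing the newlines with a space instead (`s.replace("\n", " ")`) would presumably give the spacing shown in the expected output, but that is a guess about the intent rather than something the question states.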
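This is how I understand the regex. We first select:
1) all character sequences like i.e.
2) from the whitespace left after the first selection, those positions that are not preceded by words like Mr. or Mrs.
3) from the result of the second step, only those positions where there is a period or a question mark followed by whitespace.
So I tried changing the order as follows:
1) Filter out all the titles first.
2) From the filtered result, select the positions followed by whitespace.
3) Remove all phrases like i.e.
But when I do that, the whitespace after i.e. is split as well: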
# Same pattern, with the "i.e." lookbehind moved to after \s
m = re.split(r'(?<![A-Z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.)', new_str)
for i in m:
    print(i)
Mr. Smith bought cheapsite.com for 1.5 million dollars,i.e.
he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with aprobability of .9 it isn't.
Shouldn't the last step of the modified pattern still be able to recognize phrases like i.e.? Why does it fail to detect it?
【Comments】:
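One way to probe this is to compare the patterns on a short fragment. The sketch below rests on how Python's `re` evaluates a lookbehind: it inspects the characters ending at the point where the lookbehind sits in the pattern, not the order in which the filters are listed. Adjacent lookbehinds at the same position can therefore be swapped freely, but once `(?<!\w\.\w.)` is placed after `\s`, the four characters it inspects end with the whitespace itself (`.e. ` rather than `i.e.`). The names `text`, `swapped`, and `shifted` are illustrative only, and `shifted` is a hypothetical variant, not something taken from the question.

```python
import re

text = "dollars,i.e. he paid a lot for it. Did he mind?"

# Pattern from the question: all three lookbehinds sit just before \s
original = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s'

# Reordering lookbehinds that share one position does not change the result
swapped = r'(?<![A-Z][a-z]\.)(?<!\w\.\w.)(?<=\.|\?)\s'

# Modified pattern from the question: the lookbehind now runs after \s,
# so it checks ".e. " (ending in the space), which never looks like "i.e."
moved = r'(?<![A-Z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.)'

# Hypothetical variant: account for the consumed whitespace inside the
# lookbehind, so it inspects "i.e. " again
shifted = r'(?<![A-Z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.\s)'

for name, pattern in [("original", original), ("swapped", swapped),
                      ("moved", moved), ("shifted", shifted)]:
    print(name, re.split(pattern, text))
```

On this fragment, `original`, `swapped`, and `shifted` split only after "it.", while `moved` also splits after "i.e.".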
-
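You would normally use nltk to split text into sentences; writing an exact sentence-splitting regex in Python is impossible (you could try a matching regex instead of a splitting one, but that will be a challenge).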
-
@WiktorStribiżew I agree, but in this case I want to understand the nuances of regex and why changing the order produces an incorrect result.
-
Do you mean that the input in new_str already has its newlines replaced with regular spaces, like here?
Tags: python regex nlp tokenize negative-lookbehind