【Question title】: NLTK tokenize text with dialog into sentences
【Posted】: 2018-03-11 23:38:36
【Question】:

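I can tokenize non-dialog text into sentences, but when I add quotation marks to the sentences, the NLTK tokenizer does not split them correctly. For example, this works as expected: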

import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
text1 = 'Is this one sentence? This is separate. This is a third he said.'
tokenizer.tokenize(text1)

This produces a list of three separate sentences:

['Is this one sentence?', 'This is separate.', 'This is a third he said.']

However, if I turn it into dialog, the same process no longer works.

text2 = '“Is this one sentence?” “This is separate.” “This is a third” he said.'
tokenizer.tokenize(text2)

This returns the whole text as a single sentence:

['“Is this one sentence?” “This is separate.” “This is a third” he said.']

How can I make the NLTK tokenizer work in this case?

【Comments】:

    Tags: python nltk


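    【Solution 1】:

    The tokenizer does not seem to know how to handle directional (curly) quotes. Replace them with regular ASCII double quotes and the example works fine.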

    >>> import re
    >>> import nltk
    >>> text3 = re.sub('[“”]', '"', text2)  # curly double quotes -> ASCII "
    >>> nltk.sent_tokenize(text3)
    ['"Is this one sentence?"', '"This is separate."', '"This is a third" he said.']
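If the text may also contain curly single quotes or apostrophes, the same idea can be wrapped in a small helper. The following is only a sketch: `normalize_quotes` and `split_dialog_sentences` are hypothetical names, and mapping the single quotes is an assumption that goes beyond the example in the question. It uses `str.translate` instead of `re.sub`, and it expects NLTK and its Punkt data to be installed.

```python
# Sketch: normalize typographic quotes before sentence splitting.
# The helper names below are hypothetical, not part of NLTK.

# Translation table from curly quotes to their ASCII counterparts.
_QUOTE_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark (assumption: also normalize)
    "\u2019": "'",  # right single quotation mark, also used as an apostrophe
})


def normalize_quotes(text):
    """Return text with curly quotes replaced by ASCII quotes."""
    return text.translate(_QUOTE_MAP)


def split_dialog_sentences(text):
    """Normalize quotes, then split with NLTK's sentence tokenizer."""
    import nltk  # imported lazily; requires NLTK and its Punkt data
    return nltk.sent_tokenize(normalize_quotes(text))
```

Calling `split_dialog_sentences(text2)` should then behave like the `re.sub` version above. Note that the returned sentences contain ASCII quotes, so keep the original string if the curly quotes need to be preserved.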
    

    【Comments】:
