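I think I've managed to come up with a solution, although it is a guess I arrived at after a lot of code inspection. I created my own Ngram tagger as a subclass of the NLTK NgramTagger class, as follows: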
import nltk

class myNgramTagger(nltk.NgramTagger):
    """
    My override of the NLTK NgramTagger class that considers previous
    tokens rather than previous tags for context.
    """
    def __init__(self, n, train=None, model=None,
                 backoff=None, cutoff=0, verbose=False):
        nltk.NgramTagger.__init__(self, n, train, model, backoff, cutoff, verbose)

    def context(self, tokens, index, history):
        # Original line, which builds the context from the previous tags:
        #tag_context = tuple(history[max(0,index-self._n+1):index])
        # Changed line, which builds it from the previous tokens instead:
        tag_context = tuple(tokens[max(0,index-self._n+1):index])
        return tag_context, tokens[index]
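The only line I changed is the one shown commented out in the context method, where I replaced the history list with the tokens list. I was pretty much just guessing that this would do what I wanted, but it seems to work both with a model and with training data.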
test_sent = ["When","a","small","plane","crashed","into","the","river","a","general","alert","was","a","given"]
tm2 = {
(('When',), 'a') : "XX",
(('into',), 'the') : "YY",
}
tm3 = {
(('a','general'), 'alert') : "ZZ",
}
taggerd = nltk.DefaultTagger('NA')
tagger2w = myNgramTagger(2,model=tm2,backoff=taggerd)
tagger3w = myNgramTagger(3,model=tm3,backoff=tagger2w)
print(tagger3w.tag(test_sent))
[('When', 'NA'), ('a', 'XX'), ('small', 'NA'), ('plane', 'NA'), ('crashed', 'NA'), ('into', 'NA'), ('the', 'YY'), ('river', 'NA'), ('a', 'NA'), ('general', 'NA'), ('alert', 'ZZ'), ('was', 'NA'), ('a', 'NA'), ('given', 'NA')]
So just by changing one word in one method, I seem to have got what I wanted: an Ngram tagger that uses tokens rather than tags as context.
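To make the one-word change concrete, here is a minimal sketch of the two context computations as standalone functions, without NLTK. The function names and the tags in `history` are made up for illustration; only the slicing expression is taken from the class above.

```python
def tag_context(tokens, index, history, n):
    """Context as the stock tagger builds it: the previous n-1 tags."""
    return tuple(history[max(0, index - n + 1):index]), tokens[index]


def token_context(tokens, index, history, n):
    """Context as the override builds it: the previous n-1 tokens."""
    return tuple(tokens[max(0, index - n + 1):index]), tokens[index]


tokens = ["into", "the", "river", "a", "general", "alert"]
# Made-up tags for the five tokens that precede 'alert' (illustration only).
history = ["IN", "AT", "NN", "AT", "JJ"]

print(tag_context(tokens, 5, history, 3))    # (('AT', 'JJ'), 'alert')
print(token_context(tokens, 5, history, 3))  # (('a', 'general'), 'alert')
```

The second result has the same shape as the key used in the tm3 model, which is why a model keyed on previous words only matches once the context is built from tokens.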
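I tried similar training using the Brown corpus with the news category (hence my choice of test sentence), and it seems to work pretty well. It actually does better than using tags, because it manages to tag everything in the sentence that it recognizes, rather than stopping as soon as it sees something it doesn't recognize: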
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_tagger_bigram = myNgramTagger(2,brown_tagged_sents)
brown_tagger_trigram = myNgramTagger(3,brown_tagged_sents,backoff=brown_tagger_bigram)
print(brown_tagger_trigram.tag(test_sent))
[('When', u'WRB'), ('a', u'AT'), ('small', u'JJ'), ('plane', None), ('crashed', None), ('into', None), ('the', u'AT'), ('river', None), ('a', None), ('general', u'JJ'), ('alert', None), ('was', None), ('a', u'AT'), ('given', u'VBN')]
Comparing this with the stock NLTK Ngram tagger shows that it is actually an improvement:
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_tagger_bigram = nltk.NgramTagger(2,brown_tagged_sents)
brown_tagger_trigram = nltk.NgramTagger(3,brown_tagged_sents,backoff=brown_tagger_bigram)
print(brown_tagger_trigram.tag(test_sent))
[('When', u'WRB'), ('a', u'AT'), ('small', u'JJ'), ('plane', None), ('crashed', None), ('into', None), ('the', None), ('river', None), ('a', None), ('general', None), ('alert', None), ('was', None), ('a', None), ('given', None)]
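Tagging with token context gives decent results all the way to the end of the sentence, whereas tagging with tag context only gets as far as the third word.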
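My understanding of why the tag-based version stalls is that once a word gets the tag None, that None becomes part of the context for the following words, and no context seen in training contains a None tag. The toy sketch below imitates this with a plain dictionary lookup instead of NLTK. The `bigram_tag` function and both small models are hypothetical, and 'plane' is deliberately left out of them to play the unknown word.

```python
def bigram_tag(tokens, model, use_tokens):
    """Tag with a toy bigram lookup; unseen contexts get None (no backoff)."""
    tags = []
    for i, word in enumerate(tokens):
        # Pick the sequence the context is drawn from: words or earlier tags.
        source = tokens if use_tokens else tags
        context = (tuple(source[max(0, i - 1):i]), word)
        tags.append(model.get(context))
    return tags


tokens = ["the", "plane", "into", "the", "river"]

# Hypothetical model keyed on the previous tag.
tag_model = {
    ((), "the"): "AT",
    (("NN",), "into"): "IN",
    (("IN",), "the"): "AT",
    (("AT",), "river"): "NN",
}
# Hypothetical model keyed on the previous word.
token_model = {
    ((), "the"): "AT",
    (("plane",), "into"): "IN",
    (("into",), "the"): "AT",
    (("the",), "river"): "NN",
}

print(bigram_tag(tokens, tag_model, use_tokens=False))
# ['AT', None, None, None, None]
print(bigram_tag(tokens, token_model, use_tokens=True))
# ['AT', None, 'IN', 'AT', 'NN']
```

With tag context, the single unknown word poisons every later lookup. With token context, only the unknown word itself stays untagged and the lookup recovers on the next word.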