How to set whitespace tokenizer on NER Model?

【Question title】: How to set whitespace tokenizer on NER Model?
【Posted】: 2016-12-20 00:50:05
【Question】:

I am using CoreNLP 3.6.0 to create a custom NER model.

My properties file is:

# location of the training file 
trainFile = /home/damiano/stanford-ner.tsv 
# location where you would like to save (serialize) your 
# classifier; adding .gz at the end automatically gzips the file, 
# making it smaller, and faster to load 
serializeTo = ner-model.ser.gz

# structure of your training file; this tells the classifier that 
# the word is in column 0 and the correct answer is in column 1 
map = word=0,answer=1

# This specifies the order of the CRF: order 1 means that features 
# apply at most to a class pair of previous class and current class 
# or current class and next class. 
maxLeft=1

# these are the features we'd like to train with 
# some are discussed below, the rest can be 
# understood by looking at NERFeatureFactory 
useClassFeature=true 
useWord=true 
# word character ngrams will be included up to length 6 as prefixes 
# and suffixes only  
useNGrams=true 
noMidNGrams=true 
maxNGramLeng=6 
usePrev=true 
useNext=true 
useDisjunctive=true 
useSequences=true 
usePrevSequences=true 
# the last 4 properties deal with word shape features 
useTypeSeqs=true 
useTypeSeqs2=true 
useTypeySequences=true 
wordShape=chris2useLC

I build the model with this command:

java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier  -prop /home/damiano/stanford-ner.prop

The problem appears when I use this model to retrieve the entities in a text file. The command is:

java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -textFile file.txt

file.txt is:

Hello!
my
name
is
John.

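The output is:

Hello/O !/O my/O name/O is/O John/PERSON ./O

As you can see, it splits "Hello!" into two tokens. The same happens with "John.".

I have to use a whitespace tokenizer.

How can I set it?

Why does CoreNLP split these words into two tokens?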

【Comments】:

Tags: machine-learning nlp stanford-nlp

【Answer 1】:

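Update. If you want to use a whitespace tokenizer here, ~~simply add tokenize.whitespace=true to your properties file.~~ see Christopher Manning's answer.

However, to answer your second question, "Why does CoreNLP split these words into two tokens?", I would suggest keeping the default tokenizer (which is PTBTokenizer), because it simply gives better results. The usual reason to switch to whitespace tokenization is a high demand on processing speed or (usually "and") a low demand on tokenization quality. Since you are going to use the tokens for further NER, I doubt that this is your case.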

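Even in your example, if you end up with the token John. after tokenization, it cannot be matched by a gazetteer or by training examples. More details, and the reasons why tokenization is not that simple, can be found here.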
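The gazetteer point can be illustrated with a minimal, self-contained sketch. It does not use CoreNLP at all: the class name `TokenizationDemo`, the one-entry gazetteer, and the regular expression are all assumptions made for illustration, and the punctuation-splitting rule is only a toy stand-in, not PTBTokenizer's actual behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenizationDemo {

    // Toy rule (an assumption, NOT PTBTokenizer's rule set): a token is either
    // a run of letters/digits or a single non-space, non-alphanumeric symbol.
    private static final Pattern WORD_OR_PUNCT =
            Pattern.compile("[\\p{L}\\p{N}]+|[^\\s\\p{L}\\p{N}]");

    // Splits on runs of whitespace only, so punctuation stays glued to words.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Separates punctuation from words using the toy pattern above.
    static List<String> punctuationSplitTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD_OR_PUNCT.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Returns the tokens that exactly match an entry of the gazetteer.
    static List<String> gazetteerHits(List<String> tokens, Set<String> gazetteer) {
        List<String> hits = new ArrayList<>();
        for (String t : tokens) {
            if (gazetteer.contains(t)) {
                hits.add(t);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "Hello!\nmy\nname\nis\nJohn.";   // same content as file.txt
        Set<String> names = Set.of("John");            // hypothetical gazetteer

        List<String> ws = whitespaceTokens(text);
        List<String> split = punctuationSplitTokens(text);

        System.out.println(ws + " -> hits " + gazetteerHits(ws, names));
        System.out.println(split + " -> hits " + gazetteerHits(split, names));
    }
}
```

With whitespace splitting the last token is `John.`, which does not equal the gazetteer entry `John`, so no hit is found; once the period is separated, the lookup succeeds.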

【Comments】:

【Answer 2】:

You can set your own tokenizer by specifying its class name with the tokenizerFactory flag/property:

tokenizerFactory = edu.stanford.nlp.process.WhitespaceTokenizer$WhitespaceTokenizerFactory

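You can specify any class that implements the Tokenizer<T> interface, but the included WhitespaceTokenizer sounds like what you want. If the tokenizer has options, you can specify them with tokenizerOptions. For example, here, if you also specify: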

tokenizerOptions = tokenizeNLs=true

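then the newlines in your input will be preserved in the output (for output options that don't always convert things into a one-token-per-line format).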

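Note: options such as tokenize.whitespace=true apply at the CoreNLP level. If they are given to an individual component such as CRFClassifier, they are not interpreted (you get a warning saying that the option is ignored).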

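As Nikita Astrakhantsev notes, this isn't necessarily a good thing to do. Doing it at test time is only correct if your training data is also whitespace-separated; otherwise it will hurt performance. And having tokens like the ones you get from whitespace separation is bad for subsequent NLP processing such as parsing.

【Comments】:

I was stuck on this for an hour. Thanks, Chris.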