【Posted】: 2016-12-20 00:50:05
【Question】:
I am using CoreNLP 3.6.0 to create a custom NER model.
My props file is:
# location of the training file
trainFile = /home/damiano/stanford-ner.tsv
# location where you would like to save (serialize) your
# classifier; adding .gz at the end automatically gzips the file,
# making it smaller, and faster to load
serializeTo = ner-model.ser.gz
# structure of your training file; this tells the classifier that
# the word is in column 0 and the correct answer is in column 1
map = word=0,answer=1
# This specifies the order of the CRF: order 1 means that features
# apply at most to a class pair of previous class and current class
# or current class and next class.
maxLeft=1
# these are the features we'd like to train with
# some are discussed below, the rest can be
# understood by looking at NERFeatureFactory
useClassFeature=true
useWord=true
# word character ngrams will be included up to length 6 as prefixes
# and suffixes only
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useDisjunctive=true
useSequences=true
usePrevSequences=true
# the last 4 properties deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
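For readers unfamiliar with the `map = word=0,answer=1` setting, here is a minimal Python sketch of how a two-column, tab-separated training file lines up with that mapping. It is not CoreNLP code. The sample rows and the helper name `read_training_rows` are hypothetical, since the actual contents of stanford-ner.tsv are not shown in this post.

```python
import csv
import io

# Hypothetical sample rows: column 0 is the word, column 1 is the answer
# (label). The real stanford-ner.tsv is not shown in the post.
SAMPLE_TSV = "Hello\tO\nmy\tO\nname\tO\nis\tO\nJohn\tPERSON\n"


def read_training_rows(handle):
    """Yield (word, answer) pairs from a tab-separated file,
    mirroring map = word=0,answer=1."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            # Skip blank lines (often used as sentence or document separators).
            continue
        yield row[0], row[1]


rows = list(read_training_rows(io.StringIO(SAMPLE_TSV)))
print(rows)
```

Each row carries exactly one token, so the way the training file was tokenized should match the way the input text is tokenized at prediction time.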
I build the model with this command:
java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -prop /home/damiano/stanford-ner.prop
The problem appears when I use this model to extract entities from a text file. The command is:
java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -textFile file.txt
file.txt contains:
Hello!
my
name
is
John.
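The output is:
Hello/O !/O my/O name/O is/O John/PERSON ./O
As you can see, it splits "Hello!" into two tokens. The same happens with "John.".
I need to use a whitespace tokenizer.
How do I set that up?
Why does CoreNLP split these words into two tokens?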
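To make the difference concrete, here is a minimal Python sketch contrasting the two behaviours. It is not CoreNLP code: `punctuation_aware_tokenize` is a hypothetical, much simplified stand-in for a Penn Treebank-style tokenizer (which, as far as I know, is what CRFClassifier applies to `-textFile` input by default), while `whitespace_tokenize` shows the behaviour I am after.

```python
import re


def whitespace_tokenize(text):
    """Split only on runs of whitespace; punctuation stays attached."""
    return text.split()


def punctuation_aware_tokenize(text):
    """Simplified stand-in for a PTB-style tokenizer: runs of word
    characters and individual punctuation marks become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


sample = "Hello!\nmy\nname\nis\nJohn."

whitespace_tokens = whitespace_tokenize(sample)
punctuation_tokens = punctuation_aware_tokenize(sample)

print(whitespace_tokens)
print(punctuation_tokens)
```

The first list keeps "Hello!" and "John." intact as five tokens, and the second splits off the punctuation into seven tokens, which matches the output shown above.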
【Discussion】:
Tags: machine-learning nlp stanford-nlp