如何为 CoreNLP 提供一些预先标记的命名实体？答案

【问题标题】：How to feed CoreNLP some pre-labeled Named Entities?如何为 CoreNLP 提供一些预先标记的命名实体？
【发布时间】：2019-01-30 15:08:06
【问题描述】：

我想使用 Standford CoreNLP 提取 Coreferences 并开始处理预标记文本的依赖关系。我最终希望在相关的命名实体之间建立图形节点和边。我在 python 中工作，但使用 nltk 的 java 函数直接调用“edu.stanford.nlp.pipeline.StanfordCoreNLP”jar（这就是 nltk 在幕后所做的）。

我的预标记文本是这种格式：

PRE-LABELED:  During his youth, [PERSON: Alexander III of Macedon] was tutored by [PERSON: Aristotle] until age 16.  Following the conquest of [LOCATION: Anatolia], [PERSON: Alexander] broke the power of [LOCATION: Persia] in a series of decisive battles, most notably the battles of [LOCATION: Issus] and [LOCATION: Gaugamela].  He subsequently overthrew [PERSON: Persian King Darius III] and conquered the [ORGANIZATION: Achaemenid Empire] in its entirety.

我尝试做的是自己标记我的句子，构建一个 IOB 格式的元组列表：[ ("During","O"), ("his","O"), ("youth"," O"), ("Alexander","B-PERSON"), ("III","I-PERSON"), ...]

但是，我不知道如何告诉 CoreNLP 将此元组列表作为起点，构建最初未标记的其他命名实体，并在这些新的、更高质量的标记化句子上找到共同引用。我显然尝试简单地去除我的标签，并让 CoreNLP 自己做这件事，但 CoreNLP 在查找命名实体方面不如人工标记的预标记文本那么好。

我需要如下输出。我知道以这种方式使用 Dependencies 来获取 Edge 会很困难，但我需要看看我能走多远。

DESIRED OUTPUT:
[Person 1]:
Name: Alexander III of Macedon
Mentions:
* "Alexander III of Macedon"; Sent1 [4,5,6,7] # List of tokens
* "Alexander"; Sent2 [6]
* "He"; Sent3 [1]
Edges:
* "Person 2"; "tutored by"; "Aristotle"

[Person 2]:
Name: Aristotle
[....]

我怎样才能为 CoreNLP 提供一些预先识别的命名实体，并且仍然获得有关其他命名实体、共指和基本依赖项的帮助？

附：请注意，这不是 NLTK Named Entity Recognition with Custom Data 的副本。我不是想用我预先标记的 NER 训练一个新的分类器，我只是想在运行共同引用（包括提及）和对给定句子的依赖时将 CoreNLP 添加到我自己的分类器中。

【问题讨论】：

标签： python nltk stanford-nlp named-entity-recognition

【解决方案1】：

答案是用Additional TokensRegexNER Rules 制作一个规则文件。

我使用正则表达式将带标签的名称分组。从这里我构建了一个规则临时文件，我用-ner.additional.regexner.mapping mytemprulesfile 将它传递给corenlp jar。

Alexander III of Macedon    PERSON      PERSON,LOCATION,ORGANIZATION,MISC
Aristotle                   PERSON      PERSON,LOCATION,ORGANIZATION,MISC
Anatolia                    LOCATION    PERSON,LOCATION,ORGANIZATION,MISC
Alexander                   PERSON      PERSON,LOCATION,ORGANIZATION,MISC
Persia                      LOCATION    PERSON,LOCATION,ORGANIZATION,MISC
Issus                       LOCATION    PERSON,LOCATION,ORGANIZATION,MISC
Gaugamela                   LOCATION    PERSON,LOCATION,ORGANIZATION,MISC
Persian King Darius III     PERSON      PERSON,LOCATION,ORGANIZATION,MISC
Achaemenid Empire           ORGANIZATION    PERSON,LOCATION,ORGANIZATION,MISC

为了便于阅读，我已对齐此列表，但这些是制表符分隔的值。

一个有趣的发现是，一些多词预先标记的实体保持最初标记的多词，而在没有规则文件的情况下运行 corenlp 有时会将这些标记拆分为单独的实体。

我曾想专门识别命名实体标记，认为它会使共同引用更容易，但我想现在就可以了。无论如何，在一个文档中实体名称相同但不相关的频率是多少？

示例 （执行大约需要 70 秒）

import os, re, tempfile, json, nltk, pprint
from subprocess import PIPE
from nltk.internals import (
    find_jar_iter,
    config_java,
    java,
    _java_options,
    find_jars_within_path,
)

def ExtractLabeledEntitiesByRegex( text, regex ):
    rgx = re.compile(regex)
    nelist = []
    for mobj in rgx.finditer( text ):
        ne = mobj.group('ner')
        try:
            tag = mobj.group('tag')
        except IndexError:
            tag = 'PERSON'
        mstr = text[mobj.start():mobj.end()]
        nelist.append( (ne,tag,mstr) )
    cleantext = rgx.sub("\g<ner>", text)
    return (nelist, cleantext)

def GenerateTokensNERRules( nelist ):
    rules = ""
    for ne in nelist:
        rules += ne[0]+'\t'+ne[1]+'\tPERSON,LOCATION,ORGANIZATION,MISC\n'
    return rules

def GetEntities( origtext ):
    nelist, cleantext = ExtractLabeledEntitiesByRegex( origtext, '(\[(?P<tag>[a-zA-Z]+)\:\s*)(?P<ner>(\s*\w)+)(\s*\])' )

    origfile = tempfile.NamedTemporaryFile(mode='r+b', delete=False)
    origfile.write( cleantext.encode('utf-8') )
    origfile.flush()
    origfile.seek(0)
    nerrulefile = tempfile.NamedTemporaryFile(mode='r+b', delete=False)
    nerrulefile.write( GenerateTokensNERRules(nelist).encode('utf-8') )
    nerrulefile.flush()
    nerrulefile.seek(0)

    java_options='-mx4g'
    config_java(options=java_options, verbose=True)
    stanford_jar = '../stanford-corenlp-full-2018-10-05/stanford-corenlp-3.9.2.jar'
    stanford_dir = os.path.split(stanford_jar)[0]
    _classpath = tuple(find_jars_within_path(stanford_dir))

    cmd = ['edu.stanford.nlp.pipeline.StanfordCoreNLP',
        '-annotators','tokenize,ssplit,pos,lemma,ner,parse,coref,coref.mention,depparse,natlog,openie,relation',
        '-ner.combinationMode','HIGH_RECALL',
        '-ner.additional.regexner.mapping',nerrulefile.name,
        '-coref.algorithm','neural',
        '-outputFormat','json',
        '-file',origfile.name
        ]

    # java( cmd, classpath=_classpath, stdout=PIPE, stderr=PIPE )
    stdout, stderr = java( cmd, classpath=_classpath, stdout=PIPE, stderr=PIPE )    # Couldn't get working- stdin=textfile
    PrintJavaOutput( stdout, stderr )

    origfilenametuple = os.path.split(origfile.name)
    jsonfilename = origfilenametuple[len(origfilenametuple)-1] + '.json'

    os.unlink( origfile.name )
    os.unlink( nerrulefile.name )
    origfile.close()
    nerrulefile.close()

    with open( jsonfilename ) as jsonfile:
        jsondata = json.load(jsonfile)

    currentid = 0
    entities = []
    for sent in jsondata['sentences']:
        for thisentity in sent['entitymentions']:
            tag = thisentity['ner']
            if tag == 'PERSON' or tag == 'LOCATION' or tag == 'ORGANIZATION':
                entity = {
                    'id':currentid,
                    'label':thisentity['text'],
                    'tag':tag
                }
                entities.append( entity )
                currentid += 1

    return entities

#### RUN ####
corpustext = "During his youth, [PERSON:Alexander III of Macedon] was tutored by [PERSON: Aristotle] until age 16.  Following the conquest of [LOCATION: Anatolia], [PERSON: Alexander] broke the power of [LOCATION: Persia] in a series of decisive battles, most notably the battles of [LOCATION: Issus] and [LOCATION: Gaugamela].  He subsequently overthrew [PERSON: Persian King Darius III] and conquered the [ORGANIZATION: Achaemenid Empire] in its entirety."

entities = GetEntities( corpustext )
for thisent in entities:
    pprint.pprint( thisent )

输出

{'id': 0, 'label': 'Alexander III of Macedon', 'tag': 'PERSON'}
{'id': 1, 'label': 'Aristotle', 'tag': 'PERSON'}
{'id': 2, 'label': 'his', 'tag': 'PERSON'}
{'id': 3, 'label': 'Anatolia', 'tag': 'LOCATION'}
{'id': 4, 'label': 'Alexander', 'tag': 'PERSON'}
{'id': 5, 'label': 'Persia', 'tag': 'LOCATION'}
{'id': 6, 'label': 'Issus', 'tag': 'LOCATION'}
{'id': 7, 'label': 'Gaugamela', 'tag': 'LOCATION'}
{'id': 8, 'label': 'Persian King Darius III', 'tag': 'PERSON'}
{'id': 9, 'label': 'Achaemenid Empire', 'tag': 'ORGANIZATION'}
{'id': 10, 'label': 'He', 'tag': 'PERSON'}

【讨论】：