How to extract Named Entity + Verb from text (answers)

【Question title】: How to extract Named Entity + Verb from text
【Posted】: 2016-11-08 05:03:50
【Question description】:

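My aim is to extract each NE (person) from a text together with the verbs connected to it. For example, I have this text:

Dumbledore turned and walked back down the street. Harry Potter rolled over inside his blankets without waking up.

As an ideal result I should get

Dumbledore turned and walked; Harry Potter rolled

I use Stanford NER to find and tag persons, then I delete every sentence that does not contain an NE. So in the end I have a "pure" text that consists only of sentences with character names. After that I run Stanford Dependencies. As a result I get something like this (CoNLL-U output format):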

1   Dumbledore  _   _   NN  _   2   nsubj   _   _
2   turned  _   _   VBD _   0   root    _   _
3   and _   _   CC  _   2   cc  _   _
4   walked  _   _   VBD _   2   conj    _   _
5   back    _   _   RB  _   4   advmod  _   _
6   down    _   _   IN  _   8   case    _   _
7   the _   _   DT  _   8   det _   _
8   street  _   _   NN  _   4   nmod    _   _
9   .   _   _   .   _   2   punct   _   _

1   Harry   _   _   NNP _   2   compound    _   _
2   Potter  _   _   NNP _   3   nsubj   _   _
3   rolled  _   _   VBD _   0   root    _   _
4   over    _   _   IN  _   3   compound:prt    _   _
5   inside  _   _   IN  _   7   case    _   _
6   his _   _   PRP$    _   7   nmod:poss   _   _
7   blankets    _   _   NNS _   3   nmod    _   _
8   without _   _   IN  _   9   mark    _   _
9   waking  _   _   VBG _   3   advcl   _   _
10  up  _   _   RP  _   9   compound:prt    _   _
11  .   _   _   .   _   3   punct   _   _

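That is where all my problems begin. I know the person and the verb, but I don't know how to extract them from this format. I guess I could do it this way: find an NN/NNP row in the table, find its "parent" (the row its HEAD column points to), and then extract all the "child" words of that parent. In theory it should work. In theory.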
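The parent/child idea can be tried directly on the CoNLL-U rows without any library. What follows is only a minimal sketch under a few assumptions: rows are whitespace-separated with the ten standard columns, every ID is a plain integer (no multiword-token ranges), and only `nsubj`, `compound` and `conj` are followed. The class and method names (`ConlluSubjectVerbs`, `subjectVerbPairs`) are made up for illustration and are not part of Stanford CoreNLP.

```java
import java.util.ArrayList;
import java.util.List;

class ConlluSubjectVerbs {

  /** One row of the table: only the columns this sketch needs. */
  static final class Token {
    final int id;
    final String form;
    final String xpos;
    final int head;
    final String deprel;

    Token(String[] cols) {
      id = Integer.parseInt(cols[0]);   // ID
      form = cols[1];                   // FORM
      xpos = cols[4];                   // XPOS (Penn tag such as NNP, VBD)
      head = Integer.parseInt(cols[6]); // HEAD
      deprel = cols[7];                 // DEPREL
    }
  }

  /** Returns one "subject: verb, verb" string per noun subject of a single sentence. */
  static List<String> subjectVerbPairs(String sentenceBlock) {
    List<Token> tokens = new ArrayList<>();
    for (String line : sentenceBlock.split("\n")) {
      line = line.trim();
      if (line.isEmpty() || line.startsWith("#")) continue; // skip blanks and comments
      tokens.add(new Token(line.split("\\s+")));
    }
    List<String> pairs = new ArrayList<>();
    for (Token subj : tokens) {
      if (!subj.deprel.equals("nsubj") || !subj.xpos.startsWith("NN")) continue;
      // Rebuild multi-word names: "compound" children of the subject come first.
      StringBuilder name = new StringBuilder();
      for (Token t : tokens) {
        if (t.head == subj.id && t.deprel.equals("compound")) name.append(t.form).append(' ');
      }
      name.append(subj.form);
      // The parent verb, plus any verb coordinated with it through "conj".
      StringBuilder verbs = new StringBuilder();
      for (Token t : tokens) {
        boolean isParent = t.id == subj.head;
        boolean isConj = t.head == subj.head && t.deprel.equals("conj");
        if ((isParent || isConj) && t.xpos.startsWith("VB")) {
          if (verbs.length() > 0) verbs.append(", ");
          verbs.append(t.form);
        }
      }
      pairs.add(name + ": " + verbs);
    }
    return pairs;
  }

  public static void main(String[] args) {
    String first = String.join("\n",
        "1 Dumbledore _ _ NN _ 2 nsubj _ _",
        "2 turned _ _ VBD _ 0 root _ _",
        "3 and _ _ CC _ 2 cc _ _",
        "4 walked _ _ VBD _ 2 conj _ _",
        "5 back _ _ RB _ 4 advmod _ _",
        "6 down _ _ IN _ 8 case _ _",
        "7 the _ _ DT _ 8 det _ _",
        "8 street _ _ NN _ 4 nmod _ _",
        "9 . _ _ . _ 2 punct _ _");
    String second = String.join("\n",
        "1 Harry _ _ NNP _ 2 compound _ _",
        "2 Potter _ _ NNP _ 3 nsubj _ _",
        "3 rolled _ _ VBD _ 0 root _ _",
        "4 over _ _ IN _ 3 compound:prt _ _",
        "9 waking _ _ VBG _ 3 advcl _ _",
        "11 . _ _ . _ 3 punct _ _");
    System.out.println(subjectVerbPairs(first));  // [Dumbledore: turned, walked]
    System.out.println(subjectVerbPairs(second)); // [Harry Potter: rolled]
  }
}
```

Note that this keys on the dependency label only, so it would also pick up common nouns that happen to be subjects; filtering by the PERSON tag from the NER step is still needed.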

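So the question is: can anyone suggest another way to get a person and his or her actions from a text? Or is there a more sensible approach?

I would be very grateful for any help!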

【Question discussion】:

Tags: java nlp stanford-nlp

【Solution 1】:

Here is some sample code that could help with your problem:

import java.io.*;
import java.util.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.semgraph.*;
import edu.stanford.nlp.util.*;



// Prints every dependency edge and every entity mention found in a sample sentence.
public class NERAndVerbExample {

  public static void main(String[] args) throws IOException {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,depparse,entitymentions");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    String text = "John Smith went to the store.";
    Annotation annotation = new Annotation(text);
    pipeline.annotate(annotation);
    System.out.println("---");
    System.out.println("text: " + text);
    System.out.println("");
    System.out.println("dependency edges:");
    for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
      SemanticGraph sg = sentence.get(SemanticGraphCoreAnnotations.CollapsedDependenciesAnnotation.class);
      for (SemanticGraphEdge sge : sg.edgeListSorted()) {
        System.out.println(
                sge.getGovernor().word() + "," + sge.getGovernor().index() + "," + sge.getGovernor().tag() + "," +
                        sge.getGovernor().ner()
                        + " - " + sge.getRelation().getLongName()
                        + " -> "
                        + sge.getDependent().word() + ","
                        + sge.getDependent().index() + "," + sge.getDependent().tag() + "," + sge.getDependent().ner());
      }
      System.out.println();
      System.out.println("entity mentions:");
      for (CoreMap entityMention : sentence.get(CoreAnnotations.MentionsAnnotation.class)) {
        int lastTokenIndex = entityMention.get(CoreAnnotations.TokensAnnotation.class).size()-1;
        System.out.println(entityMention.get(CoreAnnotations.TextAnnotation.class) +
                "\t" +
                entityMention.get(CoreAnnotations.TokensAnnotation.class)
                        .get(lastTokenIndex).get(CoreAnnotations.IndexAnnotation.class) + "\t" +
                entityMention.get(CoreAnnotations.NamedEntityTagAnnotation.class));
      }
    }
  }
}

I am hoping to add some syntactic sugar in Stanford CoreNLP 3.8.0 to make entity mentions easier to work with.

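To explain this code a little: the entitymentions annotator basically goes through the tokens and groups together consecutive tokens that have the same NER tag. So "John Smith" gets marked as a single entity mention.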
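To make the grouping idea concrete, here is a simplified illustration in plain Java. It is not the annotator's actual implementation, which applies more rules than this; it only merges runs of adjacent tokens that share a tag other than "O". The class and method names (`NerSpanGrouper`, `groupMentions`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class NerSpanGrouper {

  /** Merges runs of adjacent tokens with the same non-"O" tag into "text<TAB>tag" strings. */
  static List<String> groupMentions(String[] words, String[] nerTags) {
    List<String> mentions = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    String currentTag = "O";
    for (int i = 0; i < words.length; i++) {
      String tag = nerTags[i];
      if (!tag.equals(currentTag)) {
        // The tag changed: close the open mention, if there is one.
        if (!currentTag.equals("O")) mentions.add(current + "\t" + currentTag);
        current.setLength(0);
        currentTag = tag;
      }
      if (!tag.equals("O")) {
        if (current.length() > 0) current.append(' ');
        current.append(words[i]);
      }
    }
    // Close a mention that runs to the end of the sentence.
    if (!currentTag.equals("O")) mentions.add(current + "\t" + currentTag);
    return mentions;
  }

  public static void main(String[] args) {
    String[] words = {"John", "Smith", "went", "to", "the", "store", "."};
    String[] tags = {"PERSON", "PERSON", "O", "O", "O", "O", "O"};
    for (String mention : groupMentions(words, tags)) {
      System.out.println(mention); // John Smith	PERSON
    }
  }
}
```

One limitation of this simplification: two different people named back to back with no token in between would be merged into one mention.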

If you go through the dependency graph, you can get the index of each word.

Likewise, if you access the list of tokens of an entity mention, you can find the index of each word in that mention.

With some more code you can link the two together through those indices and form the entity mention + verb pairs you are asking for.
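One possible way to do that linking is sketched here in plain Java, independent of CoreNLP so that it runs on its own. The `Edge` and `Mention` classes are hypothetical stand-ins for the values the sample code prints (governor word, index and tag, relation, dependent index, and the token indices of a mention), and the edges in `main` are written by hand rather than taken from real parser output. It assumes the relation is available by its short name (`nsubj`) and does not handle passive subjects.

```java
import java.util.ArrayList;
import java.util.List;

class MentionVerbLinker {

  /** A dependency edge reduced to the fields needed for linking (hypothetical type). */
  static final class Edge {
    final String govWord;
    final int govIndex;
    final String govTag;
    final String relation;
    final int depIndex;

    Edge(String govWord, int govIndex, String govTag, String relation, int depIndex) {
      this.govWord = govWord;
      this.govIndex = govIndex;
      this.govTag = govTag;
      this.relation = relation;
      this.depIndex = depIndex;
    }
  }

  /** An entity mention with the 1-based indices of its first and last token (hypothetical type). */
  static final class Mention {
    final String text;
    final int firstIndex;
    final int lastIndex;

    Mention(String text, int firstIndex, int lastIndex) {
      this.text = text;
      this.firstIndex = firstIndex;
      this.lastIndex = lastIndex;
    }
  }

  /** Pairs each mention with every verb that governs one of its tokens through nsubj. */
  static List<String> link(List<Mention> mentions, List<Edge> edges) {
    List<String> pairs = new ArrayList<>();
    for (Mention m : mentions) {
      for (Edge e : edges) {
        boolean insideMention = e.depIndex >= m.firstIndex && e.depIndex <= m.lastIndex;
        if (insideMention && e.relation.equals("nsubj") && e.govTag.startsWith("VB")) {
          pairs.add(m.text + " -> " + e.govWord);
        }
      }
    }
    return pairs;
  }

  public static void main(String[] args) {
    // Hand-written edges for "John Smith went to the store." (assumed, not parser output).
    List<Edge> edges = new ArrayList<>();
    edges.add(new Edge("went", 3, "VBD", "nsubj", 2));
    edges.add(new Edge("Smith", 2, "NNP", "compound", 1));
    edges.add(new Edge("went", 3, "VBD", "punct", 7));
    List<Mention> mentions = new ArrayList<>();
    mentions.add(new Mention("John Smith", 1, 2));
    System.out.println(link(mentions, edges)); // [John Smith -> went]
  }
}
```

Matching on the whole index range of the mention, rather than on its last token only, means the pair is still found when the parser attaches the subject relation to a different word of the name.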

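As you can see from the current code, accessing information about entity mentions is rather cumbersome, so I am going to try to improve that in 3.8.0.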

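【Discussion】:

Oh, thank you very much! Just one question: I can't even compile your code to see how it works. The compiler reports "reached end of file while parsing". Maybe I am just doing something wrong? Are there any resources where I can read about entity mentions and indices? By the way, I have read about SemRegex. It seems to me that this tool could also help to find NE + Verb pairs. Is that right? Anyway, thanks for your help!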