Using regular expressions to extract leaf nodes in phrase structure trees

【Question title】: Using regular expressions to extract leaf nodes in phrase structure trees
【Posted】: 2013-02-23 18:10:46
【Question】:

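I want to use a regular expression in Java to extract the leaf nodes of a sentence's phrase structure tree. For example, for the sentence "This is an easy sentence." I have the following parse:

Input: (ROOT (S (NP (DT This)) (VP (VBZ is) (NP (DT an) (JJ easy) (NN sentence))) (. .)))

I want to extract the leaf nodes with a regular expression.

Output: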

DT This
VBZ is
DT an
JJ easy
NN sentence
.  .

【Comments】:

kore-nordmann.de/blog/do_NOT_parse_using_regexp.html

Tags: regex nlp stanford-nlp

【Solution 1】:

As long as the groups you want contain no nested parentheses (which is true of the leaves here), you can use this:

(?<=\()[^()]+(?=\))

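See it here on Regexr.

(?<=\() is a lookbehind assertion that ensures the match is immediately preceded by a "(".

(?=\)) is a lookahead assertion that ensures the match is immediately followed by a ")".

[^()]+ is a negated character class that matches one or more characters of any kind except parentheses.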
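To see how the lookaround pattern behaves on the question's input, here is a minimal Java sketch. The class name LookaroundLeaves and the method leaves are hypothetical names chosen for illustration; only the pattern itself comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class LookaroundLeaves {
    // Text that sits directly between "(" and ")" and contains no parentheses.
    private static final Pattern LEAF = Pattern.compile("(?<=\\()[^()]+(?=\\))");

    // Returns each innermost group as one string, e.g. "DT This".
    static List<String> leaves(String tree) {
        List<String> result = new ArrayList<>();
        Matcher m = LEAF.matcher(tree);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "(ROOT (S (NP (DT This)) (VP (VBZ is) "
                + "(NP (DT an) (JJ easy) (NN sentence))) (. .)))";
        leaves(input).forEach(System.out::println);
    }
}
```

Groups such as "ROOT " or "NP " are skipped because they are followed by "(" rather than ")", so only the innermost tag/word pairs match. Each match is a single string; the tag and the word are not separated.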

【Comments】:

Works great! Thank you!

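【Solution 2】:

Assuming you are using Stanford NLP, as the tags on this question suggest:

A simpler way is to use the built-in getLeaves() method of the Tree class. The leaves it returns are the word nodes; each word's tag is the label of its parent node.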

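【Comments】:

【Solution 3】:

The regular expression you need is \(([^ ]+) +([^()]+)\)

It will match:
\( an opening parenthesis,
([^ ]+) then one or more characters other than a space (captured as group 1),
" +" (a space followed by +) then one or more spaces,
([^()]+) then one or more characters other than parentheses (captured as group 2),
\) and finally a closing parenthesis.

To use it in Java, precompile the pattern in your class: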

static final Pattern leaf = Pattern.compile("\\(([^ ]+) +([^()]+)\\)");

Then create a matcher for each input string and loop on its find method:

Matcher m = leaf.matcher(input);
while (m.find()) {
    // here do something with each leaf,
    // where m.group(1) is the node type (DT, VBZ...)
    // and m.group(2) is the word
}
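Putting the two fragments above together, a self-contained version might look like the sketch below. The class name LeafExtractor and the method extract are hypothetical; the pattern and the find loop are the ones from this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class for illustration only.
class LeafExtractor {
    // group(1) is the node type (DT, VBZ, ...), group(2) is the word.
    static final Pattern LEAF = Pattern.compile("\\(([^ ]+) +([^()]+)\\)");

    // Returns one "TAG word" string per leaf, in sentence order.
    static List<String> extract(String input) {
        List<String> leaves = new ArrayList<>();
        Matcher m = LEAF.matcher(input);
        while (m.find()) {
            leaves.add(m.group(1) + " " + m.group(2));
        }
        return leaves;
    }

    public static void main(String[] args) {
        String input = "(ROOT (S (NP (DT This)) (VP (VBZ is) "
                + "(NP (DT an) (JJ easy) (NN sentence))) (. .)))";
        for (String leaf : extract(input)) {
            System.out.println(leaf);
        }
    }
}
```

Run on the question's input, this prints one tag/word pair per line, starting with "DT This". Because the tag and the word are captured in separate groups, they can also be stored separately instead of being joined into one string.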

【Comments】:

Thanks, Tobia, for your enthusiasm. Very helpful!