How to convert a coreNLP-generated parse tree into the data.tree R package format

【Question title】: How to convert coreNLP generated parse tree into data.tree R package
【Posted】: 2016-02-19 03:02:24
【Question】:

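I want to convert the parse tree generated by the R package coreNLP into the format used by the data.tree R package. The parse tree is generated with the following code: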

options(java.parameters = "-Xmx2g")
library(NLP)
library(coreNLP)
#initCoreNLP() # change this if downloaded to non-standard location
initCoreNLP(annotators = "tokenize,ssplit,pos,lemma,parse")
## Some text.
s <- c("A rare black squirrel has become a regular visitor to a suburban garden.")
s <- as.String(s)


anno<-annotateString(s)
parse_tree <- getParse(anno)
parse_tree

The output parse tree is as follows:
> parse_tree
[1] "(ROOT\r\n  (S\r\n    (NP (DT A) (JJ rare) (JJ black) (NN squirrel))\r\n    (VP (VBZ has)\r\n      (VP (VBN become)\r\n        (NP (DT a) (JJ regular) (NN visitor))\r\n        (PP (TO to)\r\n          (NP (DT a) (JJ suburban) (NN garden)))))\r\n    (. .)))\r\n\r\n"
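The escaped `\r\n` sequences in the printed string make the nesting hard to see. As a minimal base-R sketch (the helper `count_char` is a hypothetical name introduced here, not part of coreNLP), `cat()` prints the string with real line breaks, and a bracket count gives a quick check that the string is well formed before it is parsed:

```r
# The parse string exactly as printed above.
parse_tree <- "(ROOT\r\n  (S\r\n    (NP (DT A) (JJ rare) (JJ black) (NN squirrel))\r\n    (VP (VBZ has)\r\n      (VP (VBN become)\r\n        (NP (DT a) (JJ regular) (NN visitor))\r\n        (PP (TO to)\r\n          (NP (DT a) (JJ suburban) (NN garden)))))\r\n    (. .)))\r\n\r\n"

# cat() interprets the escapes, so the indentation becomes visible.
cat(parse_tree)

# Hypothetical helper: count occurrences of a literal character in a string.
count_char <- function(x, ch) {
  lengths(regmatches(x, gregexpr(ch, x, fixed = TRUE)))
}

# A well-formed bracketed parse has as many "(" as ")".
n_open  <- count_char(parse_tree, "(")
n_close <- count_char(parse_tree, ")")
n_open == n_close
```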

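I found the post "Visualize Parse Tree Structure". It converts a parse tree generated by the openNLP package into a tree format. However, that parse tree differs from the one coreNLP produces, and the solution does not convert to the data.tree format I want.

Edit: by adding the two lines below, we can use the function provided in the post "Visualize Parse Tree Structure":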

# this step modifies coreNLP parse tree to mimic openNLP parse tree
parse_tree <- gsub("[\r\n]", "", parse_tree)
parse_tree <- gsub("ROOT", "TOP", parse_tree)

library(igraph)
library(NLP)

# parse2graph() is the function defined in the linked post
parse2graph(parse_tree,  # plus optional graphing parameters
            title = sprintf("'%s'", s), margin=-0.05,
            vertex.color=NA, vertex.frame.color=NA,
            vertex.label.font=2, vertex.label.cex=1.5, asp=0.5,
            edge.width=1.5, edge.color='black', edge.arrow.size=0)

But what I want is to convert the parse tree into the data.tree format provided by the data.tree package.

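【Comments】:

I'm not familiar with flex/bison. I did a Google search but found no links connecting R and CoreNLP with Bison. Your answer to the other question on openNLP was very helpful. Thanks for your help.
I'm new here, so I can't tell whether the question has been voted for closing. If anyone has suggestions for improving the question, I will gladly accept them. I'm really interested in an answer, because I have spent countless hours on this problem.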

Tags: r tree stanford-nlp

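【Solution 1】:

Once you have the edge list, converting to data.tree is straightforward. Replace only the last part of the parse2graph function, and move the styling out of the function: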

parse2tree <- function(ptext) {
  stopifnot(require(NLP) && require(data.tree))  # Tree_parse() and FromDataFrameNetwork()

  ## Replace words with unique versions
  ms <- gregexpr("[^() ]+", ptext)                                      # match every run of non-bracket, non-space characters
  words <- regmatches(ptext, ms)[[1]]                                   # just words
  regmatches(ptext, ms) <- list(paste0(words, seq.int(length(words))))  # add id to words

  ## Going to construct an edgelist and pass that to data.tree
  ## allocate here since we know the size (number of nodes - 1) and -1 more to exclude 'TOP'
  edgelist <- matrix('', nrow=length(words)-2, ncol=2)

  ## Function to fill in edgelist in place
  edgemaker <- (function() {
    i <- 0                                       # row counter
    g <- function(node) {                        # the recursive function
      if (inherits(node, "Tree")) {            # only recurse subtrees
        if ((val <- node$value) != 'TOP1') { # skip 'TOP' node (added '1' above)
          for (child in node$children) {
            childval <- if(inherits(child, "Tree")) child$value else child
            i <<- i+1
            edgelist[i,1:2] <<- c(val, childval)
          }
        }
        invisible(lapply(node$children, g))
      }
    }
  })()

  ## Create the edgelist from the parse tree
  edgemaker(Tree_parse(ptext))
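  ## Build the data.tree structure from the (parent, child) edge list;
  ## keep the labels as character strings rather than factors
  tree <- FromDataFrameNetwork(as.data.frame(edgelist, stringsAsFactors = FALSE))
  return(tree)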
}
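
The "unique versions" step at the top of the function can be tried in isolation. This is a small base-R sketch on the subject noun phrase from the question, not a separate method: every tag and word gets its position appended, so the two `JJ` tags become different node names and do not collapse into one node in the edge list.

```r
# Subject noun phrase taken from the question's parse tree.
ptext <- "(NP (DT A) (JJ rare) (JJ black) (NN squirrel))"

# Match every run of characters that is not a bracket or a space.
ms <- gregexpr("[^() ]+", ptext)
words <- regmatches(ptext, ms)[[1]]

# Write the matches back with a running index appended to each one.
regmatches(ptext, ms) <- list(paste0(words, seq_along(words)))

ptext
# [1] "(NP1 (DT2 A3) (JJ4 rare5) (JJ6 black7) (NN8 squirrel9))"
```

The suffixes stay in the node names of the resulting tree; they can be stripped afterwards for display if needed.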


parse_tree <- "(ROOT\r\n  (S\r\n    (NP (DT A) (JJ rare) (JJ black) (NN squirrel))\r\n    (VP (VBZ has)\r\n      (VP (VBN become)\r\n        (NP (DT a) (JJ regular) (NN visitor))\r\n        (PP (TO to)\r\n          (NP (DT a) (JJ suburban) (NN garden)))))\r\n    (. .)))\r\n\r\n"
parse_tree <- gsub("[\r\n]", "", parse_tree)
parse_tree <- gsub("ROOT", "TOP", parse_tree)

library(data.tree)

tree <- parse2tree(parse_tree)
tree
SetNodeStyle(tree, style = "filled,rounded", shape = "box", fillcolor = "GreenYellow")
plot(tree)
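
To see the edge-list step on its own, here is a minimal sketch that skips the parsing and feeds a hand-written edge list to `FromDataFrameNetwork()`. The data frame `edges` and its column names `from` and `to` are made up for illustration; the rows describe only the subject noun phrase, with the same numeric suffixes the function above would produce.

```r
library(data.tree)

# Hand-written (parent, child) pairs for the subject noun phrase only.
# The numeric suffixes keep repeated tags such as JJ distinct.
edges <- data.frame(
  from = c("NP1", "NP1", "NP1", "NP1", "DT2", "JJ4", "JJ6", "NN8"),
  to   = c("DT2", "JJ4", "JJ6", "NN8", "A3", "rare5", "black7", "squirrel9"),
  stringsAsFactors = FALSE
)

# The first two columns are read as parent and child; the root is the
# one node that never appears as a child (here NP1).
np <- FromDataFrameNetwork(edges)

print(np)
np$leafCount   # number of leaf nodes (the words)
np$totalCount  # number of nodes in the whole tree
```

Any source that can produce such a two-column parent/child table can be turned into a data.tree structure in the same way.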

【Comments】: