parLapply and part-of-speech tagging: answers

【Question title】: parLapply and Part of Speech tagging
【Posted】: 2018-09-10 03:52:00
【Question】:

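I am trying to part-of-speech tag a corpus of ~600k documents using parLapply and the openNLP R package. Although I was able to tag a different set of ~90k documents successfully, running the same code on the ~600k documents fails after about 25 minutes with a strange error: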

Error in checkForRemoteErrors(val) : 10 nodes produced errors; first error: no word token annotations found

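The documents are simply digital newspaper articles, and I run the tagger on the body field (after cleaning). This field is just raw text that I save into a list of strings.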

Here is my code:

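# I set the Java heap size (memory) allocation - I experimented with different sizes.
# The option has to be set before rJava is loaded; the flag has no space and takes a "g" suffix.
options(java.parameters = "-Xmx3g")
library(parallel)
library(NLP)
library(openNLP)
# Convert the corpus into a list of strings
myCorpus <- lapply(contentCleaned, as.String)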

# Function that tags a single document
tagCorpus <- function(x, ...){
    s <- as.String(x) # This is a repeat and may not be required
    WTA <- Maxent_Word_Token_Annotator()
    # Treat the whole document as one sentence spanning all of its characters
    a2 <- Annotation(1L, "sentence", 1L, nchar(s))
    a2 <- annotate(s, WTA, a2) # add word token annotations
    a3 <- annotate(s, PTA, a2) # add POS tags; PTA is created on each worker below
    word_subset <- a3[a3$type == "word"]
    POStags <- unlist(lapply(word_subset$features, `[[`, "POS"))
    POStagged <- paste(sprintf("%s/%s", s[word_subset], POStags), collapse = " ")
    list(text = s, POStagged = POStagged, POStags = POStags, words = s[word_subset])
}

# I have 12 cores in my box
cl <- makeCluster(mc <- getOption("cl.cores", detectCores()-2))

# I tried both exporting the word token annotator and not
clusterEvalQ(cl, {
    library(openNLP)
    library(NLP)
    PTA <- Maxent_POS_Tag_Annotator()
    WTA <- Maxent_Word_Token_Annotator()
})

# Each cluster node has the following description:
[[1]]
An annotator inheriting from classes
    Simple_Word_Token_Annotator Annotator
    with description
    Computes word token annotations using the Apache OpenNLP Maxent tokenizer employing the default model for language 'en'.

clusterEvalQ(cl, sessionInfo())

# ClusterEvalQ outputs for each worker:

[[1]]
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
  [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                    LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8       
  [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8       LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8          
  [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8      LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] NLP_0.1-11    openNLP_0.2-6

loaded via a namespace (and not attached):
[1] openNLPdata_1.5.3-4 compiler_3.4.4      parallel_3.4.4      rJava_0.9-10    

packageDescription('openNLP') # Version: 0.2-6
packageDescription('parallel') # Version: 3.4.4

startTime <- Sys.time()
print(startTime)
corpus.tagged <- parLapply(cl, myCorpus, tagCorpus)
endTime <- Sys.time()
print(endTime)
endTime - startTime

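Note that I have consulted a number of web forums, most notably: parallel parLapply setup

However, that does not seem to solve my problem. I am also puzzled as to why this setup works for ~90k articles but not for ~600k articles (I have 12 cores and 64GB of RAM in total). Any suggestions would be greatly appreciated.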
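One thing worth ruling out, offered as a guess rather than a confirmed diagnosis: as far as I know, "no word token annotations found" is what the POS tagger reports when it receives a document in which the tokenizer found no words, which can happen when cleaning leaves an element empty or whitespace-only. A larger corpus is simply more likely to contain a few such elements. The sketch below uses base R only; `find_untaggable` is a hypothetical helper name, and `docs` is a toy stand-in for `contentCleaned`.

```r
# Hypothetical helper: return the indices of documents that are NA,
# not a single string, or contain nothing but whitespace.
find_untaggable <- function(docs) {
  txt <- vapply(docs, function(d) {
    d <- as.character(d)
    # Anything that is not exactly one non-missing string counts as empty
    if (length(d) != 1L || is.na(d)) "" else d
  }, character(1), USE.NAMES = FALSE)
  which(!nzchar(trimws(txt)))
}

# Toy stand-in for the cleaned corpus
docs <- list("The cat sat.", "", "   \n", NA, "42 is a number")

bad <- find_untaggable(docs)
print(bad)

# Keep only the documents that have some content to tag
docs_ok <- docs[setdiff(seq_along(docs), bad)]
```

If `bad` comes back non-empty on the real corpus, dropping or repairing those elements before calling parLapply would be the first thing to try.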

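【Comments on the question】:

Tags: r parallel-processing

【Solution 1】:

I managed to get this working by using Tyler Rinker's qdap package (https://github.com/trinker/qdap) directly. It took about 20 hours to run. Here is how the pos function from the qdap package does it in a one-liner: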

corpus.tagged <- qdap::pos(myCorpus, parallel = TRUE, cores = detectCores() - 2)
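For anyone who would rather stay with parLapply, one option is to wrap the per-document function so that a single failing document is recorded instead of aborting the whole run through checkForRemoteErrors. This is a minimal sketch under stated assumptions, not a tested replacement for the qdap call: `make_safe` and `toy_tagger` are hypothetical names, and `toy_tagger` is only a stand-in for `tagCorpus` that raises the same error message on empty input.

```r
library(parallel)

# Hypothetical wrapper: returns a version of `fun` that never throws.
# A failure comes back as a list whose `error` field holds the message.
make_safe <- function(fun) {
  force(fun)
  function(x) {
    tryCatch(
      list(result = fun(x), error = NULL),
      error = function(e) list(result = NULL, error = conditionMessage(e))
    )
  }
}

# Stand-in for tagCorpus (assumption): splits on whitespace and fails
# on documents without any words, mimicking the error in the question.
toy_tagger <- function(x) {
  words <- strsplit(trimws(x), "[[:space:]]+")[[1]]
  if (!length(words) || !nzchar(words[1])) {
    stop("no word token annotations found")
  }
  words
}

docs <- list("one two", "", "three")

cl <- makeCluster(2)
res <- parLapply(cl, docs, make_safe(toy_tagger))
stopCluster(cl)

# Indices of the documents that failed, for later inspection
failed <- which(!vapply(res, function(r) is.null(r$error), logical(1)))
print(failed)
```

With the real tagger, the annotators would still need to be created on each worker with clusterEvalQ as in the question; the wrapper only changes how errors are reported.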

【Discussion】: