OutOfMemoryError with Mallet CRF classifier

【Question title】: OutOfMemoryError with Mallet CRF classifier
【Posted】: 2015-11-19 17:55:03
【Question】:

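The classifier frequently fails with an OutOfMemoryError. Please advise.

We have a UIMA pipeline that calls 5 model jars (based on Mallet CRF), each about 30 MB. -Xms is set to 2G and -Xmx is set to 4G.

Are there any guidelines or benchmarks for sizing the heap? Please also point out any guidelines for multi-threaded environments.

I did try applying the patch from https://code.google.com/p/cleartk/issues/detail?id=408, but it did not solve the problem.

The heap dump shows that 42% of the heap is char[] and 15% is String.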

java.lang.OutOfMemoryError: Java heap space
    at cc.mallet.types.IndexedSparseVector.setIndex2Location(IndexedSparseVector.java:109)
    at cc.mallet.types.IndexedSparseVector.dotProduct(IndexedSparseVector.java:157)
    at cc.mallet.fst.CRF$TransitionIterator.<init>(CRF.java:1856)
    at cc.mallet.fst.CRF$TransitionIterator.<init>(CRF.java:1835)
    at cc.mallet.fst.CRF$State.transitionIterator(CRF.java:1776)
    at cc.mallet.fst.MaxLatticeDefault.<init>(MaxLatticeDefault.java:252)
    at cc.mallet.fst.MaxLatticeDefault.<init>(MaxLatticeDefault.java:197)
    at cc.mallet.fst.MaxLatticeDefault$Factory.newMaxLattice(MaxLatticeDefault.java:494)
    at cc.mallet.fst.MaxLatticeFactory.newMaxLattice(MaxLatticeFactory.java:11)
    at cc.mallet.fst.Transducer.transduce(Transducer.java:124)
    at org.cleartk.ml.mallet.MalletCrfStringOutcomeClassifier.classify(MalletCrfStringOutcomeClassifier.java:90)

The models were created with MalletCrfStringOutcomeDataWriter.

AnalysisEngineFactory.createEngineDescription(DataChunkAnnotator.class,
        CleartkSequenceAnnotator.PARAM_IS_TRAINING, true, DirectoryDataWriterFactory.PARAM_OUTPUT_DIRECTORY,
        options.getModelsDirectory(), DefaultSequenceDataWriterFactory.PARAM_DATA_WRITER_CLASS_NAME, MalletCrfStringOutcomeDataWriter.class)

The annotator code looks like this.

if (this.isTraining()) {
    List<DataAnnotation> namedEntityMentions = JCasUtil.selectCovered(jCas, DataAnnotation.class, sentence);
    List<String> outcomes = this.chunking.createOutcomes(jCas, tokens, namedEntityMentions);
    this.dataWriter.write(Instances.toInstances(outcomes, featureLists));
} else {
    List<String> outcomes = this.classifier.classify(featureLists);
    this.chunking.createChunks(jCas, tokens, outcomes);
}

Thanks

【Comments】:

Tags: java out-of-memory mallet

【Solution 1】:

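You could try the following:

Increase -Xmx.
Analyse the heap in more depth. Every String is backed by a char[], so figures such as 42% and 15% say little on their own; you need to find out which part of the program allocates those strings.
The error appears to be triggered at this line:
List<String> outcomes = this.classifier.classify(featureLists);
so you could start there. Find out what featureLists contains, how large it is, and so on, then look at what the classify method does and whether you can help it use memory more efficiently. For example, reduce the use of String and build values with StringBuilder and append instead (just as an example).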
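
One way to start that investigation is to log how large featureLists is, and how much heap is in use, just before classify is called. The following is only a minimal sketch: ClassifyMemoryProbe and its methods are hypothetical names, not part of Mallet or ClearTK, and plain strings stand in for the real feature objects so that the example runs on its own.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical diagnostic helper; not part of Mallet or ClearTK. */
class ClassifyMemoryProbe {

    /** Heap currently in use, in bytes (approximate; no GC is forced). */
    public static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Maximum heap the JVM will attempt to use, i.e. the effective -Xmx. */
    public static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** Summarises per-token feature lists: token count, total features, largest list. */
    public static String describe(List<? extends List<?>> featureLists) {
        long total = 0;
        int largest = 0;
        for (List<?> features : featureLists) {
            total += features.size();
            largest = Math.max(largest, features.size());
        }
        return "tokens=" + featureLists.size()
                + " features=" + total
                + " maxPerToken=" + largest;
    }

    public static void main(String[] args) {
        // Stand-in data (assumption): the real annotator would pass its own featureLists.
        List<List<String>> featureLists = Arrays.asList(
                Arrays.asList("word=Mallet", "isCapitalized"),
                Arrays.asList("word=crf"));

        long mb = 1024L * 1024L;
        System.out.println(describe(featureLists));
        System.out.println("used=" + usedHeapBytes() / mb + "MB"
                + " max=" + maxHeapBytes() / mb + "MB");
    }
}
```

In the annotator, the two print statements would go immediately before this.classifier.classify(featureLists). If the failures line up with unusually long sentences or very large per-token feature lists, that points at the input size rather than at the heap setting.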

【Discussion】: