Why is OpenNLP POSTaggerME so slow? Answers

【Question title】: Why is OpenNLP POSTaggerME so slow?
【Posted】: 2014-07-16 11:34:05
【Question description】:

This question has been asked here twice without receiving any answers. I will try to provide more information.

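Problem: I decided to rewrite a part-of-speech (POS) tagger in Java, assuming it would be much faster than the POS tagger I had written in Python. For this I chose the OpenNLP POSTaggerME tagger. However, after running POSTaggerME on several text files, I concluded that it is much slower than the less accurate tagger I was using in Python. For example, tagging "Alice in Wonderland" takes 3 minutes on a laptop with an Intel 987 1.5GHz CPU and 4GB RAM, and 74s on an office machine with a 3.3GHz Core i5 and 16GB RAM. The NLTK unigram POS tagger needs less than a second.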

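Question: Since I am only just learning Java, I suspect that my code is not optimized and that this is what causes the slowdown. It could of course be that POSTaggerME itself is this slow, but I very much doubt it.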

Could you tell me whether the code below has any problems that could be slowing down POS tagging?

Below are the classes that I think may be causing the performance drop. The complete Maven project is on GitHub: https://github.com/tastyminerals/POS-search-tool.git

Main class

imports (...)

public class MainApp {
public static void main(String[] args) {
    // Speed benchmark
    long start_time = System.currentTimeMillis();

    String file = "test/Alice_in_Wonderland.docx";
    Pair<String, ArrayList<String>> data = null;
    String sents[] = null;
    FileService fs = new FileService();

    /*
     * FileService returns a tuple with file textual data and an ArrayList
     * of file meta data
     */
    try {
        data = fs.getFileData(file);
    } catch (IOException | SAXException | TikaException e) {
        e.printStackTrace();
    }

    // Detecting sentences in data
    try {

        sents = SentDetection.getSents(data.getValue0());
    } catch (IOException e) {
        e.printStackTrace();
    }

    long end_time1 = System.currentTimeMillis();
    long difference = (end_time1 - start_time);
    System.out.println("SentDetection time: " + difference);

    // Tokenizing extracted sentences
    String[] ts = null;
    String[] tgs = null;
    try {
        //Loading model outside of POSTagging class to save resources
        POSModel model = new POSModelLoader().load(new File(
                "resources/models/pos/en-pos-maxent.bin"));

        for (String s: sents) {
            ts = Tokenizing.tokenize(s);
            tgs = POSTagging.tag(s, ts, model);


    //Printing the results          
//              int i = 0;
//              for (String t: ts) {
//                  System.out.print(t + "_" + tgs[i] + " ");
//                  i += 1;
//              }
//              System.out.println("");
            }


    } catch (IOException e) {
        e.printStackTrace();
    }

    // Speed benchmark
    long end_time3 = System.currentTimeMillis();
    long difference3 = (end_time3 - start_time) / 1000;
    System.out.println("POSTagging time: " + difference3 + "s");

   }
}

Tokenizer class

imports (...)
public class Tokenizing {
    public static String[] tokenize(String sentence)
            throws InvalidFormatException, IOException {
        // Load the corresponding tokenizer model
        InputStream is = new FileInputStream(
                "resources/models/token-detection/en-token.bin");
        TokenizerModel tmodel = new TokenizerModel(is);

        // Instantiate TokenizerME with a trained model and tokenize string
        Tokenizer tokenizer = new TokenizerME(tmodel);
        String tokens[] = tokenizer.tokenize(sentence);
        is.close();

        return tokens;
    }
}

POSTagger class

imports (...)
public class POSTagging {
    public static String[] tag(String sentence, String[] tokenizedSent,
            POSModel model) throws InvalidFormatException, IOException {
        // PerformanceMonitor perfMon = new PerformanceMonitor(System.err,
        // "sent");

        POSTaggerME tagger = new POSTaggerME(model);

        String[] taggedSent = tagger.tag(tokenizedSent);

        // System.out.println(Arrays.toString(taggedSent));
        // System.out.println(Arrays.toString(tokenizedSent));
        return taggedSent;
    }
}

【Discussion】:

Tags: java performance opennlp pos-tagger

【Solution 1】:

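Your test code is timing how long it takes to load the models as well as the time spent actually applying them to the text. Worse, you reload the tokenizer model once for every sentence, rather than loading it once up front and then applying it many times.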
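To make the difference concrete, here is a minimal sketch of the two loop shapes. It deliberately does not use OpenNLP: `FakeModel` is a hypothetical stand-in for something like `TokenizerModel`, and its `load()` only counts calls instead of reading a `.bin` file, so treat it as an illustration of the pattern rather than a drop-in fix.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch: reloading a model per sentence versus loading it once. */
final class LoadOnceSketch {

    /** Hypothetical stand-in for an expensive model; not an OpenNLP class. */
    static final class FakeModel {
        static int loadCount = 0;

        static FakeModel load() {
            loadCount++; // real code would open and parse a model file here
            return new FakeModel();
        }

        String[] tokenize(String sentence) {
            return sentence.trim().split("\\s+");
        }
    }

    /** Shape of the code in the question: one model load per sentence. */
    static int tokenizeReloading(List<String> sentences) {
        int tokens = 0;
        for (String s : sentences) {
            FakeModel model = FakeModel.load(); // repeated for every sentence
            tokens += model.tokenize(s).length;
        }
        return tokens;
    }

    /** Suggested shape: load once before the loop, reuse inside it. */
    static int tokenizeLoadOnce(List<String> sentences) {
        FakeModel model = FakeModel.load(); // happens exactly once
        int tokens = 0;
        for (String s : sentences) {
            tokens += model.tokenize(s).length;
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList("Alice was beginning", "to get very tired");

        FakeModel.loadCount = 0;
        tokenizeReloading(sentences);
        System.out.println("loads when reloading per sentence: " + FakeModel.loadCount);

        FakeModel.loadCount = 0;
        tokenizeLoadOnce(sentences);
        System.out.println("loads when loading once: " + FakeModel.loadCount);
    }
}
```

Both methods return the same token count; only the number of loads differs, and with a real model file each load means reading and parsing that file again.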

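If you want reliable measurements, you need to refactor the code to load all the models first, before you start timing, then run the sequence hundreds or thousands of times and take the average.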
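A rough sketch of that measurement procedure, using only the Java standard library. The names `averageMillis`, `loader` and `work` are made up for this example, and the untimed warm-up runs are an extra assumption on my part (they give the JIT a chance to compile the hot path); for serious numbers a dedicated benchmarking harness would be a better choice.

```java
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Sketch: keep setup out of the timed region and average over many runs. */
final class AverageTimingSketch {

    /**
     * Loads the model once (untimed), runs some untimed warm-up iterations,
     * then times `runs` iterations and returns the mean time per run in ms.
     */
    static <M> double averageMillis(Supplier<M> loader, Consumer<M> work,
                                    int warmups, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        M model = loader.get(); // setup happens before the clock starts

        for (int i = 0; i < warmups; i++) {
            work.accept(model); // warm-up, not measured
        }

        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            work.accept(model);
        }
        long elapsedNanos = System.nanoTime() - start;

        return elapsedNanos / 1_000_000.0 / runs;
    }

    public static void main(String[] args) {
        // Placeholder workload: splitting a string stands in for tagging a text.
        String text = "Alice was beginning to get very tired of sitting by her sister";
        double avg = averageMillis(
                () -> "\\s+",                 // pretend this is the loaded model
                pattern -> text.split(pattern),
                100,
                1000);
        System.out.println("average per run (ms): " + avg);
    }
}
```

In the question's program, `loader` would load the tokenizer and POS models and `work` would tokenize and tag the whole document, so that model loading no longer shows up in the reported time.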

【Comments】:

If I move the "en-token.bin" model to the MainApp class and load it only once, POS tagging speed is back to what I expected.