【Posted】: 2014-07-16 11:34:05
【Problem description】:
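This question has been asked here twice before without receiving any answers. I will try to provide more information.
The problem: I decided to rewrite a part-of-speech (POS) tagger in Java, expecting it to be much faster than the POS tagger I wrote in Python. For this I chose the OpenNLP POSTaggerME tagger. However, after running POSTaggerME on several text files, I concluded that it is much slower than the less accurate tagger I use in Python. For example, tagging "Alice in Wonderland" takes 3 minutes on an Intel 987 1.5 GHz laptop with 4 GB RAM, and 74 s on an office Core i5 3.3 GHz machine with 16 GB RAM. The NLTK unigram POS tagger needs less than a second.
The question: since I am only learning Java, I suspect that my code is not optimized and that this may be what causes the slowdown. It could of course be that POSTaggerME itself is simply slow, but I doubt that very much.
Could you tell me whether the code below has any problems that could slow down the POS tagging?
Below are the classes that I think may be causing the performance drop. The complete Maven project is on GitHub: https://github.com/tastyminerals/POS-search-tool.git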
Main class
imports (...)
public class MainApp {
    public static void main(String[] args) {
        // Speed benchmark
        long start_time = System.currentTimeMillis();
        String file = "test/Alice_in_Wonderland.docx";
        Pair<String, ArrayList<String>> data = null;
        String sents[] = null;
        FileService fs = new FileService();
        /*
         * FileService returns a tuple with file textual data and an ArrayList
         * of file meta data
         */
        try {
            data = fs.getFileData(file);
        } catch (IOException | SAXException | TikaException e) {
            e.printStackTrace();
        }
        // Detecting sentences in data
        try {
            sents = SentDetection.getSents(data.getValue0());
        } catch (IOException e) {
            e.printStackTrace();
        }
        long end_time1 = System.currentTimeMillis();
        long difference = (end_time1 - start_time);
        System.out.println("SentDetection time: " + difference);
        // Tokenizing extracted sentences
        String[] ts = null;
        String[] tgs = null;
        try {
            // Loading model outside of POSTagging class to save resources
            POSModel model = new POSModelLoader().load(new File(
                    "resources/models/pos/en-pos-maxent.bin"));
            for (String s : sents) {
                ts = Tokenizing.tokenize(s);
                tgs = POSTagging.tag(s, ts, model);
                // Printing the results
                // int i = 0;
                // for (String t : ts) {
                //     System.out.print(t + "_" + tgs[i] + " ");
                //     i += 1;
                // }
                // System.out.println("");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        // Speed benchmark
        long end_time3 = System.currentTimeMillis();
        long difference3 = (end_time3 - start_time) / 1000;
        System.out.println("POSTagging time: " + difference3 + "s");
    }
}
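Note that both durations printed by MainApp are measured from the same start_time, so the "POSTagging time" figure also includes file extraction and sentence detection. To see which stage is actually slow, each stage could be timed on its own. The following is only a minimal sketch of that idea: StageTimer and StageTimerDemo are hypothetical names that are not part of the project, and plain string splitting stands in for the real sentence detector and tokenizer.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical helper (not part of the original project): accumulates the
// elapsed time of each named stage separately.
final class StageTimer {
    private final Map<String, Long> nanosByStage = new LinkedHashMap<>();

    // Runs the stage, adds its elapsed time to the stage total, returns its result.
    <T> T time(String stage, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            nanosByStage.merge(stage, System.nanoTime() - start, Long::sum);
        }
    }

    // Total time recorded for a stage in whole milliseconds (0 if never run).
    long millis(String stage) {
        return nanosByStage.getOrDefault(stage, 0L) / 1_000_000;
    }

    Map<String, Long> nanosByStage() {
        return nanosByStage;
    }
}

class StageTimerDemo {
    public static void main(String[] args) {
        StageTimer timer = new StageTimer();
        String text = "Alice was beginning to get very tired. She sat on the bank.";

        // Stand-in for sentence detection (assumption: the real code would
        // call SentDetection.getSents here).
        String[] sents = timer.time("sentences", () -> text.split("(?<=\\.)\\s+"));

        for (String s : sents) {
            // Stand-in for tokenizing; repeated calls add up under one stage name.
            timer.time("tokenize", () -> s.split("\\s+"));
        }

        for (Map.Entry<String, Long> e : timer.nanosByStage().entrySet()) {
            System.out.printf("%s: %.3f ms%n", e.getKey(), e.getValue() / 1_000_000.0);
        }
    }
}
```

Because the per-sentence calls accumulate under one stage name, the totals for tokenizing and tagging can be compared directly instead of being folded into a single number.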
Tokenizer class
imports (...)
public class Tokenizing {
    public static String[] tokenize(String sentence)
            throws InvalidFormatException, IOException {
        // Load the corresponding tokenizer model
        InputStream is = new FileInputStream(
                "resources/models/token-detection/en-token.bin");
        TokenizerModel tmodel = new TokenizerModel(is);
        // Instantiate TokenizerME with a trained model and tokenize string
        Tokenizer tokenizer = new TokenizerME(tmodel);
        String tokens[] = tokenizer.tokenize(sentence);
        is.close();
        return tokens;
    }
}
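One thing that stands out in the class above is that en-token.bin is opened and parsed on every call to tokenize, that is, once per sentence. Whether this accounts for the slowdown has not been measured here, but the usual alternative is to load a model once and reuse it. The sketch below only illustrates that pattern under stated assumptions: FakeModel, PerCallTokenizer, and ReusedTokenizer are hypothetical stand-ins, not OpenNLP classes, and the load counter replaces real file parsing.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a model that is costly to load from disk.
final class FakeModel {
    static final AtomicInteger LOADS = new AtomicInteger();

    static FakeModel load() {
        LOADS.incrementAndGet(); // a real loader would read and parse a file here
        return new FakeModel();
    }

    String[] tokenize(String sentence) {
        return sentence.trim().split("\\s+");
    }
}

// Mirrors the structure in question: the model is loaded inside every call.
final class PerCallTokenizer {
    static String[] tokenize(String sentence) {
        return FakeModel.load().tokenize(sentence);
    }
}

// Alternative: the model is loaded once when the object is created.
final class ReusedTokenizer {
    private final FakeModel model = FakeModel.load();

    String[] tokenize(String sentence) {
        return model.tokenize(sentence);
    }
}

class ModelReuseDemo {
    public static void main(String[] args) {
        String[] sents = {
            "Alice was tired .", "She sat on the bank .", "Down the rabbit hole ."
        };

        for (String s : sents) {
            PerCallTokenizer.tokenize(s);
        }
        int perCall = FakeModel.LOADS.getAndSet(0);

        ReusedTokenizer tokenizer = new ReusedTokenizer();
        for (String s : sents) {
            tokenizer.tokenize(s);
        }
        int reused = FakeModel.LOADS.get();

        // Prints: loads per call: 3, loads when reused: 1
        System.out.println("loads per call: " + perCall + ", loads when reused: " + reused);
    }
}
```

With a real model the number of loads grows with the number of sentences in the first variant and stays at one in the second, which is why the position of the loading code is worth checking before blaming the tagger itself.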
POSTagger class
imports (...)
public class POSTagging {
    public static String[] tag(String sentence, String[] tokenizedSent,
            POSModel model) throws InvalidFormatException, IOException {
        // PerformanceMonitor perfMon = new PerformanceMonitor(System.err,
        // "sent");
        POSTaggerME tagger = new POSTaggerME(model);
        String[] taggedSent = tagger.tag(tokenizedSent);
        // System.out.println(Arrays.toString(taggedSent));
        // System.out.println(Arrays.toString(tokenizedSent));
        return taggedSent;
    }
}
【Discussion】:
Tags: java performance opennlp pos-tagger