【Question title】: OpenNLP classifier output
【Posted】: 2018-05-18 01:17:39
【Question】:

I am currently training a classifier model with the following code:

    final String iterations = "1000";
    final String cutoff = "0";
    InputStreamFactory dataIn = new MarkableFileInputStreamFactory(new File("src/main/resources/trainingSets/classifierA.txt"));
    ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
    ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

    TrainingParameters params = new TrainingParameters();
    params.put(TrainingParameters.ITERATIONS_PARAM, iterations);
    params.put(TrainingParameters.CUTOFF_PARAM, cutoff);
    params.put(AbstractTrainer.ALGORITHM_PARAM, NaiveBayesTrainer.NAIVE_BAYES_VALUE);

    DoccatModel model = DocumentCategorizerME.train("NL", sampleStream, params, new DoccatFactory());

    // try-with-resources closes the stream even if serialize() throws
    try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream("src/main/resources/models/model.bin"))) {
        model.serialize(modelOut);
    }

    return model;

Everything works fine, and after every run I get the following output:

    Indexing events with TwoPass using cutoff of 0

        Computing event counts...  done. 1474 events
        Indexing...  done.
    Collecting events... Done indexing in 0,03 s.
    Incorporating indexed data for training...
    done.
        Number of Event Tokens: 1474
            Number of Outcomes: 2
          Number of Predicates: 4149
    Computing model parameters...
    Stats: (998/1474) 0.6770691994572592
    ...done.

Can someone explain what this output means, and whether it says anything about accuracy?

【Comments】:

    Tags: java text machine-learning opennlp categorization


    【Answer 1】:

    Looking at the source, we can see that this output is produced by the NaiveBayesTrainer::trainModel method:

    public AbstractModel trainModel(DataIndexer di) {
        // ...
        display("done.\n");
        display("\tNumber of Event Tokens: " + numUniqueEvents + "\n");
        display("\t    Number of Outcomes: " + numOutcomes + "\n");
        display("\t  Number of Predicates: " + numPreds + "\n");
        display("Computing model parameters...\n");
        MutableContext[] finalParameters = findParameters();
        display("...done.\n");
        // ...
    }
    

    If you look at the findParameters() code, you will notice that it calls the trainingStats() method, which contains the snippet that computes the accuracy:

    private double trainingStats(EvalParameters evalParams) {
        // ...
        double trainingAccuracy = (double) numCorrect / numEvents;
        display("Stats: (" + numCorrect + "/" + numEvents + ") " + trainingAccuracy + "\n");
        return trainingAccuracy;
    }
    

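    TL;DR: the Stats: (998/1474) 0.6770691994572592 part of the output is the accuracy you are looking for. As the variable name trainingAccuracy indicates, it is measured on the training data itself.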
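As a quick check, the number on the Stats line is simply numCorrect divided by numEvents. The sketch below is plain Java and does not call OpenNLP; the class name StatsLine is made up for illustration, and the two counts are copied from the log in the question.

```java
public class StatsLine {
    // Same arithmetic as the trainingStats() snippet: correct predictions over total events.
    static double trainingAccuracy(int numCorrect, int numEvents) {
        return (double) numCorrect / numEvents;
    }

    public static void main(String[] args) {
        // Counts taken from the "Stats: (998/1474)" line of the training log.
        double accuracy = trainingAccuracy(998, 1474);
        System.out.println("Stats: (998/1474) " + accuracy);
    }
}
```

Because the events being scored are the same ones the model was trained on, this figure says how well the model fits the training set, not how it will do on unseen documents.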

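    【Comments】:

    • Thanks for the good answer, I have one more question. What is numCorrect based on? In this training set there are 998 samples labelled 2, and the rest are labelled 4. Why is numCorrect equal to the number of samples labelled 2?
    • @Patrick numCorrect is also computed in trainingStats(). Take a look at the source on GitHub.
    • @Patrick If you found this answer useful, don't forget to "accept" it by clicking the check mark below the up/down arrows. :-)
    • Yes, of course, I was just testing whether I understand it now.
    • DocumentCategorizerME::categorize returns the probability that docWords belongs to each category.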
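On the numCorrect question in the comments: one possible reading, not confirmed anywhere in this thread, is that the trained model assigns label 2 to every training event, in which case numCorrect would equal the number of events labelled 2. The sketch below is plain Java and does not call OpenNLP. It assumes a prediction is the category with the highest probability (the kind of per-category array categorize returns) and counts how often that matches the gold label. The class name, the probabilities and the ordering of the events are all hypothetical.

```java
import java.util.Arrays;

public class NumCorrectSketch {
    // Index of the largest probability, i.e. the predicted category.
    static int argMax(double[] probs) {
        int best = 0;
        for (int i = 1; i < probs.length; i++) {
            if (probs[i] > probs[best]) {
                best = i;
            }
        }
        return best;
    }

    // Counts the events whose predicted category index equals the gold index.
    static int numCorrect(double[][] probsPerEvent, int[] goldIndex) {
        int correct = 0;
        for (int e = 0; e < goldIndex.length; e++) {
            if (argMax(probsPerEvent[e]) == goldIndex[e]) {
                correct++;
            }
        }
        return correct;
    }

    public static void main(String[] args) {
        int numEvents = 1474;
        // Category index 0 stands for label "2", index 1 for label "4".
        int[] gold = new int[numEvents];
        Arrays.fill(gold, 998, numEvents, 1); // first 998 events are "2", the rest "4"

        // Hypothetical model output that always favours label "2".
        double[][] probs = new double[numEvents][];
        Arrays.fill(probs, new double[] {0.7, 0.3});

        System.out.println("numCorrect = " + numCorrect(probs, gold));
    }
}
```

If that reading holds, the training accuracy is no better than always guessing the majority label, which would be worth checking by looking at the per-category probabilities for a few documents of each label.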