【Question title】: How many lines and documents should there be in the training data for the OpenNLP categorizer?
【Posted】: 2015-07-22 00:15:56
【Question】:

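I am following the documentation for Apache OpenNLP. I was able to understand the sentence detector, the tokenizer, and the name finder, but I am stuck on the categorizer because I don't understand how to create a categorization model.

I understand that I need to create a file. The format is clear: each line holds a category, a space, and a document, all on a single line. The file is saved with the .train extension.

So I created the following file: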

Refund What is the refund status for my order #342 ?
NewOffers Are there any new offers for your products ?
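
To make the line format above concrete (a category, a space, then the whole document on one line), here is a minimal Java sketch that splits a training line at the first whitespace. The class and method names are hypothetical and not part of OpenNLP; it only illustrates the layout of the file, not how the trainer reads it.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

class DoccatLineSketch {
    // Splits one training line into (category, document text) at the first run of whitespace.
    // Hypothetical helper for illustration only; not an OpenNLP API.
    static Map.Entry<String, String> parse(String line) {
        String[] parts = line.trim().split("\\s+", 2);
        if (parts.length < 2) {
            throw new IllegalArgumentException("Expected '<category> <document>' but got: " + line);
        }
        return new SimpleEntry<>(parts[0], parts[1]);
    }

    public static void main(String[] args) {
        Map.Entry<String, String> sample =
                parse("Refund What is the refund status for my order #342 ?");
        // Prints the category, an arrow, and the document text.
        System.out.println(sample.getKey() + " -> " + sample.getValue());
    }
}
```

A line that contains only a category and no document text is rejected by this sketch with an IllegalArgumentException.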

Then I ran this command:

opennlp DoccatTrainer -model en-doccat.bin -lang en -data en-doccat.train -encoding UTF-8

It starts doing something and then returns an error. This is the output in the command prompt:

Indexing events using cutoff of 5

    Computing event counts...  done. 2 events
    Indexing...  Dropped event Refund:[bow=What, bow=is, bow=the, bow=refund, bow=status, bow=for, bow=my, bow=order, bow=#342, bow=?]
Dropped event NewOffers:[bow=Are, bow=there, bow=any, bow=new, bow=offers, bow=for, bow=your, bow=products, bow=?]
done.
Sorting and merging events... Done indexing.
Incorporating indexed data for training...  
Exception in thread "main" java.lang.NullPointerException
    at opennlp.maxent.GISTrainer.trainModel(GISTrainer.java:263)
    at opennlp.maxent.GIS.trainModel(GIS.java:256)
    at opennlp.model.TrainUtil.train(TrainUtil.java:184)
    at opennlp.tools.doccat.DocumentCategorizerME.train(DocumentCategorizerME.java:162)
    at opennlp.tools.cmdline.doccat.DoccatTrainerTool.run(DoccatTrainerTool.java:61)
    at opennlp.tools.cmdline.CLI.main(CLI.java:222)

I just cannot figure out why this throws a NullPointerException here. I also tried adding two more lines, with no luck:

Refund What is the refund status for my order #342 ?
NewOffers Are there any new offers for your products ?
Refund Can I place a refund request for electronics ?
NewOffers Is there any new offer on buying worth 5000 ?  

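I found this blog, but almost the same thing is done there too. When I tried his training file, it worked like a charm. What is wrong with my file, and how do I fix the error?

When I run opennlp DoccatTrainer on its own, it shows me the help text, so the path is not the problem. Any help is appreciated.

EDIT: I changed the file to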

Refund What is the refund status for my order #342 ? Can I place a refund request for clothes ?
NewOffers Are there any new offers for your products ? what are the offers on new products or new offers on old products?
Refund Can I place a refund request for electronics ?
NewOffers Is there any new offer on buying worth 5000 ? 

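That worked, so I thought it had something to do with the document (apparently it should be two sentences) and removed the last two lines.

Making it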

Refund What is the refund status for my order #342 ? Can I place a refund request for clothes ?
NewOffers Are there any new offers for your products ? what are the offers on new products or new offers on old products? 

But it failed again. So the question boils down to this: what kind of data, format, and documents does the trainer need?

Thanks

【Comments】:

    Tags: opennlp


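    【Solution 1】:

    You can use the -cutoff flag in the DoccatTrainer command to change the default value. In your case, you would add -cutoff 1 to set the minimum number of documents per category to 1.

    【Comments】:

    • -cutoff does not work with OpenNLP's DoccatTrainer.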
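    【Solution 2】:

    You have to add more than 5 samples for each category, because the default cutoff value is 5.

    Please refer to this blog post: http://madhawagunasekara.blogspot.com/2014/11/nlp-categorizer.html

    【Comments】:

    • 5 samples are needed in total (not 5 samples per category).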
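
    The "Dropped event" lines in the question's log list bag-of-words features (bow=...), which suggests the cutoff applies to how often each feature is seen rather than to the number of lines. The Java sketch below simulates that idea under a simplifying assumption: every whitespace-separated token after the category counts once per occurrence across all lines, and a line is dropped when none of its tokens reaches the cutoff. The class and method names are hypothetical; this is not OpenNLP's actual indexing code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CutoffSketch {
    // Counts every whitespace-separated token after the category label, across all lines.
    static Map<String, Integer> countFeatures(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            String[] tokens = line.trim().split("\\s+");
            for (int i = 1; i < tokens.length; i++) { // tokens[0] is the category
                counts.merge("bow=" + tokens[i], 1, Integer::sum);
            }
        }
        return counts;
    }

    // Returns the lines that keep at least one feature seen `cutoff` times or more.
    // Assumption: a line with no surviving feature is dropped, as in the question's log.
    static List<String> survivingLines(List<String> lines, int cutoff) {
        Map<String, Integer> counts = countFeatures(lines);
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            String[] tokens = line.trim().split("\\s+");
            for (int i = 1; i < tokens.length; i++) {
                if (counts.get("bow=" + tokens[i]) >= cutoff) {
                    kept.add(line);
                    break;
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> twoLines = Arrays.asList(
                "Refund What is the refund status for my order #342 ?",
                "NewOffers Are there any new offers for your products ?");
        // With a cutoff of 5 no token is frequent enough, so no line survives.
        System.out.println("cutoff 5: " + survivingLines(twoLines, 5).size() + " lines kept");
        // With a cutoff of 1 every token qualifies, so both lines survive.
        System.out.println("cutoff 1: " + survivingLines(twoLines, 1).size() + " lines kept");
    }
}
```

    Under this simplified model, the four-line file from the EDIT contains the standalone "?" token exactly five times, so all four lines survive a cutoff of 5, while the shortened two-line version has it only three times and keeps nothing. That is consistent with what the asker observed, though it may not be the whole explanation.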