【Question title】: (How) Can I use Bigram Features with the OpenNLP Document Classifier
【Posted】: 2013-08-01 15:58:15
【Question】:

(How) can I use bigram features with the OpenNLP Document Classifier?

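I have a set of very short documents (titles, phrases, and sentences), and I would like to add bigram features like those used by the LibShortText tool: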

http://www.csie.ntu.edu.tw/~cjlin/libshorttext/

Is this possible?

The documentation only explains how to use bigrams with the Name Finder, via

BigramNameFeatureGenerator()

and not with the Document Classifier.

【Question comments】:

    Tags: opennlp


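    【Answer 1】:

    I believe the trainer and the classifier both accept custom feature generators in their methods, but these must be implementations of FeatureGenerator, and BigramNameFeatureGenerator is not one. So I wrote a quick implementation as an inner class below. Try this (untested) code when you get a chance.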

        import java.io.FileInputStream;
        import java.io.IOException;
        import java.io.InputStream;
        import java.util.ArrayList;
        import java.util.Arrays;
        import java.util.Collection;
        import java.util.Collections;
        import java.util.List;
        import opennlp.tools.doccat.DoccatModel;
        import opennlp.tools.doccat.DocumentCategorizerME;
        import opennlp.tools.doccat.DocumentSample;
        import opennlp.tools.doccat.DocumentSampleStream;
        import opennlp.tools.doccat.FeatureGenerator;
        import opennlp.tools.util.ObjectStream;
        import opennlp.tools.util.PlainTextByLineStream;
    
    
    
        public class DoccatUsingBigram {
    
          public static void main(String[] args) throws IOException {
            InputStream dataIn = new FileInputStream(args[0]);
            try {
    
    
              ObjectStream<String> lineStream =
                      new PlainTextByLineStream(dataIn, "UTF-8");
              // Pass the custom feature generator when building (training) the model
              ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
              DoccatModel model = DocumentCategorizerME.train("en", sampleStream, 10, 100, new MyBigramFeatureGenerator());
    
    
              // Classify with the same feature generator that was used for training,
              // so that the features seen at classification time match the model's.
              DocumentCategorizerME classifier =
                      new DocumentCategorizerME(model, new MyBigramFeatureGenerator());
              String[] someData = "whatever you are trying to classify".split(" ");
              double[] categorize = classifier.categorize(someData);
    
    
            } catch (IOException e) {
              // Failed to read or parse training data, training failed
              e.printStackTrace();
            } finally {
              // Release the training file handle whether or not training succeeded
              dataIn.close();
            }
    
          }
    
          public static class MyBigramFeatureGenerator implements FeatureGenerator {
    
            @Override
            public Collection<String> extractFeatures(String[] text) {
              // Join adjacent tokens with "_" so that token boundaries stay unambiguous
              return generate(Arrays.asList(text), 2, "_");
            }
    
            // Builds every n-gram of consecutive tokens, joined by the given separator
            private List<String> generate(List<String> input, int n, String separator) {
              List<String> outGrams = new ArrayList<String>();
              for (int i = 0; i + n <= input.size(); i++) {
                StringBuilder gram = new StringBuilder(input.get(i));
                for (int x = i + 1; x < i + n; x++) {
                  gram.append(separator).append(input.get(x));
                }
                outGrams.add(gram.toString());
              }
              return outGrams;
            }
          }
        }
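
To see what such a generator actually feeds into the model, here is a minimal standalone sketch with no OpenNLP dependency. The class name `BigramSketch`, the whitespace tokenization, and the "_" separator are my own choices for illustration, not anything OpenNLP prescribes.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch: shows which bigram strings are produced
// for a short, already-tokenized document.
final class BigramSketch {

  // Joins each pair of adjacent tokens with the given separator.
  static List<String> bigrams(String[] tokens, String separator) {
    List<String> out = new ArrayList<String>();
    for (int i = 0; i + 1 < tokens.length; i++) {
      out.add(tokens[i] + separator + tokens[i + 1]);
    }
    return out;
  }

  public static void main(String[] args) {
    String[] title = "cheap flights to london".split(" ");
    // Prints one bigram feature per line.
    for (String feature : bigrams(title, "_")) {
      System.out.println(feature);
    }
  }
}
```

Note that a one-token document produces no bigrams at all, so with very short texts it is probably worth passing a unigram generator alongside the bigram one rather than replacing it.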
    

    Hope this helps...

    【Comments】:

      【Answer 2】:

      You can use the NGramFeatureGenerator.java class in OpenNLP [1] for your use case.

      [1] https://github.com/apache/opennlp
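
For readers unfamiliar with the idea, an n-gram generator of this kind typically emits every run of consecutive tokens whose length falls within a chosen range. The sketch below is only an illustration of that idea and is not OpenNLP's code: the class name `NGramRangeSketch`, the `minGram`/`maxGram` parameters, and the "_" separator are assumptions of mine, and the feature strings the real class emits may look different, so check the Javadoc of the OpenNLP version you use.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not OpenNLP's implementation): collects all n-grams
// whose length n satisfies minGram <= n <= maxGram.
final class NGramRangeSketch {

  static List<String> ngrams(String[] tokens, int minGram, int maxGram) {
    List<String> out = new ArrayList<String>();
    for (int start = 0; start < tokens.length; start++) {
      StringBuilder gram = new StringBuilder();
      // Grow the n-gram one token at a time from the current start position.
      for (int n = 1; n <= maxGram && start + n <= tokens.length; n++) {
        if (n > 1) {
          gram.append('_');
        }
        gram.append(tokens[start + n - 1]);
        if (n >= minGram) {
          out.add(gram.toString());
        }
      }
    }
    return out;
  }
}
```

With a range of 1 to 2 this yields unigrams and bigrams together, which suits very short documents where bigrams alone would be sparse.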

      Thanks, Madhawa

      【Comments】:
