(How) Can I use Bigram Features with the OpenNLP Document Classifier?

[Question title]: (How) Can I use Bigram Features with the OpenNLP Document Classifier
[Posted]: 2013-08-01 15:58:15
[Question description]:

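(How) can I use bigram features with the OpenNLP document classifier?

I have a collection of very short documents (titles, phrases, and sentences), and I would like to add bigram features of the kind used in the tool LibShortText: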

http://www.csie.ntu.edu.tw/~cjlin/libshorttext/

Is this possible?

The documentation only explains how to do this with the name finder, using

BigramNameFeatureGenerator()

and not with the document classifier.

[Discussion]:

Tags: opennlp

[Solution 1]:

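I believe both the trainer and the classifier accept custom feature generators, but they must implement the doccat FeatureGenerator interface, and BigramNameFeatureGenerator does not. So I wrote a quick implementation as a nested class below. Try this (untested) code when you get a chance: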

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Collection;
    import java.util.List;
    import opennlp.tools.doccat.DoccatModel;
    import opennlp.tools.doccat.DocumentCategorizerME;
    import opennlp.tools.doccat.DocumentSample;
    import opennlp.tools.doccat.DocumentSampleStream;
    import opennlp.tools.doccat.FeatureGenerator;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;

    // Written against the OpenNLP 1.5.x doccat API; later releases changed
    // parts of this API, so the signatures may need adjusting.
    public class DoccatUsingBigram {

      public static void main(String[] args) throws IOException {
        // args[0]: training file, one sample per line ("category token token ...")
        InputStream dataIn = new FileInputStream(args[0]);
        try {
          ObjectStream<String> lineStream =
                  new PlainTextByLineStream(dataIn, "UTF-8");
          ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

          // Train with the custom generator (cutoff 10, 100 iterations)
          DoccatModel model = DocumentCategorizerME.train(
                  "en", sampleStream, 10, 100, new MyBigramFeatureGenerator());

          // Classify with the same generator. The categorizer applies it to the
          // raw tokens itself, so the bigrams must not be pre-computed here.
          DocumentCategorizerME classifier =
                  new DocumentCategorizerME(model, new MyBigramFeatureGenerator());
          String[] someData = "whatever you are trying to classify".split(" ");
          double[] outcomes = classifier.categorize(someData);
          System.out.println(classifier.getBestCategory(outcomes));
        } catch (IOException e) {
          // Failed to read or parse training data, training failed
          e.printStackTrace();
        } finally {
          dataIn.close();
        }
      }

      public static class MyBigramFeatureGenerator implements FeatureGenerator {

        @Override
        public Collection<String> extractFeatures(String[] text) {
          // A non-empty separator keeps ("ab", "c") distinct from ("a", "bc")
          return generate(Arrays.asList(text), 2, "_");
        }

        // Joins every run of n consecutive tokens with the separator
        private List<String> generate(List<String> input, int n, String separator) {
          List<String> outGrams = new ArrayList<String>();
          for (int i = 0; i + n <= input.size(); i++) {
            StringBuilder gram = new StringBuilder(input.get(i));
            for (int x = i + 1; x < i + n; x++) {
              gram.append(separator).append(input.get(x));
            }
            outGrams.add(gram.toString());
          }
          return outGrams;
        }
      }
    }

Hope this helps...

[Discussion]:

[Solution 2]:

You can use the NGramFeatureGenerator class (in the opennlp.tools.doccat package) in OpenNLP [1] for your use case.

[1]https://github.com/apache/opennlp
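
As a rough illustration of what an n-gram feature generator contributes, here is a minimal, OpenNLP-free sketch. `NGramSketch`, its `ngrams` method, the `ng=` prefix, and the `:` separator are all hypothetical choices for this example and are not taken from the library's source; the real NGramFeatureGenerator may format its features differently. Setting the range to 1..2 emits unigrams alongside bigrams, which may help with very short texts where bigrams alone are sparse.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: NGramSketch and ngrams() are hypothetical names,
// not part of OpenNLP.
public class NGramSketch {

    // Returns every n-gram of the tokens for minGram <= n <= maxGram.
    // Tokens are joined with ':' and each feature is tagged with "ng="
    // (both are arbitrary choices made for this sketch).
    public static List<String> ngrams(String[] tokens, int minGram, int maxGram) {
        List<String> features = new ArrayList<>();
        for (int n = minGram; n <= maxGram; n++) {
            // i + n <= length guarantees a full window of n tokens
            for (int i = 0; i + n <= tokens.length; i++) {
                StringBuilder sb = new StringBuilder("ng=");
                for (int j = i; j < i + n; j++) {
                    if (j > i) {
                        sb.append(':');
                    }
                    sb.append(tokens[j]);
                }
                features.add(sb.toString());
            }
        }
        return features;
    }

    public static void main(String[] args) {
        String[] title = "cheap flights to paris".split(" ");
        // Unigrams first, then bigrams
        System.out.println(ngrams(title, 1, 2));
    }
}
```

For the four-token title above this yields four unigram features followed by three bigram features (`ng=cheap:flights`, `ng=flights:to`, `ng=to:paris`). A document shorter than `minGram` tokens produces no features at all, which is worth keeping in mind for one-word titles.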

Thanks, Madhawa

[Discussion]: