Error in keyword extraction using Lucene (answers)

【Question title】: Error in keyword extraction using Lucene
【Posted】: 2014-06-25 08:42:50
【Question description】:

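I am completely new to the concept of text extraction. While searching for an example, I found one implemented with Lucene. I simply tried to run it in Eclipse, but it gave an error. This is the error I get: (TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow). I took the code directly from an article published on the web and made a few modifications, because I first wanted to make sure the code runs without errors and only then understand its parts one by one. The original code gets its text from a URL, but I changed it to get the text from a defined string (it is in the main method). Since I am using Lucene version 4.8, I also changed the version constant.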

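I also searched for the error and made some modifications, but I still get it. My code is below. Could you help me get rid of this error? Where should I modify the code to avoid it? This is the link where I got the code: http://pastebin.com/jNALz7DZ. This is the code I modified: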

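import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.ClassicFilter;
import org.apache.lucene.analysis.standard.ClassicTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class KeywordsGuesser {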

     /** Lucene version. */
     private static Version LUCENE_VERSION = Version.LUCENE_48;

     /**
      * Keyword holder, composed by a unique stem, its frequency, and a set of found corresponding
      * terms for this stem.
      */
    public static class Keyword implements Comparable<Keyword> {

         /** The unique stem. */
         private String stem;

         /** The frequency of the stem. */
         private Integer frequency;

         /** The found corresponding terms for this stem. */
        private Set<String> terms;

         /**
          * Unique constructor.
          * 
          * @param stem The unique stem this instance must hold.
          */
         public Keyword(String stem) {
             this.stem = stem;
            terms = new HashSet<String>();
             frequency = 0;
         }

         /**
          * Add a found corresponding term for this stem. If this term has been already found, it
          * won't be duplicated but the stem frequency will still be incremented.
          * 
          * @param term The term to add.
          */
         private void add(String term) {
             terms.add(term);
             frequency++;
         }

         /**
          * Gets the unique stem of this instance.
          * 
          * @return The unique stem.
          */
         public String getStem() {
             return stem;
         }

         /**
          * Gets the frequency of this stem.
          * 
          * @return The frequency.
          */
         public Integer getFrequency() {
             return frequency;
         }

         /**
          * Gets the list of found corresponding terms for this stem.
          * 
          * @return The list of found corresponding terms.
          */
        public Set<String> getTerms() {
             return terms;
         }

         /**
          * Used to reverse sort a list of keywords based on their frequency (from the most frequent
          * keyword to the least frequent one).
          */
         @Override
         public int compareTo(Keyword o) {
             return o.frequency.compareTo(frequency);
         }

         /**
          * Used to keep unicity between two keywords: only their respective stems are taken into
          * account.
          */
         @Override
         public boolean equals(Object obj) {
             return obj instanceof Keyword && obj.hashCode() == hashCode();
         }

         /**
          * Used to keep unicity between two keywords: only their respective stems are taken into
          * account.
          */
         @Override
         public int hashCode() {
             return Arrays.hashCode(new Object[] { stem });
         }

         /**
          * User-readable representation of a keyword: "[stem] x[frequency]".
          */
         @Override
         public String toString() {
             return stem + " x" + frequency;
         }

     }

     /**
      * Stemmize the given term.
      * 
      * @param term The term to stem.
      * @return The stem of the given term.
      * @throws IOException If an I/O error occurred.
      */
     private static String stemmize(String term) throws IOException {

         // tokenize term
         TokenStream tokenStream = new ClassicTokenizer(LUCENE_VERSION, new StringReader(term));
         // stemmize
         tokenStream = new PorterStemFilter(tokenStream);

         Set<String> stems = new HashSet<String>();
         CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
         // for each token
         while (tokenStream.incrementToken()) {
             // add it in the dedicated set (to keep unicity)
             stems.add(token.toString());
         }

         // if no stem or 2+ stems have been found, return null
         if (stems.size() != 1) {
             return null;
         }

         String stem = stems.iterator().next();

         // if the stem has non-alphanumerical chars, return null
         if (!stem.matches("[\\w-]+")) {
             return null;
         }

         return stem;
     }

     /**
      * Tries to find the given example within the given collection. If it hasn't been found, the
      * example is automatically added in the collection and is then returned.
      * 
      * @param collection The collection to search into.
      * @param example The example to search.
      * @return The existing element if it has been found, the given example otherwise.
      */
     private static <T> T find(Collection<T> collection, T example) {
         for (T element : collection) {
             if (element.equals(example)) {
                 return element;
             }
         }
         collection.add(example);
         return example;
     }

     /**
      * Extracts text content from the given URL and guesses keywords within it (needs jsoup parser).
      * 
      * @param url The URL to read.
      * @return A set of potential keywords. The first keyword is the most frequent one, the last the
      *         least frequent.
      * @throws IOException If an I/O error occurred.
      * @see <a href="http://jsoup.org/">http://jsoup.org/</a>
      */
     public static List<Keyword> guessFromUrl(String url) throws IOException {
         // get textual content from url
         //Document doc = Jsoup.connect(url).get();
         //String content = doc.body().text();

       String content = url;
         // guess keywords from this content
         return guessFromString(content);
     }

     /**
      * Guesses keywords from given input string.
      * 
      * @param input The input string.
      * @return A set of potential keywords. The first keyword is the most frequent one, the last the
      *         least frequent.
      * @throws IOException If an I/O error occurred.
      */
     public static List<Keyword> guessFromString(String input) throws IOException {

         // hack to keep dashed words (e.g. "non-specific" rather than "non" and "specific")
         input = input.replaceAll("-+", "-0");
         // replace any punctuation char except dashes and apostrophes with a space
         input = input.replaceAll("[\\p{Punct}&&[^'-]]+", " ");
         // replace most common English contractions
         input = input.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\\b", "");

         // tokenize input
         TokenStream tokenStream = new ClassicTokenizer(LUCENE_VERSION, new StringReader(input));
         // to lower case
         tokenStream = new LowerCaseFilter(LUCENE_VERSION, tokenStream);
         // remove dots from acronyms (and "'s" but already done manually above)
         tokenStream = new ClassicFilter(tokenStream);
         // convert any char to ASCII
         tokenStream = new ASCIIFoldingFilter(tokenStream);
         // remove english stop words
         tokenStream = new StopFilter(LUCENE_VERSION, tokenStream, EnglishAnalyzer.getDefaultStopSet());

         List<Keyword> keywords = new LinkedList<Keyword>();
         CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);

         // for each token
         while (tokenStream.incrementToken()) {
             String term = token.toString();
             // stemmize
             String stem = stemmize(term);
             if (stem != null) {
                 // create the keyword or get the existing one if any
                 Keyword keyword = find(keywords, new Keyword(stem.replaceAll("-0", "-")));
                 // add its corresponding initial token
                 keyword.add(term.replaceAll("-0", "-"));
             }
         }



         tokenStream.end();
         tokenStream.close();


         // reverse sort by frequency
         Collections.sort(keywords);

         return keywords;
     }



     public static void main(String args[]) throws IOException{

       String input = "Java is a computer programming language that is concurrent, "
               + "class-based, object-oriented, and specifically designed to have as few "
               + "implementation dependencies as possible. It is intended to let application developers "
               + "write once, run anywhere (WORA), "
               + "meaning that code that runs on one platform does not need to be recompiled "
               + "to run on another. Java applications are typically compiled to byte code (class file) "
               + "that can run on any Java virtual machine (JVM) regardless of computer architecture. "
               + "Java is, as of 2014, one of the most popular programming languages in use, particularly "
               + "for client-server web applications, with a reported 9 million developers."
               + "[10][11] Java was originally developed by James Gosling at Sun Microsystems "
               + "(which has since merged into Oracle Corporation) and released in 1995 as a core "
               + "component of Sun Microsystems' Java platform. The language derives much of its syntax "
               + "from C and C++, but it has fewer low-level facilities than either of them."
               + "The original and reference implementation Java compilers, virtual machines, and "
               + "class libraries were developed by Sun from 1991 and first released in 1995. As of "
               + "May 2007, in compliance with the specifications of the Java Community Process, "
               + "Sun relicensed most of its Java technologies under the GNU General Public License. "
               + "Others have also developed alternative implementations of these Sun technologies, "
               + "such as the GNU Compiler for Java (byte code compiler), GNU Classpath "
               + "(standard libraries), and IcedTea-Web (browser plugin for applets).";

       System.out.println(KeywordsGuesser.guessFromString(input));
     }



 }
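
An aside on the three `replaceAll` calls at the top of `guessFromString`: they can be tried in isolation, without Lucene on the classpath. The sketch below copies the three patterns verbatim from the code above; the class name `PreprocessSketch` and the helper `words` are made up for illustration. It only shows what the string looks like before tokenization and makes no claim about how `ClassicTokenizer` then splits it.

```java
import java.util.Arrays;
import java.util.List;

class PreprocessSketch {

    /** Applies the same three replacements, in the same order, as guessFromString. */
    static String preprocess(String input) {
        // mark every dash run with "-0" (the "keep dashed words" hack from the code above)
        input = input.replaceAll("-+", "-0");
        // replace punctuation other than dashes and apostrophes with a space
        input = input.replaceAll("[\\p{Punct}&&[^'-]]+", " ");
        // strip common English contraction suffixes such as 's, 're, 'll
        input = input.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\\b", "");
        return input;
    }

    /** Hypothetical helper: splits on whitespace so the result is easy to inspect. */
    static List<String> words(String input) {
        return Arrays.asList(preprocess(input).trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(words("Java's class-based, object-oriented design; they're fine."));
        // prints [Java, class-0based, object-0oriented, design, they, fine]
    }
}
```

The `-0` marker is undone later in `guessFromString` by `replaceAll("-0", "-")`, which is why the final keywords show plain dashes.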

This is the error Eclipse outputs:

Exception in thread "main" java.lang.IllegalStateException: TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.
    at org.apache.lucene.analysis.Tokenizer$1.read(Tokenizer.java:110)
    at org.apache.lucene.analysis.standard.ClassicTokenizerImpl.zzRefill(ClassicTokenizerImpl.java:431)
    at org.apache.lucene.analysis.standard.ClassicTokenizerImpl.getNextToken(ClassicTokenizerImpl.java:638)
    at org.apache.lucene.analysis.standard.ClassicTokenizer.incrementToken(ClassicTokenizer.java:140)
    at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
    at org.apache.lucene.analysis.standard.ClassicFilter.incrementToken(ClassicFilter.java:47)
    at org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter.incrementToken(ASCIIFoldingFilter.java:104)
    at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:82)
    at beehex.lucene.KeywordsGuesser.guessFromString(KeywordsGuesser.java:239)
    at beehex.lucene.KeywordsGuesser.main(KeywordsGuesser.java:288)

After getting rid of the error, my output is:

[java x10, develop x5, sun x5, run x4, compil x4, languag x3, implement x3, applic x3, code x3, gnu x3, comput x2, program x2, specif x2, have x2, on x2, platform x2, byte x2, class x2, virtual x2, machin x2, most x2, origin x2, microsystem x2, ha x2, releas x2, 1995 x2, it x2, from x2, c x2, librari x2, technolog x2, concurr x1, class-bas x1, object-ori x1, design x1, few x1, depend x1, possibl x1, intend x1, let x1, write x1, onc x1, anywher x1, wora x1, mean x1, doe x1, need x1, recompil x1, anoth x1, typic x1, file x1, can x1, ani x1, jvm x1, regardless x1, architectur x1, 2014 x1, popular x1, us x1, particularli x1, client-serv x1, web x1, report x1, 9 x1, million x1, 10 x1, 11 x1, jame x1, gosl x1, which x1, sinc x1, merg x1, oracl x1, corpor x1, core x1, compon x1, deriv x1, much x1, syntax x1, fewer x1, low-level x1, facil x1, than x1, either x1, them x1, refer x1, were x1, 1991 x1, first x1, mai x1, 2007 x1, complianc x1, commun x1, process x1, relicens x1, under x1, gener x1, public x1, licens x1, other x1, also x1, altern x1, classpath x1, standard x1, icedtea-web x1, browser x1, plugin x1, applet x1]

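【Question comments】:

Please post the exact stack trace that your IDE outputs.
Thank you for your kind cooperation, but sorry, your idea is not very clear to me. What should I do?
Possible duplicate of Apache Lucene TokenStream contract violation

Tags: java lucene tokenize feature-extraction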

【Solution 1】:

You need to reset the TokenStream object before calling its incrementToken method, as the error message points out:

// add this line
tokenStream.reset();
while (tokenStream.incrementToken()) {
....
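
The TokenStream Javadoc that the error message refers to describes the consuming workflow as `reset()`, then repeated `incrementToken()` calls, then `end()`, then `close()`. The question's code has two such loops, one in `stemmize` and one in `guessFromString`, and each stream needs its own `reset()` before its loop. The sketch below models that call order with plain Java so it runs without Lucene. `SketchTokenStream` and `ConsumeWorkflowSketch` are hypothetical stand-ins written for illustration, not Lucene classes, and the whitespace splitting is an assumption that has nothing to do with how `ClassicTokenizer` works.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/**
 * Hypothetical stand-in for a token stream; NOT Lucene's implementation.
 * It only models the rule that reset() must precede incrementToken().
 */
class SketchTokenStream {
    private final List<String> source;
    private Iterator<String> cursor; // stays null until reset() is called
    private String current;

    SketchTokenStream(String text) {
        // assumption for this sketch: tokens are whitespace-separated words
        this.source = Arrays.asList(text.trim().split("\\s+"));
    }

    /** Prepares the stream for consumption. */
    void reset() {
        cursor = source.iterator();
    }

    /** Advances to the next token; fails if reset() was skipped. */
    boolean incrementToken() {
        if (cursor == null) {
            throw new IllegalStateException("contract violation: reset() call missing");
        }
        if (!cursor.hasNext()) {
            return false;
        }
        current = cursor.next();
        return true;
    }

    String term() {
        return current;
    }

    /** Placeholder for end-of-stream bookkeeping. */
    void end() {
    }

    /** Releases the cursor; a new reset() is needed before reuse. */
    void close() {
        cursor = null;
    }
}

class ConsumeWorkflowSketch {

    /** Consumes a stream in the order reset, incrementToken loop, end, close. */
    static List<String> consume(SketchTokenStream stream) {
        List<String> terms = new ArrayList<String>();
        stream.reset();
        try {
            while (stream.incrementToken()) {
                terms.add(stream.term());
            }
            stream.end();
        } finally {
            stream.close();
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(consume(new SketchTokenStream("write once run anywhere")));
        try {
            new SketchTokenStream("java").incrementToken(); // reset() skipped on purpose
        } catch (IllegalStateException e) {
            System.out.println("Failed as expected: " + e.getMessage());
        }
    }
}
```

Putting `close()` in a `finally` block is a design choice of this sketch, so that the stream is released even if the loop throws.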

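【Comments】:

Thank you very much. I am getting output now. I also referred to this post on the Stack Overflow site: stackoverflow.com/questions/17447045/… Is there a way to display the output mentioned in that post? I changed my main method to System.out.println(KeywordsGuesser.guessFromString(input));
If you want to learn more about Lucene, I suggest taking a look at the Lucene demo module: lucene.apache.org/core/4_8_0/demo/overview-summary.html. If you want to understand inverted indexes, stemming, and so on, read an information retrieval book.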