Answers: Lucene multi word tokens with delimiter

【Question title】: Lucene multi word tokens with delimiter
【Posted】: 2014-10-13 21:21:34
【Question】:

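I am fairly new to Lucene, so this may be a beginner's question. We are trying to implement semantic search over digital books and already have a concept generator. For example, the context I generate for a new article could be:

|Green Beans | Spring Onions | Cooking |

I am using Lucene to build an index over the books/articles using only the extracted concepts (stored in a temporary document for that purpose). Right now the standard analyzer creates single-word tokens: Green, Beans, Spring, Onions, Cooking, which of course is not the same thing.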

My question: is there an analyzer that can detect the delimiters around tokens (|| in our example), or one that can detect multi-word constructs?

I am afraid we will have to write our own analyzer, but I don't know where to start.

【Question discussion】:

Tags: java lucene token analyzer

【Solution 1】:

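Creating an analyzer is pretty easy. An analyzer is just a tokenizer, optionally followed by token filters. In your case you would have to write your own tokenizer. Fortunately, there is a convenient base class for that: CharTokenizer.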

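You implement the isTokenChar method and make sure it returns false for the | character and true for any other character. Everything else, spaces included, is then treated as part of a token.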

Once you have the tokenizer, the analyzer should be straightforward: look at the source code of any existing analyzer and do likewise.

Oh, and if there can be spaces around your | characters, just add a TrimFilter to the analyzer.
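
To make the rule above concrete, here is a minimal plain-Java sketch of the same logic. It is not Lucene code: the class name PipeDelimitedSketch and its methods are hypothetical, and the loop only imitates what a CharTokenizer subclass followed by a TrimFilter would do. Dropping tokens that are empty after trimming is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only: this is NOT a Lucene class.
// It imitates a CharTokenizer subclass followed by a TrimFilter.
final class PipeDelimitedSketch {

    // Same rule as described for isTokenChar: '|' ends a token,
    // every other character (spaces included) belongs to the token.
    static boolean isTokenChar(int c) {
        return c != '|';
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); ) {
            int c = input.codePointAt(i);
            i += Character.charCount(c);
            if (isTokenChar(c)) {
                current.appendCodePoint(c);
            } else {
                flush(current, tokens);
            }
        }
        flush(current, tokens); // emit whatever follows the last '|'
        return tokens;
    }

    // Trims the buffered text (the TrimFilter step) and emits it.
    // Assumption of this sketch: tokens that are empty after trimming are dropped.
    private static void flush(StringBuilder current, List<String> tokens) {
        String trimmed = current.toString().trim();
        if (!trimmed.isEmpty()) {
            tokens.add(trimmed);
        }
        current.setLength(0);
    }

    public static void main(String[] args) {
        // prints [Green Beans, Spring Onions, Cooking]
        System.out.println(tokenize("|Green Beans | Spring Onions | Cooking |"));
    }
}
```

In a real analyzer only the isTokenChar rule would be your own code; buffering and trimming would be left to Lucene's CharTokenizer and TrimFilter.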

【Discussion】:

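Hi, yes, I was afraid we would have to do that. I was just looking at org.apache.lucene.analysis.pattern, and it seems they already provide a tokenizer that fits our case, based on regex = (|[^|]+|). I guess we should use that tokenizer and build our own analyzer on top of it.
A CharTokenizer-based tokenizer would be faster, but if you want to use PatternTokenizer, I suggest the following regex: pattern = \s*\|\s* with group = -1. This splits the input string on the regex, which trims your tokens in the process.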
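
To see what splitting on \s*\|\s* produces, the regex can be tried with plain java.util.regex before wiring it into Lucene. This is only a sketch: PatternSplitSketch is a hypothetical name, Pattern.split stands in for PatternTokenizer's split mode (group = -1), and skipping empty pieces is a choice made here rather than a statement about PatternTokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java illustration only: java.util.regex stands in for PatternTokenizer.
final class PatternSplitSketch {

    // The delimiter suggested in the comment: a '|' plus any surrounding whitespace.
    private static final Pattern DELIMITER = Pattern.compile("\\s*\\|\\s*");

    static List<String> split(String input) {
        List<String> tokens = new ArrayList<>();
        for (String piece : DELIMITER.split(input)) {
            // A leading '|' yields one empty piece; this sketch skips it.
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Green Beans, Spring Onions, Cooking]
        System.out.println(split("|Green Beans | Spring Onions | Cooking |"));
    }
}
```

Because the whitespace is consumed as part of the delimiter, no separate trimming step is needed with this pattern.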

【Solution 2】:

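I came across this question because I am doing something with my Lucene machinery that builds data structures to do with sequencing, in effect "hijacking" the Lucene classes. Otherwise I can't imagine why anyone would want to know the separators ("delimiters") between tokens, but since this was quite tricky I thought I would put it here for the benefit of anyone who might need it.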

You have to write your own versions of StandardTokenizer and StandardTokenizerImpl. Both are final classes, so you can't extend them.

SeparatorDeliveringTokeniserImpl (adapted from the source of StandardTokenizerImpl):

3 new fields:

private int startSepPos = 0;
private int endSepPos = 0;
private String originalBufferAsString;

Tweak these methods:

public final void getText(CharTermAttribute t) {
    t.copyBuffer(zzBuffer, zzStartRead, zzMarkedPos - zzStartRead);
    if( originalBufferAsString == null ){
        originalBufferAsString = new String( zzBuffer, 0, zzBuffer.length );
    }
    // startSepPos == -1 is a "flag condition": it means that this token is the last one and it won't be followed by a sep
    if( startSepPos != -1 ){
        // if the flag is NOT set, record the start pos of the next sep...
        startSepPos = zzMarkedPos;
    }
}

public final void yyreset(java.io.Reader reader) {
    zzReader = reader;
    zzAtBOL = true;
    zzAtEOF = false;
    zzEOFDone = false;
    zzEndRead = zzStartRead = 0;
    zzCurrentPos = zzMarkedPos = 0;
    zzFinalHighSurrogate = 0;
    yyline = yychar = yycolumn = 0;
    zzLexicalState = YYINITIAL;
    if (zzBuffer.length > ZZ_BUFFERSIZE)
        zzBuffer = new char[ZZ_BUFFERSIZE];
    // reset fields responsible for delivering separator...
    originalBufferAsString = null;
    startSepPos = 0;
    endSepPos = 0;
}

(in getNextToken:)

if ((zzAttributes & 1) == 1) {
    zzAction = zzState;
    zzMarkedPosL = zzCurrentPosL;
    if ((zzAttributes & 8) == 8) {
        // every occurrence of a separator char leads here...
        endSepPos = zzCurrentPosL;
        break zzForAction;
    }
}

And add a new method:

String getPrecedingSeparator() {
    String sep = null;
    if( originalBufferAsString == null ){
        sep = new String( zzBuffer, 0, endSepPos );
    }
    else if( startSepPos == -1 || endSepPos <= startSepPos ){
        sep = "";
    }
    else {
        sep = originalBufferAsString.substring( startSepPos, endSepPos );
    }
    if( zzMarkedPos < startSepPos ){
        // ... then this is a sign that the next token will be the last one... and will NOT have a trailing separator
        // so set a "flag condition" for next time this method is called
        startSepPos = -1;
    }
    return sep;
}

SeparatorDeliveringTokeniser (adapted from the source of StandardTokenizer):

Add this:

private String separator;
String getSeparator(){
    // normally this delivers a preceding separator... but after incrementToken returns false, if there is a trailing
    // separator, it then delivers that...
    return separator;
}

(within incrementToken:)

while(true) {
  int tokenType = scanner.getNextToken();

  // added NB this gives you the separator which PRECEDES the token
  // which you are about to get from scanner.getText( ... )
  separator = scanner.getPrecedingSeparator();

  if (tokenType == SeparatorDeliveringTokeniserImpl.YYEOF) {
    // NB at this point separator holds the trailing separator (if any)...
    return false;
  }
  ...

Usage:

In my FilteringTokenFilter subclass, called TokenAndSeparatorExamineFilter, the methods accept and end look like this:

@Override
public boolean accept() throws IOException {
    String sep = ((SeparatorDeliveringTokeniser) input).getSeparator();
    // a preceding separator can only be an empty String if we are currently 
    // dealing with the first token and if the sequence starts with a token 
    if (!sep.isEmpty()) {
       // ... do something with the preceding separator
    }
    // then get the token...
    String token = getTerm();
    // ... do something with the token

    // my filter does no filtering! Every token is accepted...:
    return true;
}

@Override
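public void end() throws IOException {
    // overrides of end() are expected to call super.end() so the call is chained to the input stream
    super.end();
    // deals with trailing separator at the end of a sequence of tokens and separators (if there is one, i.e. if it doesn't end with a token)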
    String sep = ((SeparatorDeliveringTokeniser) input).getSeparator();
    // NB will be an empty String if there is no trailing separator
    if (!sep.isEmpty()) {
        // ... do something with this trailing separator
    }
}
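
For readers who only want to see what "preceding separator" and "trailing separator" mean here, the following plain-Java sketch produces the same kind of sequence without touching Lucene. It is an illustration under stated assumptions, not the tokenizer above: SeparatorWalkSketch is a hypothetical name, and a token is assumed to be a run of letters or digits, which is far cruder than StandardTokenizer's real rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java illustration only: NOT the modified Lucene tokenizer.
final class SeparatorWalkSketch {

    // Assumption: a token is a run of letters or digits.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+");

    // Returns [sep0, tok0, sep1, tok1, ..., trailingSep].
    // Any separator may be the empty string, so the list always has odd length.
    static List<String> walk(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        int previousEnd = 0;
        while (m.find()) {
            out.add(text.substring(previousEnd, m.start())); // separator preceding this token
            out.add(m.group());                              // the token itself
            previousEnd = m.end();
        }
        out.add(text.substring(previousEnd)); // trailing separator, "" if the text ends with a token
        return out;
    }

    public static void main(String[] args) {
        // prints [, Green,  , Beans,  | , Cooking, .]
        System.out.println(walk("Green Beans | Cooking."));
    }
}
```

The first element plays the role of the separator seen by accept for the first token, and the last element corresponds to what end receives after incrementToken has returned false.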

【Discussion】: