【Question title】: Add leading/trailing space to elasticsearch tokenizer ngram
【Posted】: 2021-02-02 21:09:59
【Question】:

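I am trying to generate ngram features with an elasticsearch analyzer. In particular, I want to add a leading and a trailing space to the text. For example, if the text is "2 Quick Foxes", the ngram features with leading/trailing spaces would be: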

" 2 ", "2 Q", ....., "Fox", "oxe", "xes", "es "

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "2 Quick Foxes"
}

【Comments】:

    Tags: elasticsearch n-gram


    【Solution 1】:

    You can add two pattern replace character filters, one for the leading space and one for the trailing space:

    PUT my-index-000001
    {
      "settings": {
        "index": {
          "analysis": {
            "analyzer": {
              "my_analyzer": {
                "tokenizer": "my_tokenizer",
                "char_filter": [
                  "leading_space",
                  "trailing_space"
                ]
              }
            },
            "tokenizer": {
              "my_tokenizer": {
                "type": "ngram",
                "min_gram": 3,
                "max_gram": 3,
                "token_chars": [
                  "letter",
                  "digit",
                  "whitespace"       
                ]
              }
            },
            "char_filter": {
              "leading_space": {
                "type": "pattern_replace",
                "pattern": "(^.)",
                "replacement": " $1"
              },
              "trailing_space": {
                "type": "pattern_replace",
                "pattern": "(.$)",
                "replacement": "$1 "
              }
            }
          }
        }
      }
    }
    

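    Note that whitespace was added to the token_chars of my_tokenizer. Without it, the tokenizer splits the text on spaces, so no n-gram can contain one and the above will not work.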
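To see what the two character filters and the whitespace entry in token_chars each contribute, here is a rough Python emulation of the analysis chain. It is only a sketch and does not call Elasticsearch: the function names `apply_char_filters` and `ngrams` are made up for illustration, Python's `re` stands in for the Java regex engine used by pattern_replace, and `str.isalnum` / `str.isspace` only approximate the letter, digit and whitespace character classes.

```python
import re
from itertools import groupby


def apply_char_filters(text):
    """Emulate the leading_space and trailing_space char filters, in order."""
    text = re.sub(r"(^.)", r" \1", text)  # leading_space: " $1"
    text = re.sub(r"(.$)", r"\1 ", text)  # trailing_space: "$1 "
    return text


def ngrams(text, n=3, keep_whitespace=True):
    """Emulate an ngram tokenizer with min_gram == max_gram == n.

    Characters outside the allowed classes act as split points, which is
    how token_chars behaves; keep_whitespace toggles the whitespace class.
    """
    def allowed(ch):
        return ch.isalnum() or (keep_whitespace and ch.isspace())

    tokens = []
    for ok, run in groupby(text, key=allowed):
        if not ok:
            continue  # characters that are not token_chars only separate chunks
        chunk = "".join(run)
        # Chunks shorter than n produce no tokens at all.
        tokens.extend(chunk[i:i + n] for i in range(len(chunk) - n + 1))
    return tokens


if __name__ == "__main__":
    filtered = apply_char_filters("2 Quick Foxes")
    print(repr(filtered))
    print(ngrams(filtered))                         # with "whitespace"
    print(ngrams(filtered, keep_whitespace=False))  # without "whitespace"
```

Under these assumptions, the padded text " 2 Quick Foxes " yields thirteen trigrams running from " 2 " to "es ". With whitespace left out, the text is split into "2", "Quick" and "Foxes", the padding is discarded, "2" is too short for a trigram, and only "Qui", "uic", "ick", "Fox", "oxe", "xes" remain. The real result should still be checked with the _analyze request from the question.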

    【Discussion】:
