【Question title】: Elasticsearch Analyzer first 4 and last 4 characters
【Posted】: 2019-08-29 17:15:23
【Question】:

Using Elasticsearch, I want to specify a search analyzer that tokenizes only the first 4 characters and the last 4 characters.

For example: supercalifragilisticexpialidocious => ["supe", "ious"]

I tried an ngram tokenizer as follows:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 4,
          "max_gram": 4
        }
      }
    }
  }
}

I am testing the analyzer as follows:

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "supercalifragilisticexpialidocious."
}

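This returns "supe", then a bunch of tokens I don't want, and "ious". My question is: how can I get only the first and last results from the ngram tokenizer specified above?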

{
  "tokens": [
    {
      "token": "supe",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "uper",
      "start_offset": 1,
      "end_offset": 5,
      "type": "word",
      "position": 1
    },
...
    {
      "token": "ciou",
      "start_offset": 29,
      "end_offset": 33,
      "type": "word",
      "position": 29
    },
    {
      "token": "ious",
      "start_offset": 30,
      "end_offset": 34,
      "type": "word",
      "position": 30
    },
    {
      "token": "ous.",
      "start_offset": 31,
      "end_offset": 35,
      "type": "word",
      "position": 31
    }
  ]
}
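
The output above is what a fixed-size ngram tokenizer is expected to produce: with min_gram and max_gram both set to 4, every window of 4 consecutive characters becomes a token. The following is a minimal Python sketch of that sliding window, not the tokenizer itself; the helper name `ngrams` is hypothetical and the sketch assumes no characters are filtered out, which matches the trailing "ous." token in the response.

```python
def ngrams(text, n=4):
    """Return every window of n consecutive characters, left to right."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


tokens = ngrams("supercalifragilisticexpialidocious.")

print(len(tokens))   # 32 windows for a 35-character input
print(tokens[0])     # supe
print(tokens[-2])    # ious
print(tokens[-1])    # ous.
```

The wanted tokens are only the first window and the last one that contains no punctuation, so some extra step is needed to discard everything in between.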

【Question comments】:

    Tags: elasticsearch query-analyzer


    【Solution 1】:

    One way to achieve this is to use the pattern_capture token filter and capture the first 4 and last 4 characters.

    First, define your index like this:

    PUT my_index
    {
      "settings": {
        "index": {
          "analysis": {
            "analyzer": {
              "my_analyzer": {
                "type": "custom",
                "tokenizer": "keyword",
                "filter": [
                  "lowercase",
                  "first_last_four"
                ]
              }
            },
            "filter": {
              "first_last_four": {
                "type": "pattern_capture",
                "preserve_original": false,
                "patterns": [
                  """(\w{4}).*(\w{4})"""
                ]
              }
            }
          }
        }
      }
    }
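
To see what the regular expression in the filter captures, here is a rough Python emulation of the analyzer chain (keyword tokenizer, lowercase, then one token per capture group). It is a sketch under assumptions, not the filter's implementation: the function name `first_last_four` is made up for illustration, `re.ASCII` is used to approximate Java's default `\w`, only the first match is considered, and the fallback for non-matching input assumes the original token is kept.

```python
import re

# Same expression as in the "first_last_four" filter. re.ASCII is an
# assumption made here to keep \w close to Java's default ASCII-only \w.
PATTERN = re.compile(r"(\w{4}).*(\w{4})", re.ASCII)


def first_last_four(text):
    """Emulate keyword tokenizer + lowercase + pattern_capture on one input."""
    token = text.lower()  # keyword tokenizer: the whole input is one token
    match = PATTERN.search(token)
    if match is None:
        # Assumption: when the pattern does not match, the token is kept as is.
        return [token]
    # One output token per capture group.
    return list(match.groups())


print(first_last_four("supercalifragilisticexpialidocious"))   # ['supe', 'ious']
print(first_last_four("supercalifragilisticexpialidocious."))  # ['supe', 'ious']
print(first_last_four("short"))                                # ['short']
```

The greedy `.*` consumes as much as possible and then backtracks until the second group finds 4 word characters, which is why a trailing period is skipped. Inputs with fewer than 8 word characters do not match the pattern at all.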
    

    Then you can test the new custom analyzer:

    POST my_index/_analyze
    {
      "text": "supercalifragilisticexpialidocious",
      "analyzer": "my_analyzer"
    }
    

    And you can see that the tokens you expect are there:

    {
      "tokens" : [
        {
          "token" : "supe",
          "start_offset" : 0,
          "end_offset" : 34,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "ious",
          "start_offset" : 0,
          "end_offset" : 34,
          "type" : "word",
          "position" : 0
        }
      ]
    }
    

    【Comments】:

    • Cool, glad it helped!