Remove symbols from text while tokenising in Elasticsearch

【Question title】: Remove symbols from text while tokenising in elastic search
【Posted】: 2020-04-23 12:53:05
【Question】:

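I am trying to create an Elasticsearch index that tokenises data in a particular way. For a string like "Los Angeles (and vicinity), California, United States of America", I want to exclude symbols such as ( ) and , and keep only alphanumeric characters.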

https://www.elastic.co/guide/en/elasticsearch/reference/7.x/analysis-keep-types-tokenfilter.html#analysis-keep-types-tokenfilter-customize

My ES index settings are as follows:

PUT /test-index
{
  "settings": {
    "analysis": {
      "filter": {
        "extract_alpha": {
          "type": "keep_types",
          "mode": "include",
          "types": [
            "<ALPHANUM>"
          ]
        }
      },
      "analyzer": {
        "my_autocomplete": { 
          "type":"custom",
          "tokenizer":"my_tokenizer",
          "filter" : [
            "lowercase",
            "extract_alpha"
          ]
        }
      },
      "tokenizer": {
        "my_tokenizer" : {
          "type": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_autocomplete"
      }
    }
  }
}

However, when I run this query:

GET test-index/_analyze
{
  "analyzer": "my_autocomplete",
  "text": "Los Angeles (and vicinity), California, United States of America"
}

I get this output:

{
  "tokens" : [ ]
}

If I remove the extract_alpha filter, I do get tokens, but they still contain the symbols:

{
  "tokens" : [
    {
      "token" : "los",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "angeles",
      "start_offset" : 4,
      "end_offset" : 11,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "(and",
      "start_offset" : 12,
      "end_offset" : 16,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "vicinity),",
      "start_offset" : 17,
      "end_offset" : 27,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "california,",
      "start_offset" : 28,
      "end_offset" : 39,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "united",
      "start_offset" : 40,
      "end_offset" : 46,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "states",
      "start_offset" : 47,
      "end_offset" : 53,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "of",
      "start_offset" : 54,
      "end_offset" : 56,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "america",
      "start_offset" : 57,
      "end_offset" : 64,
      "type" : "word",
      "position" : 8
    }
  ]
}

How can I fix this, and what am I doing wrong?


Tags: elasticsearch

【Solution 1】:

I am not sure why you are creating a custom analyzer. The default standard analyzer already handles this case, as shown below:

Analyze request:

POST http://{{hostname}}:{{port}}/{{index-name}}/_analyze
{
    "text" : "Los Angeles (and vicinity), California, United States of America",
    "analyzer" : "standard"
}

And the generated tokens:

{
    "tokens": [
        {
            "token": "los",
            "start_offset": 0,
            "end_offset": 3,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "angeles",
            "start_offset": 4,
            "end_offset": 11,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "and",
            "start_offset": 13,
            "end_offset": 16,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "vicinity",
            "start_offset": 17,
            "end_offset": 25,
            "type": "<ALPHANUM>",
            "position": 3
        },
        {
            "token": "california",
            "start_offset": 28,
            "end_offset": 38,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "united",
            "start_offset": 40,
            "end_offset": 46,
            "type": "<ALPHANUM>",
            "position": 5
        },
        {
            "token": "states",
            "start_offset": 47,
            "end_offset": 53,
            "type": "<ALPHANUM>",
            "position": 6
        },
        {
            "token": "of",
            "start_offset": 54,
            "end_offset": 56,
            "type": "<ALPHANUM>",
            "position": 7
        },
        {
            "token": "america",
            "start_offset": 57,
            "end_offset": 64,
            "type": "<ALPHANUM>",
            "position": 8
        }
    ]
}
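
If a custom analyzer is still needed (for example, to add further filters later), the likely reason the original setup returned no tokens is visible in the question's own output: the whitespace tokenizer labels every token with the type word, so a keep_types filter that includes only <ALPHANUM> discards all of them. A minimal sketch of a fix, assuming the rest of the setup stays unchanged, is to switch my_autocomplete to the built-in standard tokenizer, which emits <ALPHANUM> tokens and already splits off punctuation. The request body for PUT /test-index might then look like this:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "extract_alpha": {
          "type": "keep_types",
          "mode": "include",
          "types": ["<ALPHANUM>"]
        }
      },
      "analyzer": {
        "my_autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "extract_alpha"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_autocomplete"
      }
    }
  }
}
```

Re-running the GET test-index/_analyze request from the question against an index created this way should return the same nine tokens as the standard analyzer output above. Note that the standard tokenizer gives purely numeric tokens the type <NUM>, so "<NUM>" would need to be added to types if numbers should be kept as well.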
