【Question title】:Remove symbols from text while tokenising in Elasticsearch
【Posted】:2020-04-23 12:53:05
【Question】:

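I am trying to create an Elasticsearch index that tokenizes data in a particular way. For a string such as Los Angeles (and vicinity), California, United States of America, I want to exclude symbols such as ( ) , and keep only alphanumeric characters.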

https://www.elastic.co/guide/en/elasticsearch/reference/7.x/analysis-keep-types-tokenfilter.html#analysis-keep-types-tokenfilter-customize

My ES index settings are as follows

PUT /test-index
{
  "settings": {
    "analysis": {
      "filter": {
        "extract_alpha": {
          "type": "keep_types",
          "mode": "include",
          "types": [
            "<ALPHANUM>"
          ]
        }
      },
      "analyzer": {
        "my_autocomplete": { 
          "type":"custom",
          "tokenizer":"my_tokenizer",
          "filter" : [
            "lowercase",
            "extract_alpha"
          ]
        }
      },
      "tokenizer": {
        "my_tokenizer" : {
          "type": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_autocomplete"
      }
    }
  }
}

However, when I run this query

GET test-index/_analyze
{
  "analyzer": "my_autocomplete",
  "text": "Los Angeles (and vicinity), California, United States of America"
}

I get this output

{
  "tokens" : [ ]
}

If I remove the extract_alpha filter, I do get tokens, but they still contain the symbols

{
  "tokens" : [
    {
      "token" : "los",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "angeles",
      "start_offset" : 4,
      "end_offset" : 11,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "(and",
      "start_offset" : 12,
      "end_offset" : 16,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "vicinity),",
      "start_offset" : 17,
      "end_offset" : 27,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "california,",
      "start_offset" : 28,
      "end_offset" : 39,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "united",
      "start_offset" : 40,
      "end_offset" : 46,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "states",
      "start_offset" : 47,
      "end_offset" : 53,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "of",
      "start_offset" : 54,
      "end_offset" : 56,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "america",
      "start_offset" : 57,
      "end_offset" : 64,
      "type" : "word",
      "position" : 8
    }
  ]
}

How can I fix this, and what am I doing wrong?

【Comments】:

    Tags: elasticsearch


    【Answer 1】:

    I don't see why you need a custom analyzer here; the default standard analyzer already handles this, as shown below:

    Analyze request

    POST http://{{hostname}}:{{port}}/{{index-name}}/_analyze
    {
        "text" : "Los Angeles (and vicinity), California, United States of America",
        "analyzer" : "standard"
    }
    

    And the generated tokens

    {
        "tokens": [
            {
                "token": "los",
                "start_offset": 0,
                "end_offset": 3,
                "type": "<ALPHANUM>",
                "position": 0
            },
            {
                "token": "angeles",
                "start_offset": 4,
                "end_offset": 11,
                "type": "<ALPHANUM>",
                "position": 1
            },
            {
                "token": "and",
                "start_offset": 13,
                "end_offset": 16,
                "type": "<ALPHANUM>",
                "position": 2
            },
            {
                "token": "vicinity",
                "start_offset": 17,
                "end_offset": 25,
                "type": "<ALPHANUM>",
                "position": 3
            },
            {
                "token": "california",
                "start_offset": 28,
                "end_offset": 38,
                "type": "<ALPHANUM>",
                "position": 4
            },
            {
                "token": "united",
                "start_offset": 40,
                "end_offset": 46,
                "type": "<ALPHANUM>",
                "position": 5
            },
            {
                "token": "states",
                "start_offset": 47,
                "end_offset": 53,
                "type": "<ALPHANUM>",
                "position": 6
            },
            {
                "token": "of",
                "start_offset": 54,
                "end_offset": 56,
                "type": "<ALPHANUM>",
                "position": 7
            },
            {
                "token": "america",
                "start_offset": 57,
                "end_offset": 64,
                "type": "<ALPHANUM>",
                "position": 8
            }
        ]
    }
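
    The two outputs in this thread also suggest why the custom analyzer returned nothing: the whitespace tokenizer labels every token with type word, while the standard tokenizer labels them <ALPHANUM>, so a keep_types filter that only includes <ALPHANUM> discards every whitespace token. If the custom analyzer is still wanted, a minimal sketch is to keep the question's settings and swap only the tokenizer. The index, analyzer, and filter names below are reused from the question; it assumes test-index does not already exist (otherwise it would have to be deleted first or given another name).

    ```console
    # Same settings as in the question, but using the built-in standard tokenizer
    PUT /test-index
    {
      "settings": {
        "analysis": {
          "filter": {
            "extract_alpha": {
              "type": "keep_types",
              "mode": "include",
              "types": [
                "<ALPHANUM>"
              ]
            }
          },
          "analyzer": {
            "my_autocomplete": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": [
                "lowercase",
                "extract_alpha"
              ]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "name": {
            "type": "text",
            "analyzer": "my_autocomplete"
          }
        }
      }
    }
    ```

    The combination can also be tried without creating an index, by passing the tokenizer and an inline filter definition to the analyze API:

    ```console
    # Ad-hoc check: standard tokenizer + lowercase + inline keep_types filter
    GET _analyze
    {
      "tokenizer": "standard",
      "filter": [
        "lowercase",
        {
          "type": "keep_types",
          "types": [
            "<ALPHANUM>"
          ]
        }
      ],
      "text": "Los Angeles (and vicinity), California, United States of America"
    }
    ```

    For the sample text this should yield the same nine tokens as the standard analyzer output above. Note that with the standard tokenizer the extract_alpha filter is no longer what removes the punctuation; the tokenizer already does that. The filter would still drop tokens of other types, for example purely numeric tokens, which the standard tokenizer labels <NUM>, so it may be better left out if numbers should remain searchable.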
    
