【Question title】: Search for both numbers and text using a built-in or custom analyzer in Elasticsearch
【Posted】: 2017-10-22 10:28:20
【Question description】:

This question is a continuation of my previous SO question. I have some text on which I want to search for both numbers and words.

My text:

8080.foobar.getFooLabelFrombar(test.java:91)

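I want to be able to search on getFooLabelFrombar, fooBar, 8080, and 91.

Previously I was using the simple analyzer, which tokenizes the above text into the tokens below.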

{
  "tokens": [
    {
      "token": "foobar",
      "start_offset": 10,
      "end_offset": 16,
      "type": "word",
      "position": 2
    },
    {
      "token": "getfoolabelfrombar",
      "start_offset": 17,
      "end_offset": 35,
      "type": "word",
      "position": 3
    },
    {
      "token": "test",
      "start_offset": 36,
      "end_offset": 40,
      "type": "word",
      "position": 4
    },
    {
      "token": "java",
      "start_offset": 41,
      "end_offset": 45,
      "type": "word",
      "position": 5
    }
  ]
}

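With these tokens, searching on foobar and getFooLabelFrombar returned results, but searching on 8080 and 91 did not, because the simple analyzer discards digits and never emits number tokens.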
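To make the behavior above concrete, here is a rough Python approximation of what the simple analyzer does to this text: split on anything that is not a letter, then lowercase. This is only a sketch for illustration, not Elasticsearch's implementation, and `simple_analyze` is a hypothetical helper name.

```python
import re

def simple_analyze(text):
    """Rough stand-in for the simple analyzer: keep runs of letters, lowercase them."""
    # [^\W\d_]+ matches runs of letters only (digits and underscores are excluded),
    # so "8080" and "91" never become tokens.
    return [tok.lower() for tok in re.findall(r"[^\W\d_]+", text)]

tokens = simple_analyze("8080.foobar.getFooLabelFrombar(test.java:91)")
print(tokens)
# ['foobar', 'getfoolabelfrombar', 'test', 'java']
```

The digits are dropped during tokenization itself, so no token filter applied afterwards can bring them back.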

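Then, as suggested in the previous SO post, I changed the analyzer to standard. Now the numbers are searchable, but the two word search strings (foobar and getFooLabelFrombar) are not, because the standard analyzer creates the following tokens: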

{
  "tokens": [
    {
      "token": "8080",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<NUM>",
      "position": 1
    },
    {
      "token": "foobar.getfoolabelfrombar",
      "start_offset": 5,
      "end_offset": 35,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "test.java",
      "start_offset": 36,
      "end_offset": 45,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "91",
      "start_offset": 46,
      "end_offset": 48,
      "type": "<NUM>",
      "position": 4
    }
  ]
}

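I have tried all the existing analyzers in ES, but none of them seems to meet my requirement. I also tried creating the custom analyzer below, but it did not work either.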

{
    "analysis" : {
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "letter",
                    "filter" : ["lowercase", "extract_numbers"]
                }
            },
            "filter" : {
                "extract_numbers" : {
                    "type" : "keep_types",
                    "types" : [ "<NUM>","<ALPHANUM>","word"]
                }
            }
        }
}

Please suggest how I can build a custom analyzer that meets my requirement.


    Tags: elasticsearch tokenize elasticsearch-analyzers


    【Solution 1】:

    How about using a character filter to replace the dots with spaces before the text is tokenized?

    PUT /my_index
    {                                                                                     
      "settings": {                                                                                                                                    
        "analysis": {
          "analyzer": {                                                                                                                                
            "my_analyzer": {                                                                                                                           
              "tokenizer": "standard",
              "char_filter": ["replace_dots"]
            }
          },
          "char_filter": {
            "replace_dots": {
              "type": "mapping",
              "mappings": [
                ". => \\u0020"
              ]
            }
          }
        }
      }
    }
    
    POST /my_index/_analyze
    {                                                                           
      "analyzer": "my_analyzer",                                            
      "text": "8080.foobar.getFooLabelFrombar(test.java:91)"
    }
    

    This outputs the tokens you want:

    {                                                                               
      "tokens" : [
        {
          "token" : "8080",
          "start_offset" : 0,
          "end_offset" : 4,
          "type" : "<NUM>",
          "position" : 0
        },
        {
          "token" : "foobar",
          "start_offset" : 10,
          "end_offset" : 16,
          "type" : "<ALPHANUM>",
          "position" : 2
        },
        {
          "token" : "getFooLabelFrombar",
          "start_offset" : 17,
          "end_offset" : 35,
          "type" : "<ALPHANUM>",
          "position" : 3
        },
        {
          "token" : "test",
          "start_offset" : 36,
          "end_offset" : 40,
          "type" : "<ALPHANUM>",
          "position" : 4
        },
        {
          "token" : "java",
          "start_offset" : 41,
          "end_offset" : 45,
          "type" : "<ALPHANUM>",
          "position" : 5
        },
        {
          "token" : "91",
          "start_offset" : 46,
          "end_offset" : 48,
          "type" : "<NUM>",
          "position" : 6
        }
      ]
    }
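Note that the tokens above keep their original case (`getFooLabelFrombar`), since `my_analyzer` as defined has no token filters. If case-insensitive matching is wanted (for example, a query for fooBar matching the indexed foobar), adding `"filter": ["lowercase"]` to the analyzer is one option. The sketch below is a rough Python approximation of that pipeline, not Elasticsearch itself: `replace_dots` and `analyze` are hypothetical helper names, and the regex only imitates the standard tokenizer well enough for this particular input.

```python
import re

def replace_dots(text):
    """Stand-in for the mapping char filter: turn every dot into a space."""
    return text.replace(".", " ")

def analyze(text, lowercase=False):
    """Approximate my_analyzer: char filter first, then split into alphanumeric runs."""
    # [^\W_]+ matches runs of letters and digits; this is an assumption that
    # mimics the standard tokenizer only for simple inputs like this one.
    tokens = re.findall(r"[^\W_]+", replace_dots(text))
    return [t.lower() for t in tokens] if lowercase else tokens

text = "8080.foobar.getFooLabelFrombar(test.java:91)"
print(analyze(text))
# ['8080', 'foobar', 'getFooLabelFrombar', 'test', 'java', '91']
print(analyze(text, lowercase=True))
# ['8080', 'foobar', 'getfoolabelfrombar', 'test', 'java', '91']
```

With lowercasing applied on both the indexed text and the query, `analyze("fooBar", lowercase=True)` yields `['foobar']`, which is one of the indexed tokens.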
    
