【Question title】: ElasticSearch - Problems with edgeNGram tokenizer
【Posted】: 2015-08-18 05:16:09
【Question】:

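I use ElasticSearch to index a database. I am trying to use the edgeNGram tokenizer to cut strings into substrings, with the requirement that each new string must be longer than 4 characters. I create the index with the following code: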

PUT test
POST /test/_close

PUT /test/_settings
{
  "analysis": {
    "analyzer": {
      "index_edge_ngram": {
        "type": "custom",
        "filter": ["custom_word_delimiter"],
        "tokenizer": "left_tokenizer"
      }
    },
    "filter": {
      "custom_word_delimiter": {
        "type": "word_delimiter",
        "generate_word_parts": "true",
        "generate_number_parts": "true",
        "catenate_words": "false",
        "catenate_numbers": "false",
        "catenate_all": "false",
        "split_on_case_change": "false",
        "preserve_original": "false",
        "split_on_numerics": "true",
        "ignore_case": "true"
      }
    },
    "tokenizer": {
      "left_tokenizer": {
        "max_gram": 30,
        "min_gram": 5,
        "type": "edgeNGram"
      }
    }
  }
}

POST /test/_open

Now I run a test to get an overview of the result:

GET /test/_analyze?analyzer=index_edge_ngram&text=please pay for multiple wins with only one payment

and get this result:

{
   "tokens": [
      {
         "token": "pleas",
         "start_offset": 0,
         "end_offset": 5,
         "type": "word",
         "position": 1
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 2
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 3
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 4
      },
      {
         "token": "p",
         "start_offset": 7,
         "end_offset": 8,
         "type": "word",
         "position": 5
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 6
      },
      {
         "token": "pa",
         "start_offset": 7,
         "end_offset": 9,
         "type": "word",
         "position": 7
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 8
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 9
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 10
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 11
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 12
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 13
      },
      {
         "token": "f",
         "start_offset": 11,
         "end_offset": 12,
         "type": "word",
         "position": 14
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 15
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 16
      },
      {
         "token": "fo",
         "start_offset": 11,
         "end_offset": 13,
         "type": "word",
         "position": 17
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 18
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 19
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 20
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 21
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 22
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 23
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 24
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 25
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 26
      },
      {
         "token": "m",
         "start_offset": 15,
         "end_offset": 16,
         "type": "word",
         "position": 27
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 28
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 29
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 30
      },
      {
         "token": "mu",
         "start_offset": 15,
         "end_offset": 17,
         "type": "word",
         "position": 31
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 32
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 33
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 34
      },
      {
         "token": "mul",
         "start_offset": 15,
         "end_offset": 18,
         "type": "word",
         "position": 35
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 36
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 37
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 38
      },
      {
         "token": "mult",
         "start_offset": 15,
         "end_offset": 19,
         "type": "word",
         "position": 39
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 40
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 41
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 42
      },
      {
         "token": "multi",
         "start_offset": 15,
         "end_offset": 20,
         "type": "word",
         "position": 43
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 44
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 45
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 46
      },
      {
         "token": "multip",
         "start_offset": 15,
         "end_offset": 21,
         "type": "word",
         "position": 47
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 48
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 49
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 50
      },
      {
         "token": "multipl",
         "start_offset": 15,
         "end_offset": 22,
         "type": "word",
         "position": 51
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 52
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 53
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 54
      },
      {
         "token": "multiple",
         "start_offset": 15,
         "end_offset": 23,
         "type": "word",
         "position": 55
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 56
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 57
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 58
      },
      {
         "token": "multiple",
         "start_offset": 15,
         "end_offset": 23,
         "type": "word",
         "position": 59
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 60
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 61
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 62
      },
      {
         "token": "multiple",
         "start_offset": 15,
         "end_offset": 23,
         "type": "word",
         "position": 63
      },
      {
         "token": "w",
         "start_offset": 24,
         "end_offset": 25,
         "type": "word",
         "position": 64
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 65
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 66
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 67
      },
      {
         "token": "multiple",
         "start_offset": 15,
         "end_offset": 23,
         "type": "word",
         "position": 68
      },
      {
         "token": "wi",
         "start_offset": 24,
         "end_offset": 26,
         "type": "word",
         "position": 69
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 70
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 71
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 72
      },
      {
         "token": "multiple",
         "start_offset": 15,
         "end_offset": 23,
         "type": "word",
         "position": 73
      },
      {
         "token": "win",
         "start_offset": 24,
         "end_offset": 27,
         "type": "word",
         "position": 74
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 75
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 76
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 77
      },
      {
         "token": "multiple",
         "start_offset": 15,
         "end_offset": 23,
         "type": "word",
         "position": 78
      },
      {
         "token": "wins",
         "start_offset": 24,
         "end_offset": 28,
         "type": "word",
         "position": 79
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 80
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 81
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 82
      },
      {
         "token": "multiple",
         "start_offset": 15,
         "end_offset": 23,
         "type": "word",
         "position": 83
      },
      {
         "token": "wins",
         "start_offset": 24,
         "end_offset": 28,
         "type": "word",
         "position": 84
      },
      {
         "token": "please",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 85
      },
      {
         "token": "pay",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 86
      },
      {
         "token": "for",
         "start_offset": 11,
         "end_offset": 14,
         "type": "word",
         "position": 87
      },
      {
         "token": "multiple",
         "start_offset": 15,
         "end_offset": 23,
         "type": "word",
         "position": 88
      },
      {
         "token": "wins",
         "start_offset": 24,
         "end_offset": 28,
         "type": "word",
         "position": 89
      },
      {
         "token": "w",
         "start_offset": 29,
         "end_offset": 30,
         "type": "word",
         "position": 90
      }
   ]
}

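Here are my questions:

  1. Why are there tokens shorter than 5 characters?

  2. Why does the "position" property show the position of the token rather than the position of the word in the text? It seems that other tokenizers work that way.

  3. Why are not all of the words in the output? It looks like it stops at "wins".

  4. Why are there so many repetitions of the same token?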

【Comments】:

    Tags: elasticsearch tokenize


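    【Answer 1】:

    When building a custom analyzer, it is worth checking step by step what each stage of the analysis chain produces:

    1. First, character filters (if any are configured) preprocess the raw input text
    2. Then the tokenizer slices and dices that text into tokens
    3. Finally, the token filters take the tokens from step 2 as input and do their thing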

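    In your case, if you inspect the result of the tokenizer stage on its own, it looks like the following. Note that we only pass the tokenizer (i.e. left_tokenizer) as a parameter.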

     curl -XGET 'localhost:9201/test/_analyze?tokenizer=left_tokenizer&pretty' -d 'please pay for multiple wins with only one payment'
    

    The result is:

    {
      "tokens" : [ {
        "token" : "pleas",
        "start_offset" : 0,
        "end_offset" : 5,
        "type" : "word",
        "position" : 1
      }, {
        "token" : "please",
        "start_offset" : 0,
        "end_offset" : 6,
        "type" : "word",
        "position" : 2
      }, {
        "token" : "please ",
        "start_offset" : 0,
        "end_offset" : 7,
        "type" : "word",
        "position" : 3
      }, {
        "token" : "please p",
        "start_offset" : 0,
        "end_offset" : 8,
        "type" : "word",
        "position" : 4
      }, {
        "token" : "please pa",
        "start_offset" : 0,
        "end_offset" : 9,
        "type" : "word",
        "position" : 5
      }, {
        "token" : "please pay",
        "start_offset" : 0,
        "end_offset" : 10,
        "type" : "word",
        "position" : 6
      }, {
        "token" : "please pay ",
        "start_offset" : 0,
        "end_offset" : 11,
        "type" : "word",
        "position" : 7
      }, {
        "token" : "please pay f",
        "start_offset" : 0,
        "end_offset" : 12,
        "type" : "word",
        "position" : 8
      }, {
        "token" : "please pay fo",
        "start_offset" : 0,
        "end_offset" : 13,
        "type" : "word",
        "position" : 9
      }, {
        "token" : "please pay for",
        "start_offset" : 0,
        "end_offset" : 14,
        "type" : "word",
        "position" : 10
      }, {
        "token" : "please pay for ",
        "start_offset" : 0,
        "end_offset" : 15,
        "type" : "word",
        "position" : 11
      }, {
        "token" : "please pay for m",
        "start_offset" : 0,
        "end_offset" : 16,
        "type" : "word",
        "position" : 12
      }, {
        "token" : "please pay for mu",
        "start_offset" : 0,
        "end_offset" : 17,
        "type" : "word",
        "position" : 13
      }, {
        "token" : "please pay for mul",
        "start_offset" : 0,
        "end_offset" : 18,
        "type" : "word",
        "position" : 14
      }, {
        "token" : "please pay for mult",
        "start_offset" : 0,
        "end_offset" : 19,
        "type" : "word",
        "position" : 15
      }, {
        "token" : "please pay for multi",
        "start_offset" : 0,
        "end_offset" : 20,
        "type" : "word",
        "position" : 16
      }, {
        "token" : "please pay for multip",
        "start_offset" : 0,
        "end_offset" : 21,
        "type" : "word",
        "position" : 17
      }, {
        "token" : "please pay for multipl",
        "start_offset" : 0,
        "end_offset" : 22,
        "type" : "word",
        "position" : 18
      }, {
        "token" : "please pay for multiple",
        "start_offset" : 0,
        "end_offset" : 23,
        "type" : "word",
        "position" : 19
      }, {
        "token" : "please pay for multiple ",
        "start_offset" : 0,
        "end_offset" : 24,
        "type" : "word",
        "position" : 20
      }, {
        "token" : "please pay for multiple w",
        "start_offset" : 0,
        "end_offset" : 25,
        "type" : "word",
        "position" : 21
      }, {
        "token" : "please pay for multiple wi",
        "start_offset" : 0,
        "end_offset" : 26,
        "type" : "word",
        "position" : 22
      }, {
        "token" : "please pay for multiple win",
        "start_offset" : 0,
        "end_offset" : 27,
        "type" : "word",
        "position" : 23
      }, {
        "token" : "please pay for multiple wins",
        "start_offset" : 0,
        "end_offset" : 28,
        "type" : "word",
        "position" : 24
      }, {
        "token" : "please pay for multiple wins ",
        "start_offset" : 0,
        "end_offset" : 29,
        "type" : "word",
        "position" : 25
      }, {
        "token" : "please pay for multiple wins w",
        "start_offset" : 0,
        "end_offset" : 30,
        "type" : "word",
        "position" : 26
      } ]
    }
    

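    Your token filter then takes each of the tokens above and does its job. For example,

    • the first token "pleas" comes out as "pleas"
    • the second token "please" comes out as "please"
    • the third token "please " (note the trailing space) comes out as "please"
    • the fourth token "please p" comes out as two tokens, "please" and "p"
    • the fifth token "please pa" comes out as two tokens, "please" and "pa"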

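    So your left_tokenizer treats the whole sentence as a single input and produces edge n-grams of it from 5 up to 30 characters long. The 30th character falls just after "wins", which is why the output stops there (this answers question 3).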
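    The tokenizer stage can be imitated in a few lines of Python. This is only a sketch of the behaviour described above, not Elasticsearch code; the helper name edge_ngrams is made up for illustration, and it assumes the tokenizer simply emits every prefix of the input between min_gram and max_gram characters.

```python
def edge_ngrams(text, min_gram=5, max_gram=30):
    """Return the prefixes of text whose length is min_gram..max_gram (inclusive)."""
    longest = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, longest + 1)]


text = "please pay for multiple wins with only one payment"
grams = edge_ngrams(text)

print(len(grams))        # 26, one token per length from 5 to 30
print(repr(grams[0]))    # 'pleas'
print(repr(grams[-1]))   # 'please pay for multiple wins w'
```

    The last prefix is exactly 30 characters long, so nothing after the "w" of "with" is ever seen by the token filter.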

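    As you can see above, some tokens are repeated because the word_delimiter token filter processes each token coming from the tokenizer in isolation. That explains the "duplicates" (answers question 4) and the tokens shorter than 5 characters (answers question 1).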
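    To see where the duplicates and short tokens come from, the two stages can be chained in a rough Python simulation. The helpers edge_ngrams and split_words are hypothetical, and split_words is a deliberately crude stand-in for word_delimiter: it only splits on non-alphanumeric characters, which is enough for this input but ignores the filter's other options.

```python
import re
from collections import Counter


def edge_ngrams(text, min_gram=5, max_gram=30):
    """Prefixes of text with length min_gram..max_gram (tokenizer stage)."""
    longest = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, longest + 1)]


def split_words(token):
    """Crude approximation of word_delimiter: split on non-alphanumerics."""
    return re.findall(r"[A-Za-z0-9]+", token)


text = "please pay for multiple wins with only one payment"

# Each n-gram is split on its own, so earlier words are emitted again and again.
tokens = [part for gram in edge_ngrams(text) for part in split_words(gram)]
counts = Counter(tokens)

print(len(tokens))        # 90, the same number of tokens as in the question
print(tokens[:5])         # ['pleas', 'please', 'please', 'please', 'p']
print(counts["please"])   # 25, once for every n-gram of length 6 to 30
```

    The min_gram of 5 applies to the n-grams of the whole sentence, not to the pieces word_delimiter cuts out of them afterwards, so pieces such as "p", "pa" and "pay" pass through.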

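    I don't think this is how you want it to work, but it is not clear from your question how you do want it to work, i.e. what kind of searches you want to be able to run. All I am providing here is an explanation of what you are seeing.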
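    Purely as an illustration, and assuming the intent was prefixes of each word rather than of the whole sentence (the question does not say), this is what per-word edge n-grams would look like. The helper per_word_edge_ngrams is hypothetical and splits on whitespace only. In Elasticsearch itself, the edge n-gram tokenizer's token_chars setting is the option to look at for this; check the documentation for your version.

```python
def per_word_edge_ngrams(text, min_gram=5, max_gram=30):
    """Emit prefixes of length min_gram..max_gram for each whitespace-separated word."""
    grams = []
    for word in text.split():
        longest = min(max_gram, len(word))
        # Words shorter than min_gram produce an empty range and are skipped.
        grams.extend(word[:n] for n in range(min_gram, longest + 1))
    return grams


text = "please pay for multiple wins with only one payment"
per_word = per_word_edge_ngrams(text)
print(per_word)
# ['pleas', 'please', 'multi', 'multip', 'multipl', 'multiple',
#  'payme', 'paymen', 'payment']
```

    Note that in this sketch words shorter than 5 characters ("pay", "for", "wins", "with", "only", "one") yield no tokens at all, which may or may not be what you want.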

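    【Comments】:

    • Thank you, sir! Best answer! You opened my eyes. Could you help me with the second question?
    • Glad I could help. As for question 2, as you noticed, position is simply the position of the token in the token list. However, start_offset and end_offset give you the position of that token in the input text.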
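    A small Python sketch of how the offsets can be used. The three token entries are copied from the _analyze output in the question; the word-index calculation is an assumption of this example (words separated by whitespace) and not something Elasticsearch returns.

```python
text = "please pay for multiple wins with only one payment"

# A few entries copied from the _analyze response in the question.
tokens = [
    {"token": "pay", "start_offset": 7, "end_offset": 10, "position": 9},
    {"token": "multi", "start_offset": 15, "end_offset": 20, "position": 43},
    {"token": "wins", "start_offset": 24, "end_offset": 28, "position": 79},
]

word_indexes = []
for t in tokens:
    # The offsets slice the token straight out of the original text.
    snippet = text[t["start_offset"]:t["end_offset"]]
    # Zero-based index of the word the token starts in (whitespace-separated).
    word_index = len(text[:t["start_offset"]].split())
    word_indexes.append(word_index)
    print(t["position"], repr(snippet), word_index)
```

    Here position keeps growing with every emitted token (9, 43, 79), while the offsets still point at the second, fourth and fifth word of the sentence.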