【Question title】: Elasticsearch multi field fuzzy search not returning exact match first
【Posted】: 2013-08-04 03:43:27
【Question】:

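I am running a fuzzy Elasticsearch query against the "text" and "keywords" fields. I have two documents in Elasticsearch, one whose "text" is "testPhone 5" and another whose "text" is "testPhone 4s". When I run a fuzzy query for "testPhone 5", both documents get exactly the same score. Why does this happen?

Additional info: I index the documents with the "uax_url_email" tokenizer and the "lowercase" filter.

Here is the query I am running: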

{
    query : {
        bool: {
            // match one or the other fuzzy query
            should: [
                {
                    fuzzy: {
                        text: {
                            min_similarity: 0.4,
                            value: 'testphone 5',
                            prefix_length: 0,
                            boost: 5,
                        }
                    }
                },
                {
                    fuzzy: {
                        keywords: {
                            min_similarity: 0.4,
                            value: 'testphone 5',
                            prefix_length: 0,
                            boost: 1,
                        }
                    }
                }
            ]
        }
    },
    sort: [ 
        '_score'
    ],
    explain: true
}

Here is the result:

{ max_score: 0.47213298,
  total: 2,
  hits:
  [ { _index: 'test',
     _shard: 0,
     _id: '51fbf95f82e89ae8c300002c',
     _node: '0Mtfzbe1RDinU71Ordx-Ag',
     _source:
    { next: { id: '51fbf95f82e89ae8c3000027' },
      cards: [ '51fbf95f82e89ae8c3000027', [length]: 1 ],
      other: false,
      _id: '51fbf95f82e89ae8c300002c',
      category: '51fbf95f82e89ae8c300002b',
      image: 'https://s3.amazonaws.com/sold_category_icons/Smartphones.png',
      text: 'testPhone 5',
      keywords: [ [length]: 0 ],
      __v: 0 },
   _type: 'productgroup',
   _explanation:
    { details:
       [ { details:
            [ { details:
                 [ { details:
                      [ { details:
                           [ { value: 3.8888888, description: 'boost' },
                             { value: 1.5108256,
                               description: 'idf(docFreq=2, maxDocs=5)' },
                             { value: 0.17020021,
                               description: 'queryNorm' },
                             [length]: 3 ],
                          value: 0.99999994,
                          description: 'queryWeight, product of:' },
                        { details:
                           [ { details:
                                [ { value: 1, description: 'termFreq=1.0' },
                                  [length]: 1 ],
                               value: 1,
                               description: 'tf(freq=1.0), with freq of:' },
                             { value: 1.5108256,
                               description: 'idf(docFreq=2, maxDocs=5)' },
                             { value: 0.625,
                               description: 'fieldNorm(doc=0)' },
                             [length]: 3 ],
                          value: 0.944266,
                          description: 'fieldWeight in 0, product of:' },
                        [length]: 2 ],
                     value: 0.94426596,
                     description: 'score(doc=0,freq=1.0 = termFreq=1.0\n), product of:' },
                   [length]: 1 ],
                value: 0.94426596,
                description: 'weight(text:testphone^3.8888888 in 0) [PerFieldSimilarity], result of:' },
              [length]: 1 ],
           value: 0.94426596,
           description: 'sum of:' },
         { value: 0.5, description: 'coord(1/2)' },
         [length]: 2 ],
      value: 0.47213298,
      description: 'product of:' },
   _score: 0.47213298 },
 { _index: 'test',
   _shard: 4,
   _id: '51fbf95f82e89ae8c300002d',
   _node: '0Mtfzbe1RDinU71Ordx-Ag',
   _source:
    { next: { id: '51fbf95f82e89ae8c3000027' },
      cards: [ '51fbf95f82e89ae8c3000029', [length]: 1 ],
      other: false,
      _id: '51fbf95f82e89ae8c300002d',
      category: '51fbf95f82e89ae8c300002b',
      image: 'https://s3.amazonaws.com/sold_category_icons/Smartphones.png',
      text: 'testPhone 4s',
      keywords: [ 'apple', [length]: 1 ],
      __v: 0 },
   _type: 'productgroup',
   _explanation:
    { details:
       [ { details:
            [ { details:
                 [ { details:
                      [ { details:
                           [ { value: 3.8888888, description: 'boost' },
                             { value: 1.5108256,
                               description: 'idf(docFreq=2, maxDocs=5)' },
                             { value: 0.17020021,
                               description: 'queryNorm' },
                             [length]: 3 ],
                          value: 0.99999994,
                          description: 'queryWeight, product of:' },
                        { details:
                           [ { details:
                                [ { value: 1, description: 'termFreq=1.0' },
                                  [length]: 1 ],
                               value: 1,
                               description: 'tf(freq=1.0), with freq of:' },
                             { value: 1.5108256,
                               description: 'idf(docFreq=2, maxDocs=5)' },
                             { value: 0.625,
                               description: 'fieldNorm(doc=0)' },
                             [length]: 3 ],
                          value: 0.944266,
                          description: 'fieldWeight in 0, product of:' },
                        [length]: 2 ],
                     value: 0.94426596,
                     description: 'score(doc=0,freq=1.0 = termFreq=1.0\n), product of:' },
                   [length]: 1 ],
                value: 0.94426596,
                description: 'weight(text:testphone^3.8888888 in 0) [PerFieldSimilarity], result of:' },
              [length]: 1 ],
           value: 0.94426596,
           description: 'sum of:' },
         { value: 0.5, description: 'coord(1/2)' },
         [length]: 2 ],
      value: 0.47213298,
      description: 'product of:' },
   _score: 0.47213298 },
 [length]: 2 ] }

【Comments】:

    Tags: javascript elasticsearch


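    【Answer 1】:

    Fuzzy queries are not analyzed, but the field is. The un-analyzed query string is therefore compared against the indexed tokens, and in both documents it matches the token testphone, as this line of the explain output shows: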

    description: 'weight(text:testphone^3.8888888 in 0) [PerFieldSimilarity], result of:' },
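The numbers in the explain output can be checked by hand. The sketch below is a rough reconstruction, not Lucene's actual code: it assumes the effective boost is the query boost multiplied by a similarity of 1 - edits / min(term lengths), an assumption that happens to reproduce the 3.8888888 shown in the question. All other values are copied from the explain output.

```javascript
// The un-analyzed query term and the indexed token it matches in BOTH documents.
const queryTerm = 'testphone 5';
const indexedTerm = 'testphone';

// "testphone 5" -> "testphone": delete " " and "5", so two edits.
const edits = 2;

// ASSUMPTION: similarity = 1 - edits / min(lengths); this formula is inferred
// from the explain output, not quoted from the Elasticsearch documentation.
const similarity = 1 - edits / Math.min(queryTerm.length, indexedTerm.length);

const queryBoost = 5; // boost on the "text" clause of the query
const effectiveBoost = queryBoost * similarity;

// Values copied from the explain output in the question.
const idf = 1.5108256;
const queryNorm = 0.17020021;
const tf = 1;
const fieldNorm = 0.625;
const coord = 0.5; // only one of the two "should" clauses matched

const queryWeight = effectiveBoost * idf * queryNorm;
const fieldWeight = tf * idf * fieldNorm;
const score = queryWeight * fieldWeight * coord;

console.log(effectiveBoost.toFixed(4)); // 3.8889
console.log(score.toFixed(4));          // 0.4721
```

Every input to this calculation is the same for "testPhone 5" and "testPhone 4s", because only the token testphone takes part in the match. That is why both documents end up with a score of 0.47213298.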

    See also @imotov's excellent answer: ElasticSearch's Fuzzy Query

    You can use the _analyze API to see exactly how a string is tokenized:

    http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-analyze.html

    http://localhost:9200/prefix_test/_analyze?field=text&text=testphone+5

    which returns:

    {
       "tokens": [
          {
             "token": "testphone",
             "start_offset": 0,
             "end_offset": 9,
             "type": "<ALPHANUM>",
             "position": 1
          },
          {
             "token": "5",
             "start_offset": 10,
             "end_offset": 11,
             "type": "<NUM>",
             "position": 2
          }
       ]
    }
    

    So even if you had indexed the value testphone sammsung, a fuzzy query for "testphone samsunk" would return no results, whereas a fuzzy query for samsunk would.
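The behaviour described above can be imitated with a small model. This is only a sketch under stated assumptions: `analyze` is a hypothetical stand-in for the real analyzer (whitespace split plus lowercase), and `fuzzyMatches` uses the simplified similarity 1 - edits / min(lengths) with the 0.4 threshold from the question rather than Lucene's actual implementation.

```javascript
// Levenshtein edit distance (classic dynamic programming, two rows).
function editDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// ASSUMPTION: rough approximation of the index-time analyzer.
function analyze(text) {
  return text.toLowerCase().split(/\s+/).filter(Boolean);
}

// The query term is NOT analyzed: it is compared as one string
// against each indexed token. Returns the tokens that match.
function fuzzyMatches(queryTerm, indexedText, minSimilarity = 0.4) {
  return analyze(indexedText).filter((token) => {
    const edits = editDistance(queryTerm, token);
    const sim = 1 - edits / Math.min(queryTerm.length, token.length);
    return sim >= minSimilarity;
  });
}

// The whole phrase is too far from either token: no match.
console.log(fuzzyMatches('testphone samsunk', 'testphone sammsung'));
// The single word is two edits away from "sammsung": match.
console.log(fuzzyMatches('samsunk', 'testphone sammsung'));
// Both documents from the question match on the same token, "testphone".
console.log(fuzzyMatches('testphone 5', 'testPhone 5'));
console.log(fuzzyMatches('testphone 5', 'testPhone 4s'));
```

In this model the trailing "5" of the query never helps the first document, because the indexed token "5" is far too short to be similar to the full string "testphone 5".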

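    You may get better results by not analyzing the field (or by using the keyword analyzer).

    If you want to analyze a single field in several different ways, you can use the multi_field construct.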

    http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-multi-field-type.html
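A multi_field mapping for the "text" field might look roughly like the sketch below. It uses the legacy multi_field syntax that the link above refers to. The analyzer name `my_text_analyzer` and the sub-field name `raw` are hypothetical, chosen only for illustration.

```javascript
// Sketch of a legacy multi_field mapping for the "productgroup" type.
// ASSUMPTION: "my_text_analyzer" and "raw" are made-up names.
const mapping = {
  productgroup: {
    properties: {
      text: {
        type: 'multi_field',
        fields: {
          // Default sub-field: same name as the parent, analyzed as before.
          text: { type: 'string', analyzer: 'my_text_analyzer' },
          // Extra sub-field: the whole value is indexed as one term.
          raw: { type: 'string', index: 'not_analyzed' }
        }
      }
    }
  }
};

// A fuzzy query can then target the un-analyzed sub-field as "text.raw".
// Note that a not_analyzed field is not lowercased, so case matters here.
const query = {
  query: {
    fuzzy: {
      'text.raw': { value: 'testPhone 5', min_similarity: 0.4 }
    }
  }
};

console.log(JSON.stringify({ mapping, query }, null, 2));
```

With the whole value indexed as a single term, "testPhone 5" is zero edits away from the first document and two edits away from "testPhone 4s", so the two documents would no longer be forced to the same score.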

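    【Comments】:

      【Answer 2】:

      I ran into this problem myself recently. I can't tell you exactly why it happens, but I can tell you how I solved it:

      I ran two queries against the same field: one exact match, and then the very same query on the same field with fuzzy matching enabled and a lower boost.

      This ensures that my exact matches always end up above the fuzzy matches.

      P.S. I think they score equally because, with fuzziness enabled, both documents match, and ES does not care that one of them is an exact match as long as both match. This is pure speculation on my part, though, since I am not very familiar with the scoring algorithm.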
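The two-query approach described in this answer could be sketched as follows. The answer does not show its actual query, so this is an assumption-laden reconstruction: the helper name `buildExactThenFuzzyQuery`, the use of `match` clauses, the boost values, and `fuzziness: 1` are all illustrative, and the values accepted by `fuzziness` vary between Elasticsearch versions.

```javascript
// Build a bool query with two "should" clauses on the same field:
// a plain match with a high boost and a fuzzy match with a lower boost.
// ASSUMPTION: boosts 10 and 1 and fuzziness 1 are illustrative values.
function buildExactThenFuzzyQuery(field, value) {
  return {
    query: {
      bool: {
        should: [
          // Non-fuzzy clause: only documents containing the exact terms score here.
          { match: { [field]: { query: value, boost: 10 } } },
          // Fuzzy clause: tolerates small edits, contributes less to the score.
          { match: { [field]: { query: value, fuzziness: 1, boost: 1 } } }
        ]
      }
    }
  };
}

const body = buildExactThenFuzzyQuery('text', 'testphone 5');
console.log(JSON.stringify(body, null, 2));
```

A document that satisfies both clauses collects both contributions, while a document that only matches fuzzily collects the smaller one, which is what pushes exact matches to the top.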

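      【Comments】:

      • Thanks for answering! Unfortunately this does not work for me: I want a fuzzy search, so I need the documents returned in order of similarity. Your strategy does not fit my use case, because I need the fuzzy matches themselves ordered correctly by score (not just the exact match first).
      • I don't suppose you have an example of this solution? I am trying the same thing in gist.github.com/rsmarshall/4c2d43ff859dfebb9faa but cannot get it to work.
      • @rsmarsha: You should try using "should" instead of "must".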