【Question title】: ElasticSearch aggregations - to lowercase or not to lowercase
【Posted】: 2015-12-11 01:42:08
【Question】:

Consider the following scenario:

Define the mapping

PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "city": {
          "type": "string",
          "fields": {
            "raw": { 
              "type":  "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}

Add data

PUT /my_index/my_type/1
{
  "city": "New York"
}

PUT /my_index/my_type/2
{
  "city": "York"
}

PUT /my_index/my_type/3
{
  "city": "york"
}

Query for facets

GET /my_index/_search
{
  "size": 0, 
  "aggs": {
    "Cities": {
      "terms": {
        "field": "city.raw" 
      }
    }
  }
}

Result

{
...
  "aggregations": {
    "Cities": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "New York",
          "doc_count": 1
        },
        {
          "key": "York",
          "doc_count": 1
        },
        {
          "key": "york",
          "doc_count": 1
        }
      ]
    }
  }
}

The dilemma

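I want two things:

  1. “York” and “york” should be combined, so instead of 3 buckets with 1 hit each, I would get 2 buckets: one for “New York (1)” and one for “York (2)”.
  2. The casing of the city names must be preserved. I don't want the facet values to be all lowercase.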

Dream result

{
    ...
      "aggregations": {
        "Cities": {
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 0,
          "buckets": [
            {
              "key": "New York",
              "doc_count": 1
            },
            {
              "key": "York",
              "doc_count": 2
            }
          ]
        }
      }
    }

【Comments】:

    Tags: elasticsearch


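    【Answer 1】:

    This makes your client code slightly more complicated, but you can always do something like the following.

    Set up the index with an additional sub-field that is only lowercased (not split on whitespace):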

    PUT /my_index
    {
       "settings": {
          "analysis": {
             "analyzer": {
                "lowercase_analyzer": {
                   "type": "custom",
                   "tokenizer": "keyword",
                   "filter": [
                      "lowercase"
                   ]
                }
             }
          }
       },
       "mappings": {
          "my_type": {
             "properties": {
                "city": {
                   "type": "string",
                   "fields": {
                      "lowercase": {
                         "type": "string",
                         "analyzer": "lowercase_analyzer"
                      },
                      "raw": {
                         "type": "string",
                         "index": "not_analyzed"
                      }
                   }
                }
             }
          }
       }
    }
    
    PUT /my_index/my_type/_bulk
    {"index":{"_id":1}}
    {"city":"New York"}
    {"index":{"_id":2}}
    {"city":"York"}
    {"index":{"_id":3}}
    {"city":"york"}
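
To make the mapping above more concrete, here is a minimal Python sketch of the term each sub-field would roughly end up holding for the three documents. The helper name `subfield_terms` is hypothetical, and `str.lower()` only approximates Elasticsearch's `lowercase` token filter (the two may differ for some non-ASCII characters); nothing here talks to a cluster.

```python
from collections import defaultdict

def subfield_terms(city: str) -> dict:
    """Approximate the single term each sub-field would index for one value."""
    return {
        # not_analyzed: the value is indexed verbatim as one term
        "city.raw": city,
        # keyword tokenizer keeps the whole value as one token, then the
        # lowercase filter lowercases it (str.lower is an approximation)
        "city.lowercase": city.lower(),
    }

# The three documents from the bulk request above
cities = ["New York", "York", "york"]

# Group the raw terms by their lowercase term, the way a terms
# aggregation on city.lowercase would bucket the documents
groups = defaultdict(list)
for city in cities:
    terms = subfield_terms(city)
    groups[terms["city.lowercase"]].append(terms["city.raw"])

print(dict(groups))
# {'new york': ['New York'], 'york': ['York', 'york']}
```

Because the `keyword` tokenizer does not split on whitespace, "New York" stays a single term ("new york") instead of becoming "new" and "york".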
    

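    Then use a two-level aggregation like this one, where the inner aggregation sorts its terms in ascending order (so the capitalized term comes first, because uppercase letters sort before lowercase ones) and returns only the top raw term for each lowercase term: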

    GET /my_index/_search
    {
       "size": 0,
       "aggs": {
          "city_lowercase": {
             "terms": {
                "field": "city.lowercase"
             },
             "aggs": {
                "city_terms": {
                   "terms": {
                      "field": "city.raw",
                      "order" : { "_term" : "asc" },
                      "size": 1
                   }
                }
             }
          }
       }
    }
    

    which returns:

    {
       "took": 5,
       "timed_out": false,
       "_shards": {
          "total": 5,
          "successful": 5,
          "failed": 0
       },
       "hits": {
          "total": 3,
          "max_score": 0,
          "hits": []
       },
       "aggregations": {
          "city_lowercase": {
             "doc_count_error_upper_bound": 0,
             "sum_other_doc_count": 0,
             "buckets": [
                {
                   "key": "york",
                   "doc_count": 2,
                   "city_terms": {
                      "doc_count_error_upper_bound": 0,
                      "sum_other_doc_count": 1,
                      "buckets": [
                         {
                            "key": "York",
                            "doc_count": 1
                         }
                      ]
                   }
                },
                {
                   "key": "new york",
                   "doc_count": 1,
                   "city_terms": {
                      "doc_count_error_upper_bound": 0,
                      "sum_other_doc_count": 0,
                      "buckets": [
                         {
                            "key": "New York",
                            "doc_count": 1
                         }
                      ]
                   }
                }
             ]
          }
       }
    }
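
The extra client-side step mentioned at the start of this answer could look roughly like the Python sketch below. It assumes the response has already been parsed into a dict with the shape shown above; the function name `flatten_city_buckets` and the trimmed `sample` dict are illustrative only and not part of any Elasticsearch client.

```python
from typing import Any, Dict, List

def flatten_city_buckets(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Merge each lowercase bucket with its top raw term into one display bucket."""
    flattened = []
    for outer in response["aggregations"]["city_lowercase"]["buckets"]:
        inner = outer["city_terms"]["buckets"]
        # Use the first raw term as the display key; fall back to the
        # lowercase key if the sub-aggregation returned no buckets.
        display_key = inner[0]["key"] if inner else outer["key"]
        # The count comes from the outer bucket, which covers all case variants.
        flattened.append({"key": display_key, "doc_count": outer["doc_count"]})
    return flattened

# Trimmed copy of the response above (only the fields the function reads)
sample = {
    "aggregations": {
        "city_lowercase": {
            "buckets": [
                {"key": "york", "doc_count": 2,
                 "city_terms": {"buckets": [{"key": "York", "doc_count": 1}]}},
                {"key": "new york", "doc_count": 1,
                 "city_terms": {"buckets": [{"key": "New York", "doc_count": 1}]}},
            ]
        }
    }
}

print(flatten_city_buckets(sample))
# [{'key': 'York', 'doc_count': 2}, {'key': 'New York', 'doc_count': 1}]
```

The buckets keep the order of the outer aggregation (highest `doc_count` first), so sort them on the client if a different order is wanted. Also note that the ascending sort picks whichever variant sorts first, so a value such as "YORK" would win over "York" if both were indexed.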
    

    Here is the code I used (with a few more example documents):

    http://sense.qbox.io/gist/f3781d58fbaadcc1585c30ebb087108d2752dfff

    【Comments】:

    • I can't access the sense.qbox.io URL with the example documents. Please provide a working link.