[Question title]: Elasticsearch: Querying nested objects
[Posted]: 2018-09-04 05:55:14
[Question]:

Dear Elasticsearch experts,
I have a problem querying nested objects. Let's use the following simplified mapping:

{
  "mappings" : {
    "_doc" : {
      "properties" : {
        "companies" : {
          "type": "nested",
          "properties" : {
            "company_id": { "type": "long" },
            "name": { "type": "text" }
          }
        },
        "title": { "type": "text" }
      }
    }
  }
}

And put some documents into the index:

PUT my_index/_doc/1
{
  "title" : "CPU release",
  "companies" : [
    { "company_id" : 1, "name" :  "AMD" },
    { "company_id" : 2, "name" :  "Intel" }
  ]
}

PUT my_index/_doc/2
{
  "title" : "GPU release 2018-01-10",
  "companies" : [
    { "company_id" : 1, "name" :  "AMD" },
    { "company_id" : 3, "name" :  "Nvidia" }
  ]
}

PUT my_index/_doc/3
{
  "title" : "GPU release 2018-03-01",
  "companies" : [
    { "company_id" : 3, "name" :  "Nvidia" }
  ]
}

PUT my_index/_doc/4
{
  "title" : "Chipset release",
  "companies" : [
    { "company_id" : 2, "name" :  "Intel" }
  ]
}

Now I would like to run a query like this:

{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "GPU" } },
        { "nested": {
            "path": "companies",
            "query": {
              "bool": {
                "must": [
                  { "match": { "companies.name": "AMD" } }
                ]
              }
            },
            "inner_hits" : {}
          }
        }
      ]
    }
  }
}

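As a result, I would like to get the matching companies together with the number of matching documents. So the query above should give me:

[
  { "company_id" : 1, "name" : "AMD", "matched_documents": 1 }
]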

The following query:

{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "GPU" } },
        { "nested": {
            "path": "companies",
            "query": { "match_all": {} },
            "inner_hits" : {}
          }
        }
      ]
    }
  }
}

should give me all companies assigned to documents whose title contains "GPU", together with the number of matching documents:

[
  { "company_id" : 1, "name" : "AMD", "matched_documents": 1 },
  { "company_id" : 3, "name" : "Nvidia", "matched_documents": 2 }
]

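Is it possible to achieve this result with good performance? I am explicitly not interested in the matching documents themselves, only in the number of matching documents and in the nested objects.

Thanks for your help.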

[Comments]:

    Tags: elasticsearch


    [Solution 1]:

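    What you need to do on the Elasticsearch side is:

    1. filter the "parent" documents by the desired criteria (for example, GPU in the title, or Nvidia mentioned in the companies list);
    2. group the "nested" documents into buckets by a certain criterion (for example, company_id);
    3. count how many "nested" documents there are in each bucket.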
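
    To make the three steps above concrete, here is a minimal sketch in plain Python that mimics them on the question's four sample documents. It does not talk to Elasticsearch; the helper name count_nested_by_company and the simple whitespace tokenization of the title are assumptions made for illustration only.

```python
from collections import Counter

# The four sample documents from the question, as plain Python dicts.
DOCS = [
    {"title": "CPU release", "companies": [
        {"company_id": 1, "name": "AMD"}, {"company_id": 2, "name": "Intel"}]},
    {"title": "GPU release 2018-01-10", "companies": [
        {"company_id": 1, "name": "AMD"}, {"company_id": 3, "name": "Nvidia"}]},
    {"title": "GPU release 2018-03-01", "companies": [
        {"company_id": 3, "name": "Nvidia"}]},
    {"title": "Chipset release", "companies": [
        {"company_id": 2, "name": "Intel"}]},
]


def count_nested_by_company(docs, title_word):
    """Filter parents by title, then bucket and count their nested companies."""
    # Step 1: keep only "parent" documents whose title contains the word
    # (a rough stand-in for a match query; real analysis is more involved).
    word = title_word.lower()
    parents = [d for d in docs if word in d["title"].lower().split()]
    # Steps 2 and 3: bucket nested objects by company_id and count each bucket.
    return Counter(c["company_id"] for d in parents for c in d["companies"])


counts = count_nested_by_company(DOCS, "GPU")
print(dict(counts))  # {1: 1, 3: 2}
```

    The counts line up with what the question asks for: one GPU document for AMD and two for Nvidia.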

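    Each nested object in the array is indexed as a separate hidden document, which makes life a bit more complicated. Let's see how to aggregate over them.

    So how do we aggregate and count the nested documents?

    You can achieve this with a combination of the nested, terms and top_hits aggregations: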

    POST my_index/_doc/_search
    {
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "title": "GPU"
              }
            },
            {
              "nested": {
                "path": "companies",
                "query": {
                  "match_all": {}
                }
              }
            }
          ]
        }
      },
      "aggs": {
        "Extract nested": {
          "nested": {
            "path": "companies"
          },
          "aggs": {
            "By company id": {
              "terms": {
                "field": "companies.company_id"
              },
              "aggs": {
                "Examples of such company_id": {
                  "top_hits": {
                    "size": 1
                  }
                }
              }
            }
          }
        }
      }
    }
    

    This will give the following output:

    {
      ...
      "hits": { ... },
      "aggregations": {
        "Extract nested": {
          "doc_count": 3, <== How many "nested" documents there were?
          "By company id": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": 3,  <== this bucket's key: "company_id": 3
                "doc_count": 2, <== how many "nested" documents there were with such company_id?
                "Examples of such company_id": {
                  "hits": {
                    "total": 2,
                    "max_score": 1.5897496,
                    "hits": [  <== an example, "top hit" for such company_id
                      {
                        "_nested": {
                          "field": "companies",
                          "offset": 1
                        },
                        "_score": 1.5897496,
                        "_source": {
                          "company_id": 3,
                          "name": "Nvidia"
                        }
                      }
                    ]
                  }
                }
              },
              {
                "key": 1,
                "doc_count": 1,
                "Examples of such company_id": {
                  "hits": {
                    "total": 1,
                    "max_score": 1.5897496,
                    "hits": [
                      {
                        "_nested": {
                          "field": "companies",
                          "offset": 0
                        },
                        "_score": 1.5897496,
                        "_source": {
                          "company_id": 1,
                          "name": "AMD"
                        }
                      }
                    ]
                  }
                }
              }
            ]
          }
        }
      }
    }
    

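    Note that for Nvidia we have "doc_count": 2.

    But what if we want to count the number of "parent" objects that have Nvidia or Intel?

    What if we want to count parent objects based on the nested buckets?

    This can be achieved with the reverse_nested aggregation.

    We need to change our query slightly: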

    POST my_index/_doc/_search
    {
      "query": { ... },
      "aggs": {
        "Extract nested": {
          "nested": {
            "path": "companies"
          },
          "aggs": {
            "By company id": {
              "terms": {
                "field": "companies.company_id"
              },
              "aggs": {
                "Examples of such company_id": {
                  "top_hits": {
                    "size": 1
                  }
                },
                "original doc count": { <== we ask ES to count how many parent docs there are
                  "reverse_nested": {}
                }
              }
            }
          }
        }
      }
    }
    

    The result will look like this:

    {
      ...
      "hits": { ... },
      "aggregations": {
        "Extract nested": {
          "doc_count": 3,
          "By company id": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": 3,
                "doc_count": 2,
                "original doc count": {
                  "doc_count": 2  <== how many "parent" documents have such company_id
                },
                "Examples of such company_id": {
                  "hits": {
                    "total": 2,
                    "max_score": 1.5897496,
                    "hits": [
                      {
                        "_nested": {
                          "field": "companies",
                          "offset": 1
                        },
                        "_score": 1.5897496,
                        "_source": {
                          "company_id": 3,
                          "name": "Nvidia"
                        }
                      }
                    ]
                  }
                }
              },
              {
                "key": 1,
                "doc_count": 1,
                "original doc count": {
                  "doc_count": 1
                },
                "Examples of such company_id": {
                  "hits": {
                    "total": 1,
                    "max_score": 1.5897496,
                    "hits": [
                      {
                        "_nested": {
                          "field": "companies",
                          "offset": 0
                        },
                        "_score": 1.5897496,
                        "_source": {
                          "company_id": 1,
                          "name": "AMD"
                        }
                      }
                    ]
                  }
                }
              }
            ]
          }
        }
      }
    }
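
    The response above still has to be reshaped on the client to get the list the question asks for. The following is a minimal sketch of that step, assuming the aggregation names used in this answer ("Extract nested", "By company id", "Examples of such company_id", "original doc count"); the function name flatten_company_buckets and the trimmed sample_response are hypothetical and exist only for illustration.

```python
def flatten_company_buckets(response):
    """Turn the aggregation response into company rows with parent counts."""
    buckets = response["aggregations"]["Extract nested"]["By company id"]["buckets"]
    rows = []
    for bucket in buckets:
        # The single top hit carries the nested object's _source (id and name).
        source = bucket["Examples of such company_id"]["hits"]["hits"][0]["_source"]
        rows.append({
            "company_id": bucket["key"],
            "name": source["name"],
            # Parent-document count from the reverse_nested sub-aggregation.
            "matched_documents": bucket["original doc count"]["doc_count"],
        })
    return rows


def _bucket(key, name, nested_count, parent_count):
    """Build one trimmed bucket in the same shape as the response above."""
    return {
        "key": key,
        "doc_count": nested_count,
        "original doc count": {"doc_count": parent_count},
        "Examples of such company_id": {
            "hits": {"hits": [{"_source": {"company_id": key, "name": name}}]}
        },
    }


# Trimmed-down copy of the response shown above, used only as sample input.
sample_response = {
    "aggregations": {
        "Extract nested": {
            "doc_count": 3,
            "By company id": {
                "buckets": [_bucket(3, "Nvidia", 2, 2), _bucket(1, "AMD", 1, 1)]
            },
        }
    }
}

rows = flatten_company_buckets(sample_response)
print(rows)
```

    Since only the aggregations are used, the search hits themselves are not needed on the client side.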
    

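    How can I spot the difference?

    To make the difference more obvious, let's change the data slightly and add another Nvidia item to the companies list of document 2: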

    PUT my_index/_doc/2
    {
      "title" : "GPU release 2018-01-10",
      "companies" : [
        { "company_id" : 1, "name" :  "AMD" },
        { "company_id" : 3, "name" :  "Nvidia" },
        { "company_id" : 3, "name" :  "Nvidia" }
      ]
    }
    

    The last query (the one with reverse_nested) will now give us the following:

      "By company id": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": 3,
            "doc_count": 3,    <== 3 "nested" documents with Nvidia
            "original doc count": {
              "doc_count": 2   <== but only 2 "parent" documents
            },
            "Examples of such company_id": {
              "hits": {
                "total": 3,
                "max_score": 1.5897496,
                "hits": [
                  {
                    "_nested": {
                      "field": "companies",
                      "offset": 2
                    },
                    "_score": 1.5897496,
                    "_source": {
                      "company_id": 3,
                      "name": "Nvidia"
                    }
                  }
                ]
              }
            }
          },
    

    As you can see, this is a subtle difference that is hard to grasp, but it changes the semantics completely.
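
    The same distinction can be reproduced without Elasticsearch. The sketch below is only an illustration on the updated sample data: it counts every nested entry once (like the bucket's doc_count) and each parent at most once per company (like reverse_nested). The variable names are made up for this example.

```python
from collections import Counter

# GPU documents after the update: document 2 now lists Nvidia twice.
# Keys are document ids, values are the company_ids of the nested objects.
GPU_DOCS = {
    2: [1, 3, 3],  # AMD, Nvidia, Nvidia
    3: [3],        # Nvidia
}

# Like the terms bucket doc_count: every nested object is counted.
nested_counts = Counter(cid for ids in GPU_DOCS.values() for cid in ids)

# Like reverse_nested: each parent is counted once per distinct company_id.
parent_counts = Counter(cid for ids in GPU_DOCS.values() for cid in set(ids))

print(nested_counts[3], parent_counts[3])  # 3 2
```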

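    What about performance?

    While in most cases the performance of nested queries and aggregations should be sufficient, it does of course come at a cost. It is therefore recommended to avoid nested or parent-child types when tuning for search speed.

    In Elasticsearch, the best performance is usually achieved through denormalization, although there is no single recipe and you should choose the data model according to your needs.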
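
    As one possible illustration of denormalization (an assumption, not something prescribed by this answer), each parent could be expanded into one flat document per distinct company before indexing. The field names parent_id and company_name and the helper denormalize are hypothetical.

```python
def denormalize(doc_id, doc):
    """Yield one flat document per distinct (parent, company) pair."""
    seen = set()
    for company in doc["companies"]:
        if company["company_id"] in seen:
            continue  # skip repeated companies so each pair appears once
        seen.add(company["company_id"])
        yield {
            "parent_id": doc_id,
            "title": doc["title"],  # parent fields are duplicated on purpose
            "company_id": company["company_id"],
            "company_name": company["name"],
        }


doc_2 = {
    "title": "GPU release 2018-01-10",
    "companies": [
        {"company_id": 1, "name": "AMD"},
        {"company_id": 3, "name": "Nvidia"},
        {"company_id": 3, "name": "Nvidia"},
    ],
}

flat_docs = list(denormalize(2, doc_2))
print(flat_docs)
```

    With documents shaped like this, a plain terms aggregation on company_id would presumably count parents directly, at the price of storing the title once per company.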

    Hope this clarifies the nested stuff for you!

    [Comments]:

    • Thank you very much for the detailed answer. I hope I will soon have time to check whether it works for my project.