Getting unique document counts for top hits array aggregation, sum_other_doc_count: answers

【Question title】: Getting unique document counts for top hits array aggregation, sum_other_doc_count
【Posted】: 2020-06-14 17:32:21
【Question description】:

I have a large number of documents (millions), each containing an array of keyword values:

Mapping:

{
    "my_index": {
        "mappings": {
            "properties": {
                "id": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "keywords": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                }
            }
        }
    }
}

Sample documents:

{
  "id": "abc",
  "keywords": ["cat", "dog", "person"]
}
{
  "id": "def",
  "keywords": ["tree", "person"]
}
{
  "id": "ghi",
  "keywords": ["person", "human"]
}
...

Suppose I fetch the top 3 keyword buckets and want everything else reported under "other", like this:

GET /my_index/_search
{
    "size": 0,
    "track_total_hits": true,
    "aggs": {
        "keyword_buckets": {
            "terms": {
                "field": "keywords.keyword",
                "size": 3
            }
        }
    }
}

There are 2,232,121 documents, but the buckets I get back look like this:

{
    "took": 256,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 2232121,
            "relation": "eq"
        },
        "max_score": null,
        "hits": []
    },
    "aggregations": {
        "keyword_buckets": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 6250132,
            "buckets": [
                {
                    "key": "person",
                    "doc_count": 326552
                },
                {
                    "key": "human",
                    "doc_count": 326529
                },
                {
                    "key": "photograph",
                    "doc_count": 222190
                }
            ]
        }
    }
}

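I get 6,250,132 documents in the "other" bucket (sum_other_doc_count). My expectation was that the top 3 buckets plus "other" would add up to 2,232,121. In SQL terms, I want the DISTINCT document count across all buckets.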
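As a quick sanity check on the figures above, the sketch below simply adds up the reported buckets. It assumes (the response itself does not say this) that a terms aggregation on a multi-valued field counts a document once in every bucket whose term it contains, so the bucket totals would count document-keyword pairs rather than distinct documents. The variable names are illustrative only.

```python
# Figures copied from the search response above.
total_docs = 2_232_121
top_buckets = {"person": 326_552, "human": 326_529, "photograph": 222_190}
sum_other_doc_count = 6_250_132

# Sum of every bucket, whether returned or folded into "other".
all_bucket_docs = sum(top_buckets.values()) + sum_other_doc_count

# Under the assumption above, this ratio is the average number of
# distinct keywords per document, not an error ratio.
avg_keywords_per_doc = all_bucket_docs / total_docs

print(all_bucket_docs)                 # 7125403
print(round(avg_keywords_per_doc, 2))  # 3.19
```

If that reading is right, an average of roughly 3.19 keywords per document would account for the gap between the bucket totals and the 2,232,121 hits.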

What query do I need to achieve this?

【Question discussion】:

Tags: elasticsearch unique aggregation

【Solution 1】:

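Elasticsearch will not give you an exact doc_count. Document counts are always approximate. This is because, by design, an Elasticsearch query looks at the top terms on each shard and combines them. You can read more about it here.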
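Separately from any per-shard approximation, the array field itself may explain much of the gap. The following is a toy, single-shard model of a terms aggregation run over the three sample documents from the question. It is not Elasticsearch code: the helper name `simulate_terms_agg` and its tie-breaking rule are assumptions made for illustration.

```python
from collections import Counter

# The three sample documents from the question.
docs = [
    {"id": "abc", "keywords": ["cat", "dog", "person"]},
    {"id": "def", "keywords": ["tree", "person"]},
    {"id": "ghi", "keywords": ["person", "human"]},
]

def simulate_terms_agg(docs, field, size):
    """Toy single-shard model of a terms aggregation (hypothetical helper)."""
    counts = Counter()
    for doc in docs:
        # A document is counted once per distinct value it holds.
        counts.update(set(doc[field]))
    # Sort by count descending, then key ascending, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    top = ranked[:size]
    # Everything outside the top `size` buckets is summed into "other".
    other = sum(count for _, count in ranked[size:])
    return top, other

top, sum_other = simulate_terms_agg(docs, "keywords", size=1)
print(top, sum_other, len(docs))
```

In this model the single top bucket holds 3 documents and "other" sums to 4, even though only 3 documents exist, because each document contributes one count per keyword.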

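【Discussion】:

Yes, I read that, but it seems odd that it would be off by this much. We are talking about a difference of more than 300%.
Pay particular attention to the "doc_count_error_upper_bound": 0 part.
Yes, that is expected. Elasticsearch does not work like SQL. You can improve accuracy using the methods described here: qbox.io/blog/elasticsearch-aggregation-custom-analyzer
Even when I set show_term_doc_count_error to true, why does ES report an error of 0 in every case?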
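Returning to the original question of which query to run, one possible approach (a sketch only, not confirmed anywhere in this thread) is a second request that uses a `filters` aggregation with `other_bucket_key`, so that documents matching none of the top terms are counted once in a separate bucket. The function name `build_filters_request` is hypothetical, the top terms are assumed to come from a prior terms aggregation, and the request has not been run against the asker's index.

```python
import json

def build_filters_request(top_terms, field="keywords.keyword"):
    """Build a search body with one named filter per term plus an "other" bucket."""
    return {
        "size": 0,
        "track_total_hits": True,
        "aggs": {
            "keyword_buckets": {
                "filters": {
                    # Documents matching none of the filters below land here.
                    "other_bucket_key": "other",
                    "filters": {
                        term: {"term": {field: term}} for term in top_terms
                    },
                }
            }
        },
    }

# Top terms taken from the response shown in the question.
body = build_filters_request(["person", "human", "photograph"])
print(json.dumps(body, indent=2))
```

Note that the named buckets can still overlap, since a document tagged with both "person" and "human" matches two filters. Only the "other" bucket counts each document once, so the number of distinct documents covered by the top terms would be the total hit count minus the "other" count.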