Elasticsearch unique documents after querying with match-phrase

【Question title】: Elasticsearch unique documents after querying with match-phrase
【Posted】: 2021-11-04 15:46:10
【Question】:

Hey Stack Overflow, I have an Elasticsearch document that looks like the one below. I am only interested in the "tags" key.

{
    "_index": "graph_20211025t0909",
    "_type": "_doc",
    "_id": "E12201A5-CC50-40AF-97AE-C54A2CA303F7",
    "_score": null,
    "_source": {
        "entity_id": "E12201A5-CC50-40AF-97AE-C54A2CA303F7",
        "properties": {
            "external": {
                "facebook": {
                    "id": "muji.jp"
                },
                "instagram": {
                    "id": "muji_global"
                },
                "twitter": {
                    "id": "muji_net"
                },
                "wikidata": {
                    "id": "Q708789"
                }
            },
            "akas": [
                {
                    "value": "Muji",
                    "language": "zh"
                },
                {
                    "value": "multinacional japonesa",
                    "language": "es"
                }
            ]
        },
        "data_source": {
            "data_pull_date": "202109",
            "source_id": "muji_global",
            "dataset": "brand"
        },
        "scoring_entity_data_size": 5306,
        "population_percentile": 0.9855572298745676,
        "type_synonyms": [],
        "@version": "1",
        "@timestamp": "2021-10-25T16:28:24.892Z",
        "name": "Muji",
        "types": [
            "urn:entity:brand"
        ],
        "tags": [
            {
                "tag_id": "D24DE9CF-C778-4468-8433-5A0E8AA2BA9D",
                "name": "Wikipedia articles with GND identifiers",
                "type": "urn:tag:wikipedia_category"
            },
            {
                "tag_id": "67A608CC-2DA3-4C78-B7F6-6DD419744FFC",
                "name": "Clothing brands of Japan",
                "type": "urn:tag:wikipedia_category"
            }
        ]
    }
}

My Elasticsearch query is:

{
    "size": 20,
    "_source": ["tags"],
    "sort": [
        { "@timestamp": { "order": "desc" } }
    ],
    "query": {
        "nested" : {
            "path" : "tags",
                "query" : {
                    "bool" : {
                        "must" : [
                          { "match_phrase" : {"tags.name" : "thriller"} }
                        ]    
                }
            }
        }
    }
}

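My question is: how can my Elasticsearch query return unique documents? I am searching on "tags.name" within the "tags" field, and I want the "tags" field to come back as a set of unique items. For example, this is what I currently get back: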

tags: [
    {
        {
            "name": "Male actors",
            "tag_id": "A2A18D57-24B5-4578-B0D3-2A9190EEAD7C",
            "type": "urn:tag:wikipedia_category"
        },
        {
            "name": "some tag name",
            "tag_id": "0CB4BE42-026F-4B14-A59A-C5A331E8A56F",
            "type": "urn:tag:wikipedia_category"
        }
    },
    {
        {
            "name": "Male actors",
            "tag_id": "A2A18D57-24B5-4578-B0D3-2A9190EEAD7C",
            "type": "urn:tag:wikipedia_category"
        },
        {
            "name": "another tag name",
            "tag_id": "0CB4BE42-026F-4B14-A59A-C5A331E8A56F",
            "type": "urn:tag:wikipedia_category"
        }
    }
]

I want my results not to repeat "name": "Male actors".

【Question comments】:

Tags: elasticsearch

【Solution 1】:

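The tags returned by your query come from different documents, so you cannot assume they are unique across the result set. My suggestion is to use an aggregation to get the unique tags.name values: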

{
    "size": 20,
    "_source": ["tags"],
    "sort": [
        { "@timestamp": { "order": "desc" } }
    ],
    "query": {
        "nested" : {
            "path" : "tags",
                "query" : {
                    "bool" : {
                        "must" : [
                          { "match_phrase" : {"tags.name" : "thriller"} }
                        ]    
                }
            }
        }
    },
   "aggs": {
     "unique_tags": {
       "nested": {
         "path": "tags"
       },
       "aggs": {
         "tag_name": {
           "terms": {
              "field": "tags.name"
           }
         }
       }
     }
   }
}
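If what is needed is the full tag objects (with tag_id and type) rather than bucket counts, one alternative is to de-duplicate on the client after the search returns. The following is only a sketch: `unique_tags` is a hypothetical helper name, the response is assumed to have the standard `hits.hits[]._source` shape, and the sample data is copied from the output shown in the question. It keys on "name" because that is the field the question wants not to repeat.

```python
from typing import Any, Dict, List


def unique_tags(response: Dict[str, Any], key_field: str = "name") -> List[Dict[str, Any]]:
    """Collect tags from every hit, keeping the first tag seen for each key_field value."""
    seen = set()
    result = []
    for hit in response.get("hits", {}).get("hits", []):
        for tag in hit.get("_source", {}).get("tags", []):
            key = tag.get(key_field)
            if key in seen:
                continue  # this value was already emitted by an earlier hit
            seen.add(key)
            result.append(tag)
    return result


# Trimmed search response built from the tags shown in the question (two hits).
sample_response = {
    "hits": {
        "hits": [
            {"_source": {"tags": [
                {"name": "Male actors",
                 "tag_id": "A2A18D57-24B5-4578-B0D3-2A9190EEAD7C",
                 "type": "urn:tag:wikipedia_category"},
                {"name": "some tag name",
                 "tag_id": "0CB4BE42-026F-4B14-A59A-C5A331E8A56F",
                 "type": "urn:tag:wikipedia_category"},
            ]}},
            {"_source": {"tags": [
                {"name": "Male actors",
                 "tag_id": "A2A18D57-24B5-4578-B0D3-2A9190EEAD7C",
                 "type": "urn:tag:wikipedia_category"},
                {"name": "another tag name",
                 "tag_id": "0CB4BE42-026F-4B14-A59A-C5A331E8A56F",
                 "type": "urn:tag:wikipedia_category"},
            ]}},
        ]
    }
}

print([tag["name"] for tag in unique_tags(sample_response)])
# → ['Male actors', 'some tag name', 'another tag name']
```

This only de-duplicates within the hits that were actually returned (20 with the query above), so it is not a substitute for the aggregation when uniqueness across the whole index is required.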

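【Comments】:

Given my query above, I am trying to make the returned documents unique. How would I do that with the current query listed above?
You just need to add the aggs section to your query. I will edit my answer.
Thanks for doing that, but I just ran the query and got an error: Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [tags.name] in order to load field data by uninverting the inverted index. Note that this can use significant memory.
I think that is because the field tags.name is not of type keyword. If you can update your mapping to include a keyword sub-field for tags.name, e.g. tags.name.raw, it will work.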
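The keyword sub-field suggested in the last comment could be sketched as follows. This is an illustration under stated assumptions, not the asker's actual mapping: it assumes tags.name is currently a plain `text` field with default settings inside a `nested` tags field, and `raw` is just the sub-field name used in the comment. The first dict is a body for the update-mapping API (`PUT /<index>/_mapping`); `build_unique_tags_aggs` is a hypothetical helper that produces the aggregation from the answer, pointed at the keyword sub-field. Documents indexed before the mapping change would need to be reindexed (for example with `_update_by_query`) before the sub-field holds any values.

```python
import json

# Body for PUT /<index>/_mapping: adds a keyword multi-field to tags.name.
# Assumption: tags.name is an ordinary text field; if it has a custom analyzer,
# the same settings must be repeated here.
mapping_update = {
    "properties": {
        "tags": {
            "type": "nested",
            "properties": {
                "name": {
                    "type": "text",
                    "fields": {"raw": {"type": "keyword"}},
                }
            },
        }
    }
}


def build_unique_tags_aggs(field: str = "tags.name.raw", size: int = 10) -> dict:
    """Nested terms aggregation over the keyword sub-field (size 10 is the terms default)."""
    return {
        "unique_tags": {
            "nested": {"path": "tags"},
            "aggs": {"tag_name": {"terms": {"field": field, "size": size}}},
        }
    }


print(json.dumps(mapping_update, indent=2))
print(json.dumps({"size": 0, "aggs": build_unique_tags_aggs()}, indent=2))
```

The printed bodies can be sent with any HTTP client or Elasticsearch client. Raising `size` returns more than the ten most frequent tag names.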