Ignore term frequency but use positions: answers

【Question title】: Ignore term frequency but use positions
【Posted】: 2020-09-14 11:15:15
【Question description】:

I have an index with text fields. I want to ignore term frequency in scoring but keep positions, so that match_phrase queries still work.

My index is defined as follows:

curl --location --request PUT 'localhost:9200/my-index-001' \
--header 'Content-Type: application/json' \
--data-raw '{
    "mappings": {
        "autocomplete": {
            "properties": {
                "title": {
                    "type": "text",
                    "analyzer": "row_autocomplete"
                },
                "name": {
                    "type": "text",
                    "analyzer": "row_autocomplete"
                }
            }
        }
    },
    "settings": {
        "analysis": {
            "analyzer": {
                "row_autocomplete": {
                    "tokenizer": "icu_tokenizer",
                    "filter": ["icu_folding", "autocomplete_filter", "lowercase"]
                }
            },
            "filter": {
                "autocomplete_filter": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20
                }
            }
        }
    }
}'

Indexed data:

[
    {
        "title": "university",
        "name": "london and EC london English"
    },
    {
        "title": "city",
        "name": "london"
    }
]

When I run a match query like this, I expect the city to get the higher score:

POST _search

{
    "query": {
        "bool": {
            "should": [
                {
                    "match": {
                        "name": {
                            "query": "london"
                        }
                    }
                },
                {
                    "match_phrase": {
                        "name": {
                            "query": "london"
                        }
                    }
                }
            ]
        }
    }
}

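The two documents get different scores (the university actually scores higher than the city) because of term frequency. What I want is to count each term only once. The city's fieldLength is smaller than the university's, so if repeated occurrences were ignored in termFreq, the city would score higher than the university. For reference, here is what Elasticsearch's explain output reports: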

GET _explain

# city's _explain
{
    "value": 2.0785222,
    "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
    "details": [
        {
            "value": 6.0,
            "description": "termFreq=6.0",
            "details": []
        },
        {
            "value": 2.0,
            "description": "fieldLength",
            "details": []
        },
        ...
    ]
}

# university's explain
{
    "value": 2.1087635,
    "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
    "details": [
        {
            "value": 24.0,
            "description": "termFreq=24.0",
            "details": []
        },
        {
            "value": 29.0,
            "description": "fieldLength",
            "details": []
        },
        ...
    ]
}
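
To make the effect of termFreq concrete, the tfNorm formula quoted in the explain output can be evaluated directly. The following is a minimal Python sketch, not Elasticsearch code: `tf_norm` is a hypothetical helper, k1 = 1.2 and b = 0.75 are the usual BM25 defaults (assumed here, since the output elides them), and `AVG_FIELD_LENGTH` is an illustrative assumption because avgFieldLength is not shown above.

```python
def tf_norm(freq, field_length, avg_field_length, k1=1.2, b=0.75):
    """Evaluate tfNorm as written in the explain description:
    (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))
    """
    length_ratio = field_length / avg_field_length
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * length_ratio))


# ASSUMPTION: avgFieldLength is elided in the explain output, so this
# value is only illustrative and not taken from the question.
AVG_FIELD_LENGTH = 35.4

# termFreq and fieldLength as reported in the two explain outputs.
city_reported = tf_norm(6, 2, AVG_FIELD_LENGTH)
university_reported = tf_norm(24, 29, AVG_FIELD_LENGTH)

# The same documents if every term were counted only once.
city_capped = tf_norm(1, 2, AVG_FIELD_LENGTH)
university_capped = tf_norm(1, 29, AVG_FIELD_LENGTH)

print("reported freq: city < university ->", city_reported < university_reported)
print("freq capped at 1: city > university ->", city_capped > university_capped)
```

Under these assumptions, the university scores slightly higher with the reported frequencies, while the shorter city field wins once each term is counted once. That is the behaviour the question is asking for.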

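I tried a few things. For example, in the index mapping I can set index_options=docs to ignore term frequencies, but that also disables term positions, so I can no longer use match_phrase queries.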
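
The reason index_options cannot drop frequencies while keeping positions is that its levels are cumulative: each level stores everything the previous one does plus one more kind of information. A small Python sketch of that ordering follows; the helper names `stored_features` and `supports_phrase_queries` are hypothetical and only model the documented levels, they are not part of any Elasticsearch client.

```python
# The four values accepted by the `index_options` mapping parameter,
# ordered from least to most information stored in the inverted index.
INDEX_OPTIONS = ["docs", "freqs", "positions", "offsets"]

# What each level adds on top of the previous one.
ADDED_BY_LEVEL = {
    "docs": "doc numbers",
    "freqs": "term frequencies",
    "positions": "term positions",
    "offsets": "character offsets",
}


def stored_features(level):
    """Return everything indexed at `level` (levels are cumulative)."""
    idx = INDEX_OPTIONS.index(level)
    return [ADDED_BY_LEVEL[name] for name in INDEX_OPTIONS[: idx + 1]]


def supports_phrase_queries(level):
    """Phrase queries need term positions in the index."""
    return "term positions" in stored_features(level)


for level in INDEX_OPTIONS:
    print(level, stored_features(level), supports_phrase_queries(level))
```

Every level that supports phrase queries also stores term frequencies, so the mapping alone cannot express "positions without frequencies".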

Does anyone have an idea?

Thanks in advance.

【Discussion】:

Tags: elasticsearch

【Solution 1】:

You can use a constant score query, which wraps a filter query and returns every matching document with a relevance score equal to the boost parameter value.

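If you use a constant score query, your match query will not produce any score other than 0 or 1. This is because it acts like a filter that only decides whether the query matches or not; the match query no longer behaves like scored full-text matching.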

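The constant_score query accepts a boost parameter, which is used as the score of every returned document when the query is combined with other queries. By default, boost is set to 1.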
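
As an illustration of the boost parameter, the sketch below builds a constant_score request body in Python and prints it as JSON. The helper name `constant_score_query` and the boost value 1.2 are made up for the example; only the `constant_score`, `filter`, and `boost` keys come from the query DSL.

```python
import json


def constant_score_query(field, text, boost=1.0):
    """Wrap a match query in constant_score so every hit scores `boost`."""
    return {
        "query": {
            "constant_score": {
                "filter": {"match": {field: text}},
                "boost": boost,
            }
        }
    }


# Every document matching "london" in `name` would get a score of 1.2,
# regardless of how often the term occurs or how long the field is.
body = constant_score_query("name", "london", boost=1.2)
print(json.dumps(body, indent=4))
```

Because the score is fixed, neither term frequency nor field length influences ranking here, so this removes more than just the term-frequency effect.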

Refer to this for a detailed explanation of the bool filter, and to this SO answer for the difference between the constant score query and the bool filter.

Adding a working example with index data, search query, and search result.

Index data:

{ "name": "london only" }
{ "name": "london and London" }

Search query:

{
    "query": {
        "constant_score": {
            "filter": {
                "bool": {
                    "must": [
                        {
                            "match": {
                                "name": "london"
                            }
                        }
                    ]
                }
            }
        }
    }
}

Search result:

"hits": [
            {
                "_index": "my_index",
                "_type": "_doc",
                "_id": "1",
                "_score": 1.0,
                "_source": {
                    "name": "london and london"
                }
            },
            {
                "_index": "my_index",
                "_type": "_doc",
                "_id": "2",
                "_score": 1.0,
                "_source": {
                    "name": "london only"
                }
            }
        ]

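【Discussion】:

Thanks, @Bhavya. Maybe I didn't describe my problem clearly. I have already tried the constant_score query, but it is not what I want. I will edit my question again to make it precise. Anyway, thanks for your answer.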

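【Solution 2】:

I indexed your two sample documents with the default index mapping, so both the title and name fields are text fields. Using the same query, the document that contains only london got the higher score, as shown below: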

"hits": [
            {
                "_index": "matchrel",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.51518387,
                "_source": {
                    "title": "city",
                    "name": "london"
                }
            },
            {
                "_index": "matchrel",
                "_type": "_doc",
                "_id": "2",
                "_score": 0.41750965,
                "_source": {
                    "title": "university",
                    "name": "london university and EC London English"
                }
            }
        ]

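Also, since you haven't explained your use case in detail and the information is limited, it seems this can be achieved easily with the following query, which also gives the london doc the higher score: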

{
    "query": {
        "match_phrase": {
            "name": "london"
        }
    }
}

And its search result:

 "hits": [
            {
                "_index": "matchrel",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.25759193, // note score
                "_source": {
                    "title": "city",
                    "name": "london"
                }
            },
            {
                "_index": "matchrel",
                "_type": "_doc",
                "_id": "2",
                "_score": 0.20875482,
                "_source": {
                    "title": "university",
                    "name": "london university and EC London English"
                }
            }
        ]

【Discussion】:

I did forget to mention my analyzer. I have edited my question again. Thanks for your help.