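[Posted]: 2020-09-14 11:15:15
[Question]:
I have an index with text fields, and I want scoring to ignore term frequency while keeping positions, so that match_phrase queries still work.
My index is defined as follows: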
curl --location --request PUT 'localhost:9200/my-index-001' \
--header 'Content-Type: application/json' \
--data-raw '{
"mappings": {
"autocomplete": {
"properties": {
"title": {
"type": "text",
"analyzer": "row_autocomplete"
},
"name": {
"type": "text",
"analyzer": "row_autocomplete"
}
}
}
},
"settings": {
"analysis": {
"analyzer": {
"row_autocomplete": {
"tokenizer": "icu_tokenizer",
"filter": ["icu_folding", "autocomplete_filter", "lowercase"]
}
},
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
}
}
}
}'
Indexed data:
[
{
"title": "university",
"name": "london and EC london English"
},
{
"title": "city",
"name": "london"
}
]
When I run a query like the following, I expect city to get the higher score:
POST _search
{
"query": {
"bool": {
"should": [
{
"match": {
"name": {
"query": "london"
}
}
},
{
"match_phrase": {
"name": {
"query": "london"
}
}
}
]
}
}
}
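The two documents get different scores, and university actually scores higher than city, because of term frequency. What I want is for a term to count only once per field, so that the ranking is decided by fieldLength. city's fieldLength is smaller than university's, so if repeated occurrences were ignored in termFreq, city would score higher than university. For reference, here is the Elasticsearch explain output: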
GET _explain
# city's _explain
{
"value": 2.0785222,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 6.0,
"description": "termFreq=6.0",
"details": []
},
{
"value": 2.0,
"description": "fieldLength",
"details": []
},
...
]
}
# university's explain
{
"value": 2.1087635,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 24.0,
"description": "termFreq=24.0",
"details": []
},
{
"value": 29.0,
"description": "fieldLength",
"details": []
},
...
]
}
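To make the arithmetic behind these two explain outputs concrete, here is a minimal Python sketch of the tfNorm formula exactly as it is printed in the description field. It is only an illustration, not Elasticsearch code. The values k1=1.2 and b=0.75 are the BM25 defaults and are assumed here, since the explain output above elides them. avgFieldLength is elided too, so AVG_FIELD_LENGTH is a hypothetical placeholder picked only so the results land near the reported values.

```python
def tf_norm(freq, field_length, avg_field_length, k1=1.2, b=0.75):
    """tfNorm as printed in the _explain description above."""
    return (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length)
    )


# ASSUMPTION: avgFieldLength is not shown in the explain output ("...").
# 35.4 is a hypothetical placeholder, not a value read from the index.
AVG_FIELD_LENGTH = 35.4

# termFreq and fieldLength copied from the two explain outputs.
docs = {
    "city": {"freq": 6.0, "field_length": 2.0},
    "university": {"freq": 24.0, "field_length": 29.0},
}

for label, doc in docs.items():
    # Score component with the real term frequency.
    as_indexed = tf_norm(doc["freq"], doc["field_length"], AVG_FIELD_LENGTH)
    # Same component if the term were counted only once.
    counted_once = tf_norm(1.0, doc["field_length"], AVG_FIELD_LENGTH)
    print(f"{label}: as indexed={as_indexed:.4f}, counted once={counted_once:.4f}")
```

With freq fixed at 1, the only per-document input left is fieldLength, so the shorter field (city) comes out ahead, which is the behavior I am after.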
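I have tried a few things. For example, setting index_options=docs in the index mapping makes scoring ignore term frequency, but it also disables term positions, so I can no longer use match_phrase queries.
Does anyone know how to achieve this?
Thanks in advance.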
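The reason index_options cannot solve this on its own is that its levels are cumulative. The small Python sketch below models what each documented level stores. The helper names are hypothetical and exist only for this illustration; the four level names are the ones the mapping parameter accepts.

```python
# What each index_options level stores in the inverted index.
# Each level includes everything stored by the level before it.
INDEX_OPTIONS = {
    "docs": {"doc_ids"},
    "freqs": {"doc_ids", "term_freqs"},
    "positions": {"doc_ids", "term_freqs", "positions"},
    "offsets": {"doc_ids", "term_freqs", "positions", "offsets"},
}


def supports_phrase_queries(level):
    # match_phrase needs term positions (hypothetical helper).
    return "positions" in INDEX_OPTIONS[level]


def ignores_term_frequency(level):
    # Term frequency cannot affect scoring if it is not indexed.
    return "term_freqs" not in INDEX_OPTIONS[level]


# Levels that would satisfy both requirements at once.
both = [
    level
    for level in INDEX_OPTIONS
    if supports_phrase_queries(level) and ignores_term_frequency(level)
]
print(both)  # []
```

No level keeps positions while dropping frequencies, so the fix presumably has to come from the scoring side rather than from index_options.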
[Discussion]:
Tags: elasticsearch