Why does a more exact match get a lower score than a less exact match in an Elasticsearch full-text search?

【Question title】: Why does a more exact match get a lower score than a less exact match in a full text search in elasticsearch?
【Posted】: 2019-04-03 05:18:09
【Question】:

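I search some of my data through Elasticsearch because it offers better full-text search than MongoDB. I am running into a few problems, though, and this is one of them.

The data I have stored in Elasticsearch looks like this: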

[{
   "word": "tidak berpuas hati",
   "type": "NEGATIVE",
   "score": -0.3908697916666666
  },{
   "word": "berpuas hati",
   "type": "POSITIVE",
   "score": 0.65375
  },{
   "word": "hati",
   "type": "POSITIVE",
   "score": 0.6
  },{
   "word": "tidak",
   "type": "NEGATIVE",
   "score": 0.6
}]

But when I search this data for the sentence saya tidak berpuas hati, I get a response like this:

"hits": [
 {
    "_index": "sentiment",
    "_type": "ms",
    "_id": "8SPiimYBKsyQt_Jg1VYa",
    "_score": 8.838576,
    "_source": {
       "word": "berpuas hati",
       "type": "POSITIVE",
       "score": 0.65375
    },
    "highlight": {
       "word": [
          "<em>berpuas</em> <em>hati</em>"
       ]
    }
 },
 {
    "_index": "sentiment",
    "_type": "ms",
    "_id": "PiPiimYBKsyQt_Jg1U4U",
    "_score": 8.774891,
    "_source": {
       "word": "tidak berpuas hati",
       "type": "NEGATIVE",
       "score": -0.3908697916666666
    },
    "highlight": {
       "word": [
          "<em>tidak</em> <em>berpuas</em> <em>hati</em>"
       ]
    }
 },
 {
    "_index": "sentiment",
    "_type": "ms",
    "_id": "ByPiimYBKsyQt_Jg1VUZ",
    "_score": 5.045017,
    "_source": {
       "word": "hati",
       "type": "POSITIVE",
       "score": 0.6
    },
    "highlight": {
       "word": [
          "<em>hati</em>"
       ]
    }
  }
]

Here is my query:

# term holds the search text, e.g. "saya tidak berpuas hati"
query = {
    "from": 0,
    "size": 20,
    "query": {
        "match": {
            "word": {
                "query": term,
                "operator": "or",
                "fuzziness": "auto"
            }
        }
    },
    "highlight": {
        "fields": {
            "word": {}
        }
    }
}

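So the problem is that I don't understand why tidak berpuas hati does not score higher than berpuas hati. When I change the value of from to 1, it starts working for this sentence, but stops working for single-word sentences.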

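【Comments】:

How much data are we talking about in this example? How many shards does your ES index have? Take a look at elastic.co/guide/en/elasticsearch/reference/master/… Maybe that explains what you are seeing.
I have about 25,000 documents and "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }
I think what you want is a match phrase query: elastic.co/guide/en/elasticsearch/reference/current/…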

Tags: elasticsearch

【Solution 1】:

Elasticsearch scores are calculated per shard.

In this case, the berpuas hati document gets the higher score because it is more relevant within its own shard than the tidak berpuas hati document is within its shard.

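Relevance in Elasticsearch is determined by several factors, but here I would say the reason is that the shard holding tidak berpuas hati contains more documents with one (or more) of the terms tidak, berpuas, or hati than the shard holding berpuas hati. The more documents in a shard contain a term, the less that term contributes to the score on that shard. It is a coincidence of how the documents were distributed.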
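To make the per-shard effect concrete, here is a minimal sketch of the inverse document frequency (IDF) term, assuming the index uses the BM25 similarity (the default since Elasticsearch 5.0). The shard sizes and document frequencies are hypothetical numbers chosen for illustration, not values from the asker's cluster.

```python
import math


def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """IDF component in the form used by Lucene's BM25 similarity.

    doc_count: number of documents on the shard that have the field
    doc_freq:  number of those documents that contain the term
    """
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical per-shard statistics: both shards hold 5,000 documents,
# but "hati" appears in 40 documents on shard A and in 400 on shard B.
idf_shard_a = bm25_idf(doc_count=5000, doc_freq=40)
idf_shard_b = bm25_idf(doc_count=5000, doc_freq=400)

print(f"shard A idf for 'hati': {idf_shard_a:.3f}")
print(f"shard B idf for 'hati': {idf_shard_b:.3f}")
```

The same term is worth less on shard B simply because it is more common there, so a document that matches more terms can still end up with a lower total score if it lives on the "wrong" shard.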

If you run the same query against an index that contains only these two documents, you will see a score of about 0.5 for berpuas hati and about 0.75 for tidak berpuas hati.

You can find out how a score was put together by adding "explain": true to your query. The scoring algorithm is explained here: https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
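As a rough sketch, the query from the question could be extended with the explain flag as follows. The helper name build_query is hypothetical; the body itself mirrors the one posted in the question.

```python
def build_query(term: str, explain: bool = False) -> dict:
    """Build the match query from the question, optionally with explain enabled.

    build_query is an illustrative helper, not part of any Elasticsearch client.
    """
    body = {
        "from": 0,
        "size": 20,
        "query": {
            "match": {
                "word": {
                    "query": term,
                    "operator": "or",
                    "fuzziness": "auto",
                }
            }
        },
        "highlight": {"fields": {"word": {}}},
    }
    if explain:
        # Ask Elasticsearch to attach a score breakdown to every hit.
        body["explain"] = True
    return body


body = build_query("saya tidak berpuas hati", explain=True)
```

With explain enabled, each hit in the response should carry an _explanation object that breaks the score down term by term, which makes it possible to compare the statistics used for the two documents.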

You may also want to read this: https://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-is-broken.html
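One way to check whether shard-local statistics are the cause is to run the search with the dfs_query_then_fetch search type, which first collects term statistics from all shards and then scores with the combined figures. The sketch below only builds the request arguments; the helper name search_kwargs is hypothetical, and the commented-out call assumes the official elasticsearch Python client of that era (which accepts a body argument) and a reachable cluster.

```python
def search_kwargs(index: str, body: dict) -> dict:
    """Arguments for a search that scores with index-wide term statistics.

    search_kwargs is an illustrative helper; only the search_type value
    is an Elasticsearch setting.
    """
    return {
        "index": index,
        "body": body,
        "search_type": "dfs_query_then_fetch",
    }


kwargs = search_kwargs(
    "sentiment",
    {"query": {"match": {"word": "saya tidak berpuas hati"}}},
)

# Assumed usage with the official client (not executed here):
# from elasticsearch import Elasticsearch
# es = Elasticsearch()
# response = es.search(**kwargs)
```

This adds an extra round trip per search, so it is usually more useful as a diagnostic than as a permanent setting.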

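【Comments】:

- Thanks for the great explanation. My understanding is that when we have little data, it is better to have a single shard.
If you are not planning on having more than 50GB of data (which is the recommended amount for a shard), yes.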
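For reference, a single-shard index can be requested through the index settings at creation time. This is a minimal sketch: the index name sentiment_v2 and the replica count are assumptions for illustration, and the commented-out call assumes the official elasticsearch Python client and a reachable cluster.

```python
# Settings body for an index with exactly one primary shard, so that all
# documents share the same term statistics when they are scored.
single_shard_index = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1,  # assumed value; adjust to the cluster
    }
}

# Assumed usage (not executed here); "sentiment_v2" is a hypothetical name:
# from elasticsearch import Elasticsearch
# es = Elasticsearch()
# es.indices.create(index="sentiment_v2", body=single_shard_index)
```

The number of primary shards is fixed when an index is created, so moving the existing 5-shard index to this layout would normally mean creating a new index and reindexing the documents into it.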