Elasticsearch multi-match query with the AND operator for tokens generated by the hyphenation_decompounder token filter

【Question title】: Elasticsearch Multi-Match Query with AND operator for the tokens generated by Hyphenation_decompounder token filter
【Posted】: 2021-01-21 11:46:14
【Question】:

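I am using hyphenation_decompounder for German and followed the example given in the documentation. So far so good, it works! The text kaffeetasse is tokenized into kaffee and tasse.

The problem arises when I run a multi-match query for kaffeetasse to find documents that match both kaffee and tasse. It seems that multi-match combines the tokens generated by the hyphenation_decompounder filter with OR instead of the operator given in the multi-match query ("AND"). Here is my test case:

Mapping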

curl -XPUT "http://localhost:9200/testidx" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "index": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": ["lowercase"]
          },
          "search": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": ["lowercase", "hyph"]
          }
        },
        "filter": {
          "hyph": {
            "type": "hyphenation_decompounder",
            "hyphenation_patterns_path": "analysis/de_DR.xml",
            "word_list": ["kaffee", "zucker", "tasse"],
            "only_longest_match": true,
            "min_subword_size": 4
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "index",
        "search_analyzer": "search"
      },
      "description": {
        "type": "text",
        "analyzer": "index",
        "search_analyzer": "search"
      }
    }
  }
}'

Document id=1

curl -XPOST "http://localhost:9200/testidx/_doc/1" -H 'Content-Type: application/json' -d'{  "title" : "Kaffee",  "description": "Milch Kaffee tasse"}'

Document id=2

curl -XPOST "http://localhost:9200/testidx/_doc/2" -H 'Content-Type: application/json' -d'{  "title" : "Kaffee",  "description": "Latte Kaffee Becher"}'

Multi-match query

curl -XGET "http://localhost:9200/testidx/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "multi_match": {
      "query": "kaffeetasse",
      "fields": ["title", "description"],
      "operator": "and",
      "type": "cross_fields",
      "analyzer": "search"
    }
  }
}'

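My expectation is that Elasticsearch should return only the single document with id=1, because it has both kaffee AND tasse in its fields. Instead, it returns both documents, since each of them contains kaffee or tasse.

Elasticsearch: 7.9.2

de_DR.xml was downloaded from https://sourceforge.net/projects/offo/files/offo-hyphenation/1.2/offo-hyphenation_v1.2.zip/download as described in the documentation.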

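【Comments】:

Struggling with the same problem. We use compound words such as Tablethalterung for e-commerce search. To decompound them we use a domain-specific word list, which splits the compounds correctly. But the results are useless, because we get all documents containing tablet or halter instead of tablet and halter. Our current ES 2 based search handles this as expected. Did you find a solution?
@ThomasHaarbach, I did not find any Elasticsearch-based solution. So far, tokenizing the query with the Lucene hyphenation decompounder before sending it to ES looks like a workable workaround.
Hmm, that works. I have already tried a round trip through the ES analyze API, and the results were also bad, because the original term is in the token stream, e.g. [kaffeetasse, kaffee, tasse]. Opened this topic: discuss.elastic.co/t/…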

Tags: elasticsearch lucene

【Solution 1】:

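Elasticsearch returns both documents because it applies the operator parameter to the original query kaffeetasse, not to the tokens kaffee and tasse produced by the analyzer. This behavior is described in the documentation for the match query:

operator (Optional, string) Boolean logic used to interpret text in the query value.

Because the original query is a single word, the operator parameter has no effect.

As a workaround, you can perform the search in two steps:

1. Use the analyze API to analyze your original query string:

 curl -XGET "http://localhost:9200/testidx/_analyze" -H 'Content-Type: application/json' -d'{"analyzer": "search", "text": "kaffeetasse"}'

2. Use the tokens returned by the search analyzer as the query terms of a multi_match query, with the operator parameter set to and and the analyzer parameter set to whitespace (to prevent the already-analyzed tokens from being analyzed again by the search analyzer):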

 curl -XGET "http://localhost:9200/testidx/_search" -H 'Content-Type: application/json' -d'{ "query": {"multi_match": {"query": "kaffee tasse", "fields": ["title", "description"], "operator": "and", "type": "cross_fields", "analyzer": "whitespace"}}}'
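
The two steps can be glued together on the client side. The following is a minimal Python sketch, not part of the original answer: it takes the JSON body returned by the _analyze call and builds the request body for the second call. The function names tokens_for_and_query and build_multi_match are hypothetical. It assumes the decompounder emits the subwords at the same position as the original compound, and it uses the heuristic that the longest token at a shared position is the original compound, which is dropped so that the token stream [kaffeetasse, kaffee, tasse] mentioned in the comments does not require kaffeetasse to match as well.

```python
from collections import defaultdict


def tokens_for_and_query(analyze_response):
    """Pick the tokens to AND together from an _analyze response body.

    Heuristic (an assumption, not documented Elasticsearch behavior):
    when several tokens share one position, the longest one is taken to
    be the original compound and is dropped in favor of its subwords.
    """
    by_position = defaultdict(list)
    for tok in analyze_response.get("tokens", []):
        by_position[tok["position"]].append(tok["token"])

    result = []
    for position in sorted(by_position):
        group = by_position[position]
        if len(group) > 1:
            longest = max(group, key=len)
            subwords = [t for t in group if t != longest]
            # Keep the group untouched if filtering would empty it.
            group = subwords or group
        for token in group:
            if token not in result:  # avoid repeating a term
                result.append(token)
    return result


def build_multi_match(analyze_response, fields):
    """Build the step-2 request body from the step-1 response."""
    terms = tokens_for_and_query(analyze_response)
    return {
        "query": {
            "multi_match": {
                "query": " ".join(terms),
                "fields": list(fields),
                "operator": "and",
                "type": "cross_fields",
                # whitespace keeps the already-analyzed tokens as they are
                "analyzer": "whitespace",
            }
        }
    }
```

The returned dictionary can be serialized with json.dumps and sent as the body of the _search request shown above. A word that the decompounder leaves alone produces a single token at its position and is passed through unchanged.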

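【Comments】:

I am worried about the extra time that the round trip for step 1 takes. I am considering extracting the Lucene hyphenation decompounder functionality and using it to pre-process the query before sending it to Elasticsearch.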
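
To illustrate the pre-processing idea without the extra round trip, here is a rough Python sketch. It is a plain dictionary-based splitter and a stand-in, not Lucene's hyphenation decompounder: it uses no hyphenation patterns, and it only splits a word when dictionary entries cover it completely. The names decompound and preprocess_query are hypothetical; word_list and min_subword_size mirror the filter settings from the question.

```python
def decompound(word, word_list, min_subword_size=4):
    """Split a word into dictionary subwords that cover it completely.

    Returns [word] (lowercased) when no complete split exists.
    """
    word = word.lower()
    vocab = {w.lower() for w in word_list if len(w) >= min_subword_size}

    def split(rest):
        if not rest:
            return []
        # Try the longest possible prefix first, then shorter ones.
        for end in range(len(rest), min_subword_size - 1, -1):
            head = rest[:end]
            if head in vocab:
                tail = split(rest[end:])
                if tail is not None:
                    return [head] + tail
        return None  # no dictionary word fits here

    parts = split(word)
    return parts if parts else [word]


def preprocess_query(query, word_list, min_subword_size=4):
    """Decompound every whitespace-separated word of the query string."""
    terms = []
    for word in query.split():
        terms.extend(decompound(word, word_list, min_subword_size))
    return " ".join(terms)
```

The resulting string can be used as the query value of the multi_match request with "operator": "and" and "analyzer": "whitespace", as in step 2 of the answer. Because a complete cover is required, a compound with a suffix or linking element that is missing from the word list stays unsplit.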