Elasticsearch cross_fields search with an edge ngram analyzer: answers

【Question title】: Elastic search cross fields, edge ngram analyzer
【Posted】: 2017-02-09 15:18:16
【Question】:

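I have 999 documents that I am using to experiment with Elasticsearch.

My type mapping has an analyzed field, f4, with the following analyzer settings: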

  "myNGramAnalyzer" => [
       "type" => "custom",
        "char_filter" => ["html_strip"],
        "tokenizer" => "standard",
        "filter" => ["lowercase","standard","asciifolding","stop","snowball","ngram_filter"]
  ]

My filter is defined as follows:

  "filter" => [
        "ngram_filter" => [
            "type" => "edgeNGram",
            "min_gram" => "2",
            "max_gram" => "20"
        ]
  ]

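Field f4 holds values such as "Proj1", "Proj2", "Proj3", and so on.

Now, when I run a cross_fields search for the string "proj1", I expect the document with "Proj1" to come back at the top of the response with the highest score. That is not what happens. The rest of the data is nearly identical across documents.

I also don't understand why the query matches all 999 documents.

Here is my search: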

{
    "index": "myindex",
    "type": "mytype",
    "body": {
        "query": {
            "multi_match": {
                "query": "proj1",
                "type": "cross_fields",
                "operator": "and",
                "fields": "f*"
            }
        },
        "filter": {
            "term": {
                "deleted": "0"
            }
        }
    }
}

My search results are:

{
    "took": 12,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 999,
        "max_score": 1,
        "hits": [{
            "_index": "myindex",
            "_type": "mytype",
            "_id": "42",
            "_score": 1,
            "_source": {
                "f1": "396","f2": "125650","f3": "BH.1511AI.001",
                "f4": "Proj42",
                "f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0"
            }
        }, {
            "_index": "myindex",
            "_type": "mytype",
            "_id": "47",
            "_score": 1,
            "_source": {
                "f1": "396","f2": "137946","f3": "BH.152096.001",
                "f4": "Proj47",
                "f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0"
            }
        }, 
        //.......
        //.......
        //MANY RECORDS IN BETWEEN HERE
        //.......
        //.......
        {
            "_index": "myindex",
            "_type": "mytype",
            "_id": "1",
            "_score": 1,
            "_source": {
                "f1": "396","f2": "142095","f3": "BH.705215.001",
                "f4": "Proj1",
                "f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0"
            }
        //.......
        //.......
        //MANY RECORDS IN BETWEEN HERE
        //.......
        //.......
        }]
    }
}

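What am I doing wrong, or what am I missing? (Apologies for the long question; I wanted to provide all the relevant information while leaving out unrelated code.)

Edited:

Term vector response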

{
    "_index": "myindex",
    "_type": "mytype",
    "_id": "10",
    "_version": 1,
    "found": true,
    "took": 9,
    "term_vectors": {
        "f4": {
            "field_statistics": {
                "sum_doc_freq": 5886,
                "doc_count": 999,
                "sum_ttf": 5886
            },
            "terms": {
                "pr": {
                    "doc_freq": 999,
                    "ttf": 999,
                    "term_freq": 1,
                    "tokens": [{
                        "position": 0,
                        "start_offset": 0,
                        "end_offset": 6
                    }]
                },
                "pro": {
                    "doc_freq": 999,
                    "ttf": 999,
                    "term_freq": 1,
                    "tokens": [{
                        "position": 0,
                        "start_offset": 0,
                        "end_offset": 6
                    }]
                },
                "proj": {
                    "doc_freq": 999,
                    "ttf": 999,
                    "term_freq": 1,
                    "tokens": [{
                        "position": 0,
                        "start_offset": 0,
                        "end_offset": 6
                    }]
                },
                "proj1": {
                    "doc_freq": 111,
                    "ttf": 111,
                    "term_freq": 1,
                    "tokens": [{
                        "position": 0,
                        "start_offset": 0,
                        "end_offset": 6
                    }]
                },
                "proj10": {
                    "doc_freq": 11,
                    "ttf": 11,
                    "term_freq": 1,
                    "tokens": [{
                        "position": 0,
                        "start_offset": 0,
                        "end_offset": 6
                    }]
                }
            }
        }
    }
}
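
The doc_freq values in this response are consistent with how an edge n-gram filter expands each token. Below is a minimal Python sketch, not the Elasticsearch implementation. It assumes f4 holds exactly "Proj1" through "Proj999", one value per document, and it models only the lowercase and edge n-gram steps of the analyzer. The helper names edge_ngrams and doc_freq are made up for illustration.

```python
def edge_ngrams(token, min_gram=2, max_gram=20):
    """Return the prefixes of `token` that are min_gram..max_gram characters long."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]

# Assumed corpus: f4 is "Proj1" .. "Proj999", lowercased before the n-gram step.
docs = {i: edge_ngrams(f"proj{i}") for i in range(1, 1000)}

def doc_freq(term):
    """Number of documents whose indexed grams include `term`."""
    return sum(1 for grams in docs.values() if term in grams)

print(edge_ngrams("proj1"))                    # ['pr', 'pro', 'proj', 'proj1']
print(doc_freq("pr"))                          # 999
print(doc_freq("proj1"))                       # 111
print(doc_freq("proj10"))                      # 11
print(sum(len(g) for g in docs.values()))      # 5886
```

Under that assumption the sketch reproduces the reported numbers: "pr", "pro" and "proj" occur in all 999 documents, "proj1" in 111 (Proj1, Proj10 to Proj19, Proj100 to Proj199), "proj10" in 11, and the gram total equals sum_doc_freq.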

Edit 2

Mapping for field f4

"f4" : {
    "type" : "string",
    "index_analyzer" : "myNGramAnalyzer",
    "search_analyzer" : "standard"
}

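I switched to the standard analyzer at query time, which improved the results, but they are still not what I expect.

The query now returns 111 documents instead of all 999, for example "Proj1", "Proj11", "Proj111", ... "Proj1", "Proj181", and so on.

"Proj1" still appears somewhere in the middle of the results rather than at the top.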
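
One way to read the drop from 999 to 111 hits is that the search analyzer decides which terms the query text turns into. The Python sketch below uses a hypothetical corpus ("Proj1" to "Proj999" in f4) and assumes that grams produced from a single query token act as alternatives, so a document matches if it contains any one of them. That assumption is consistent with the counts reported here, but it is not taken from the Elasticsearch source. edge_ngrams and matching_docs are illustrative names.

```python
def edge_ngrams(token, min_gram=2, max_gram=20):
    """Prefixes of `token` from min_gram up to max_gram characters."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

# Hypothetical index: each document's f4 value expanded at index time.
index = {i: set(edge_ngrams(f"proj{i}")) for i in range(1, 1000)}

def matching_docs(query_terms):
    """IDs of documents whose indexed grams contain at least one query term."""
    wanted = set(query_terms)
    return [doc_id for doc_id, grams in index.items() if grams & wanted]

# Query analyzed with the n-gram analyzer: "proj1" becomes four terms,
# and the short gram "pr" is present in every document.
ngram_query = edge_ngrams("proj1")
# Query analyzed with the standard analyzer: a single lowercased term.
standard_query = ["proj1"]

print(len(matching_docs(ngram_query)))     # 999
print(len(matching_docs(standard_query)))  # 111
```

In this model each of the 111 remaining documents contains the term proj1 exactly once, so the term match by itself offers little to separate "Proj1" from "Proj11" or "Proj181".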

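【Comments】:

Could you check the term vectors for one of the documents: elastic.co/guide/en/elasticsearch/reference/current/…
@alpert I have updated the question with the term vector response.
Could you change the type of the multi_match query from cross_fields to best_fields and check whether the results are what you want?
I already tried that; there was no improvement.
Could you post the mapping for your myIndex?

Tags: java amazon-web-services elasticsearch full-text-search search-engine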

【Answer 1】:

There is no index_analyzer (at least not in Elasticsearch 1.7). As mapping parameters you can use analyzer and search_analyzer. Try the following steps to get it working.

Create myindex with the analyzer settings:

PUT /myindex
{
   "settings": {
     "analysis": {
         "filter": {
            "ngram_filter": {
               "type": "edge_ngram",
               "min_gram": 2,
               "max_gram": 20
            }
         },
         "analyzer": {
            "myNGramAnalyzer": {
               "type": "custom",
               "tokenizer": "standard",
               "char_filter": "html_strip",
               "filter": [
                  "lowercase",
                  "standard",
                  "asciifolding",
                  "stop",
                  "snowball",
                  "ngram_filter"
               ]
            }
         }
      }
   }
}

Add a mapping for mytype (to keep it short, I only mapped the relevant fields):

PUT /myindex/_mapping/mytype
{
   "properties": {
      "f1": {
         "type": "string"
      },
      "f4": {
         "type": "string",
         "analyzer": "myNGramAnalyzer",
         "search_analyzer": "standard"
      },
      "deleted": {
         "type": "string"
      }
   }
}

Index some data:

PUT myindex/mytype/1
{
    "f1":"396",
    "f4":"Proj12" ,
    "deleted": "0"
}

PUT myindex/mytype/2
{
    "f1":"42",
    "f4":"Proj22" ,
    "deleted": "1"
}

Now try your query:

GET myindex/mytype/_search
{
   "query": {
      "multi_match": {
         "query": "proj1",
         "type": "cross_fields",
         "operator": "and",
         "fields": "f*"
      }
   },
   "filter": {
      "term": {
         "deleted": "0"
      }
   }
}

It should return document #1. It works for me in Sense. I am using Elasticsearch 2.X.

Hope this helps :)

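【Comments】:

Have you tried adding documents whose f4 is Proj1, Proj11, Proj12, Proj13, Proj121, Proj111? My setup does not work for those. It already works for the documents you used in your example.
Also, I know about index_analyzer; I am using an older version that supports it.
When I index: PUT myindex/mytype/_bulk { "index" : { "_id" : "1" } } { "f1":"396","f4":"Proj1","deleted":"0" } { "index" : { "_id" : "2" } } { "f1":"396","f4":"Proj11","deleted":"0" } { "index" : { "_id" : "3" } } { "f1":"396","f4":"Proj13","deleted":"1" } { "index" : { "_id" : "4" } } { "f1":"396","f4":"Proj121","deleted":"1" } { "index" : { "_id" : "5" } } { "f1":"396","f4":"Proj111","deleted":"1" } the documents I get back are #1 and #2. Isn't that what you want?
Also, I see that the query in the question matches all documents. Do you mean all documents starting with proj1? I ask because the result hits include proj47. Could you share which Elasticsearch version you are using?
It may return that for you and not for me because my documents were not indexed in the same order, and I am using 1.5.2. As for "Proj42", it is no longer returned: I now have different index and search analyzers, which I was not using before, and that is why it showed up in the response.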

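【Answer 2】:

After spending several hours looking for a solution, I finally got it working.

I kept everything described in the question as it was, including the n-gram analyzer used when indexing the data. The only thing I had to change was to use the all field (presumably _all) in my search query, as a bool query together with my existing multi_match query.

Now, for the search text Proj1, the results come back in the order Proj1, Proj121, Proj11, and so on.

This is not the exact order Proj1, Proj11, Proj121, and so on, but it is still very close to what I wanted.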
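
The answer does not show the final query, so the following Python sketch is only a guess at its shape. It assumes the existing multi_match clause and a match clause on the _all field are combined as should clauses of a bool query; the choice of should, the match query, and the helper name build_search_body are assumptions, not the author's confirmed request.

```python
import json

def build_search_body(text):
    """Combine the question's multi_match with a match on _all inside a bool query."""
    return {
        "query": {
            "bool": {
                "should": [
                    {
                        # Unchanged from the question.
                        "multi_match": {
                            "query": text,
                            "type": "cross_fields",
                            "operator": "and",
                            "fields": "f*",
                        }
                    },
                    # Assumed clause: a plain match query against the _all field.
                    {"match": {"_all": text}},
                ]
            }
        },
        # Kept from the question's request.
        "filter": {"term": {"deleted": "0"}},
    }

body = build_search_body("Proj1")
print(json.dumps(body, indent=2))
```

The printed JSON could be sent as the body of the same search request used in the question. Whether it reproduces the ordering described above would need to be checked against the actual index.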

【Comments】: