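【Posted】: 2020-04-23 12:53:05
【Question】:
I am trying to create an Elasticsearch index that tokenizes data in a specific way. For a string like Los Angeles (and vicinity), California, United States of America, I want to exclude symbols such as ( ) , and keep only alphanumeric characters.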
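To make the goal concrete, here is a minimal Python sketch of the tokens I am hoping to end up with. It only illustrates the expected result and is not Elasticsearch code; the helper name alnum_tokens is made up for this example, and it assumes ASCII letters and digits only.

```python
import re
from typing import List


def alnum_tokens(text: str) -> List[str]:
    """Lowercase the text and return its runs of ASCII letters and digits."""
    # Anything that is not a-z or 0-9 (spaces, parentheses, commas) acts as a separator.
    return re.findall(r"[a-z0-9]+", text.lower())


sample = "Los Angeles (and vicinity), California, United States of America"
tokens = alnum_tokens(sample)
print(tokens)
```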
My ES index settings are as follows:
PUT /test-index
{
  "settings": {
    "analysis": {
      "filter": {
        "extract_alpha": {
          "type": "keep_types",
          "mode": "include",
          "types": [
            "<ALPHANUM>"
          ]
        }
      },
      "analyzer": {
        "my_autocomplete": {
          "type": "custom",
          "tokenizer": "my_tokenizer",
          "filter": [
            "lowercase",
            "extract_alpha"
          ]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_autocomplete"
      }
    }
  }
}
However, when I run this query:
GET test-index/_analyze
{
  "analyzer": "my_autocomplete",
  "text": "Los Angeles (and vicinity), California, United States of America"
}
I get this output:
{
  "tokens" : [ ]
}
If I remove the extract_alpha filter, I do get tokens, but they still contain the symbols:
{
  "tokens" : [
    {
      "token" : "los",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "angeles",
      "start_offset" : 4,
      "end_offset" : 11,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "(and",
      "start_offset" : 12,
      "end_offset" : 16,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "vicinity),",
      "start_offset" : 17,
      "end_offset" : 27,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "california,",
      "start_offset" : 28,
      "end_offset" : 39,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "united",
      "start_offset" : 40,
      "end_offset" : 46,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "states",
      "start_offset" : 47,
      "end_offset" : 53,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "of",
      "start_offset" : 54,
      "end_offset" : 56,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "america",
      "start_offset" : 57,
      "end_offset" : 64,
      "type" : "word",
      "position" : 8
    }
  ]
}
How can I fix this? What am I doing wrong?
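One thing the output above does show is that every token coming out of the whitespace tokenizer has "type" : "word", while extract_alpha is configured to keep only the type <ALPHANUM>. The Python sketch below simulates that interaction under simplifying assumptions. It is not Elasticsearch code, and Token, whitespace_tokenize and keep_types are hypothetical helpers written only for this illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Token:
    text: str
    type: str


def whitespace_tokenize(text: str) -> List[Token]:
    # Assumption taken from the _analyze output: split on whitespace
    # and label every resulting token with the type "word".
    return [Token(part, "word") for part in text.split()]


def keep_types(tokens: Iterable[Token], types: Iterable[str], mode: str = "include") -> List[Token]:
    # Simplified stand-in for a type-based filter: it looks at the token's
    # type label only, never at the characters inside the token.
    allowed = set(types)
    if mode == "include":
        return [t for t in tokens if t.type in allowed]
    return [t for t in tokens if t.type not in allowed]


sample = "Los Angeles (and vicinity), California, United States of America"
lowered = [Token(t.text.lower(), t.type) for t in whitespace_tokenize(sample)]
kept = keep_types(lowered, ["<ALPHANUM>"])

print([t.text for t in lowered])  # tokens still carry "(", ")" and ","
print(kept)                       # [] because no token has type "<ALPHANUM>"
```

If that reading is correct, the filter can only match when the tokenizer emits <ALPHANUM> tokens, as the standard tokenizer does. That is an assumption to verify with _analyze rather than a confirmed fix.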
【Comments】:
Tags: elasticsearch