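【Posted】:2017-03-07 01:44:42
【Question】:
I am trying to implement an Elasticsearch pattern_capture filter that turns EDR-00004 into the tokens [EDR-00004, 00004, 4]. I am (still) using Elasticsearch 2.4, but the documentation for this filter does not differ from that of the current ES version.
I followed the example in the documentation: https://www.elastic.co/guide/en/elasticsearch/reference/2.4/analysis-pattern-capture-tokenfilter.html
Here are my tests and their results: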
curl -XPUT 'localhost:9200/test_index' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "process_number_filter": {
          "type": "pattern_capture",
          "preserve_original": 1,
          "patterns": [
            "([A-Za-z]+-([0]+([0-9]+)))"
          ]
        }
      },
      "analyzer": {
        "process_number_analyzer": {
          "type": "custom",
          "tokenizer": "pattern",
          "filter": ["process_number_filter"]
        }
      }
    }
  }
}'
curl -XGET 'localhost:9200/test_index/_analyze' -d '
{
  "analyzer": "process_number_analyzer",
  "text": "EDR-00002"
}'
curl -XGET 'localhost:9200/test_index/_analyze' -d '
{
  "analyzer": "standard",
  "tokenizer": "standard",
  "filter": ["process_number_filter"],
  "text": "EDR-00002"
}'
This returns:
{"acknowledged":true}
{
  "tokens": [{
    "token": "EDR",
    "start_offset": 0,
    "end_offset": 3,
    "type": "word",
    "position": 0
  }, {
    "token": "00002",
    "start_offset": 4,
    "end_offset": 9,
    "type": "word",
    "position": 1
  }]
}
{
  "tokens": [{
    "token": "edr",
    "start_offset": 0,
    "end_offset": 3,
    "type": "<ALPHANUM>",
    "position": 0
  }, {
    "token": "00002",
    "start_offset": 4,
    "end_offset": 9,
    "type": "<NUM>",
    "position": 1
  }]
}
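Both responses show the text already split at the hyphen before the filter sees it. The following is a rough Python simulation of that pipeline, not Elasticsearch's actual implementation. It assumes the pattern tokenizer uses its default `\W+` separator and that the pattern_capture filter is applied to each token separately. The helper names `pattern_tokenize` and `capture_filter` are hypothetical, and skipping duplicate captures is a simplification.

```python
import re

# The regex from the filter definition in the question.
PATTERN = re.compile(r"([A-Za-z]+-([0]+([0-9]+)))")


def pattern_tokenize(text, separator=r"\W+"):
    """Split on non-word characters (assumed tokenizer default)."""
    return [tok for tok in re.split(separator, text) if tok]


def capture_filter(token, pattern=PATTERN, preserve_original=True):
    """Emit the token itself (optionally) plus every capture group found in it."""
    out = [token] if preserve_original else []
    for match in pattern.finditer(token):
        for group in match.groups():
            if group and group not in out:  # simplification: skip duplicates
                out.append(group)
    return out


tokens = pattern_tokenize("EDR-00002")
print(tokens)                               # ['EDR', '00002']
print([capture_filter(t) for t in tokens])  # [['EDR'], ['00002']]
print(capture_filter("EDR-00002"))          # ['EDR-00002', '00002', '2']
```

In this sketch the pattern never matches, because neither `EDR` nor `00002` alone contains the letters-hyphen-digits shape the regex requires. Only when the filter receives the whole string as a single token do the captures appear.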
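I understand that:
- I don't have to group the whole regex, since I have preserve_original set
- I could replace some parts with \d and/or \w, but this way I don't have to think about escaping.
I also made sure that my regex is correct: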
>>> import re
>>> m = re.match(r"([A-Za-z]+-([0]+([0-9]+)))", "EDR-00004")
>>> m.groups()
('EDR-00004', '00004', '4')
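The two points in the list above can be checked with a short Python sketch. It only models the idea "original token plus capture groups" and is not the filter's real code. The helper `emitted` and the shorter pattern `WITHOUT_OUTER` are hypothetical and introduced here for illustration.

```python
import json
import re

TEXT = "EDR-00004"
WITH_OUTER = r"([A-Za-z]+-([0]+([0-9]+)))"  # pattern from the question
WITHOUT_OUTER = r"[A-Za-z]+-(0+(\d+))"      # hypothetical variant: no outer group, uses \d


def emitted(text, pattern, preserve_original=True):
    """Original token (if preserved) followed by each distinct capture group."""
    tokens = [text] if preserve_original else []
    for match in re.finditer(pattern, text):
        for group in match.groups():
            if group and group not in tokens:
                tokens.append(group)
    return tokens


print(emitted(TEXT, WITH_OUTER))     # ['EDR-00004', '00004', '4']
print(emitted(TEXT, WITHOUT_OUTER))  # ['EDR-00004', '00004', '4']

# Inside a JSON request body the backslash in \d has to be doubled.
print(json.dumps({"patterns": [WITHOUT_OUTER]}))
# {"patterns": ["[A-Za-z]+-(0+(\\d+))"]}
```

With the original token preserved, the outer group adds nothing in this model, and the last line shows the escaping that the character-class version of the pattern avoids.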
【Discussion】:
Tags: elasticsearch