Regarding whether to use a NoSQL store such as MongoDB, or ES, as the storage layer, there is some discussion online; I recommend https://blog.csdn.net/awdac/article/details/78117393. In short, ES can be thought of as a smarter acceleration layer than Redis, but it should not be the system of record. This is similar to the caching mechanisms of many databases — Oracle's result-set cache and TimesTen, or MySQL's query cache — just aimed at different scenarios (e.g. it can be combined with semantic search). Consequently its write throughput is relatively low, and it is much heavier than Redis.
- Wikipedia uses Elasticsearch for full-text search
- GitHub uses Elasticsearch to search code
- Built on Lucene: Elasticsearch is to Lucene roughly what a SQL layer is to an RDBMS engine
- Written in Java
Start: ./bin/elasticsearch -d runs it in daemon mode.
http://localhost:9200/?pretty shows the version and other basic information.
Config file: config/elasticsearch.yml
Clustered by design, like RocketMQ and Kafka.
Nodes communicate with each other over port 9300.
Request format: '<PROTOCOL>://<HOST>/<PATH>?<QUERY_STRING>' -d '<BODY>', where BODY is a JSON-encoded request body.
Elasticsearch uses JSON as its serialization format.
The database-to-ES correspondence is:
Relational DB ⇒ Databases ⇒ Tables ⇒ Rows ⇒ Columns
Elasticsearch ⇒ Indices ⇒ Types ⇒ Documents ⇒ Fields
An ES cluster contains multiple indices. An index is a logical namespace that points to one or more shards, comparable to an Oracle segment. A shard is a Lucene instance. Shards are the unit by which Elasticsearch distributes data across the cluster; it migrates shards between nodes automatically as the cluster grows or shrinks. A shard is either a primary or a replica. This is the same cluster-management model as Couchbase. By default an index has 5 primary shards (reduced to 1 from ES 7.x).
Three client tools cover day-to-day ES work: Postman (pick the REST verb from the dropdown; use POST for searches, since passing a JSON body with GET is awkward), curl, and the bundled Dev Tools console (Dev Tools is a bit special and has better compatibility for some commands, e.g. reindex). All three speak the same REST API. head works too, but is too simplistic.
List all indices
GET _cat/indices
(columns: health status index uuid pri rep docs.count docs.deleted store.size pri.store.size)
yellow open wordbaseinfo_new KFKrcmJoQqWP9kyLzokLQw 1 1 18990 999 174.9mb 174.9mb
yellow open search_doc_new_test RjfMfH5-Sdmh7rIgNoWRfw 1 1 2261 0 83.9mb 83.9mb
yellow open testsearch 3nFp58OXSCCDCZKNBSr8yg 1 1 0 0 208b 208b
green open .kibana-event-log-7.9.0-000004 zrGu0cA0Sle1GHIV2w-szQ 1 0 0 0 208b 208b
green open .kibana-event-log-7.9.0-000005 8r7NEIxHSeGt1qCX98TFlg 1 0 0 0 208b 208b
green open .kibana-event-log-7.9.0-000006 KaC-CnfhTDC81EZMUd6XeQ 1 0 0 0 208b 208b
green open .kibana-event-log-7.9.0-000007 nE_Wv1ibQIW9cRGSt_IZfg 1 0 0 0 208b 208b
green open .apm-custom-link CksbWamWQvaafywczHUbwA 1 0 0 0 208b 208b
yellow open fais_search OEjrM5YwSJulOhD3T2y7Ig 1 1 64 15 2.6mb 2.6mb
green open .kibana_task_manager_1 dhMlGVLjQ7Kq-VdrtI6RMg 1 0 6 20650 14.2mb 14.2mb
yellow open inrulebaseinfo_new om0AqwSPRVqVq6GClq42zQ 1 1 8 10 325kb 325kb
yellow open fais_test nDG2Ou9MSyKaShcB4kLzBA 1 1 0 0 208b 208b
yellow open fail_search_test Gfe4cbi9RX-Dk9fxoAQH3g 1 1 51339 42 907.7mb 907.7mb
yellow open word_item W9m8FuFRTzaagZU29y78mw 1 1 0 0 208b 208b
yellow open search_doc_new_ic tCZigJFUTn6OWEQ3dH013A 1 1 75783 0 2.9gb 2.9gb
yellow open wordbaseinfo_new_for_test fN12XUf6ScCdkIcI01IhfQ 1 1 18854 8440 139.4mb 139.4mb
yellow open worditem uxkzSZToTp6cVkXdwsXSDg 1 1 0 0 208b 208b
green open .apm-agent-configuration zaONhEkUTKqnAZbbTzCs0Q 1 0 0 0 208b 208b
yellow open inrulebaseinfo_new_for_test SWj5BfMWTRyJH8WX7aXCKQ 1 1 0 0 208b 208b
yellow open casebaseinfo rfqoCTfGQqOaCNRtbbkS_Q 1 1 17843 0 55.2mb 55.2mb
yellow open time_test lW9FMLz1TuKzy6inK-gG0A 1 1 0 0 208b 208b
green open .kibana_1 -vM1KSWdQG2zshWD4K0PPg 1 0 615 7 10.4mb 10.4mb
yellow open article IpktM1wTSPO6B1Tp-eEiXA 1 1 1056 0 6.1mb 6.1mb
green open .tasks 1DlF3FRSSvq2sB4ikpydCw 1 0 5 0 20.2kb 20.2kb
yellow open search_doc_new_ic1 EUNqO51GTTGSycoHYhfZoA 1 1 0 0 208b 208b
yellow open search_doc_new_ic_zjhua mARAxLD5QBGQFC6VcCdVVA 1 1 75783 0 3.5gb 3.5gb
yellow open casebaseinfo_for_test Tt9EX2yYSHGDxeunzM4D5g 1 1 16981 0 50mb 50mb
Create an index
PUT http://localhost:9200/blogs
{
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1
}
}
Reindexing
PUT search_doc_new_ic_zjhua   (create the target index first)
POST _reindex
{
  "source": { "index": "search_doc_new_ic" },
  "dest": { "index": "search_doc_new_ic_zjhua" }
}
It completed successfully, yet on 7.14.1 GET _tasks?actions=indices:data/write/reindex came back empty.
The client easily times out; monitor progress with GET _tasks?actions=indices:data/write/reindex.
Caveats: https://www.dazhuanlan.com/dolores63134/topics/1364488
Reindexing can lose data; see https://segmentfault.com/q/1010000019003891.
Another approach is to rebuild the index directly (it must be a rebuild); see https://blog.csdn.net/yexiaomodemo/article/details/97979376.
Create a document
Format: PUT {index}/{type}/{id}; from 7.x this must be changed to PUT {index}/_doc/{id}
With postman: PUT http://localhost:9200/megacorp/employee/1 -d '{
"first_name" : "John",
"last_name" : "Smith",
"age" : 25,
"about" : "I love to go rock climbing",
"interests": [ "sports", "music" ]
}'
Returns: {"_index":"megacorp","_type":"employee","_id":"2","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"created":true}
ES 7.x removed types.
We have long equated an ES "index" with a relational "database" and a "type" with a table. The ES developers consider this a bad analogy: in a relational database two tables are independent, and same-named columns in different tables do not interfere with each other, but that is not true in ES.
Elasticsearch is built on Lucene, and fields with the same name under different types of one index are handled identically in Lucene. Two user_name fields under two different types of the same index are effectively the same field, so you must define identical field mappings in both types; otherwise same-named fields across types conflict and Lucene's processing efficiency drops.
Removing types lets data live in separate indices, so identical field names no longer collide. As the ES tagline has said from the start — "You know, for search" — dropping types is about making ES process data more efficiently.
Beyond that, storing entities with different field sets under different types of one index produces sparse data, which hurts Lucene's ability to compress documents and thus lowers ES query efficiency.
If no ID is set, ES auto-generates one. For example:
{
"_index": "megacorp",
"_type": "employee",
"_id": "1",
"_version": 13,
"result": "updated",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"created": false
}
_version counts the number of changes. As a rule, ids should not be auto-generated.
The shard a document is stored in is determined by: shard = hash(routing) % number_of_primary_shards
routing defaults to the document's _id.
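The formula can be sketched in Python. This is a simplified stand-in (ES actually hashes routing with murmur3; md5 below is only a stable substitute), but it shows why number_of_primary_shards cannot be changed after index creation: a different shard count would re-route most documents.

```python
import hashlib

def pick_shard(routing: str, number_of_primary_shards: int) -> int:
    # shard = hash(routing) % number_of_primary_shards
    # md5 stands in for ES's murmur3: any stable hash illustrates the idea.
    h = int(hashlib.md5(routing.encode("utf-8")).hexdigest(), 16)
    return h % number_of_primary_shards

# The same routing value (by default the _id) always maps to the same shard.
assert pick_shard("doc-42", 5) == pick_shard("doc-42", 5)
assert 0 <= pick_shard("doc-42", 5) < 5
```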
By default replication=sync, and number_of_replicas defaults to 1.
The PUT above auto-creates the index megacorp, with type employee and id 1.
Fetch a document
GET http://localhost:9200/megacorp/employee/1
When the document exists:
{"_index":"megacorp","_type":"employee","_id":"1","_version":1,"found":true,"_source":{
"first_name" : "John",
"last_name" : "Smith",
"age" : 25,
"about" : "I love to go rock climbing",
"interests": [ "sports", "music" ]
}}
_source contains the original JSON document.
http://localhost:9200/megacorp/employee/111
When the document does not exist:
{"_index":"megacorp","_type":"employee","_id":"111","found":false}
and the HTTP status code is 404 (a HEAD request returns 404 as well).
Fetch specific fields
GET http://localhost:9200/megacorp/employee/1?_source=first_name
{
"_index": "megacorp",
"_type": "employee",
"_id": "1",
"_version": 13,
"found": true,
"_source": {
"first_name": "John"
}
}
Delete
DELETE http://localhost:9200/megacorp/employee/111
{
"found": true,
"_index": "megacorp",
"_type": "employee",
"_id": "1",
"_version": 2,
"result": "deleted",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
}
}
Fuzzy search
Exact-match lookups hardly justify ES; fuzzy, full-text search is the point.
GET /_search
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 5,
"max_score": 1,
"hits": [{
"_index": "megacorp",
"_type": "employee",
"_id": "AV3Kp7BqVnBASvmzDScd",
"_score": 1,
"_source": {
"first_name": "John",
"last_name": "Smith",
"age": 25,
"about": "I love to go rock climbing",
"interests": [
"sports",
"music"
]
}
},
{
"_index": "megacorp",
"_type": "employee",
"_id": "2",
"_score": 1,
"_source": {
"first_name": "Jane",
"last_name": "Smith",
"age": 32,
"about": "I like to collect rock albums",
"interests": [
"music"
]
}
},
{
"_index": "megacorp",
"_type": "employee",
"_id": "AV3Kp5hsVnBASvmzDScc",
"_score": 1,
"_source": {
"first_name": "John",
"last_name": "Smith",
"age": 25,
"about": "I love to go rock climbing",
"interests": [
"sports",
"music"
]
}
},
{
"_index": "megacorp",
"_type": "employee",
"_id": "1",
"_score": 1,
"_source": {
"first_name": "John",
"last_name": "Smith",
"age": 25,
"about": "I love to go rock climbing",
"interests": [
"sports",
"music"
]
}
},
{
"_index": "megacorp",
"_type": "employee",
"_id": "3",
"_score": 1,
"_source": {
"first_name": "Douglas",
"last_name": "Fir",
"age": 35,
"about": "I like to build cabinets",
"interests": [
"forestry"
]
}
}
]
}
}
By default hits returns the top 10 matches, ordered by _score descending. For paging, add size and from: http://localhost:9200/megacorp/employee/_search?size=2&from=2
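The from/size arithmetic behind that URL is just an offset computation; a tiny helper (page_params is a hypothetical name, not an ES API) makes it explicit:

```python
def page_params(page: int, size: int) -> dict:
    """1-based page number -> ES pagination params (from = hits to skip)."""
    return {"from": (page - 1) * size, "size": size}

# size=2&from=2 in the URL above is exactly page 2 with 2 hits per page:
assert page_params(2, 2) == {"from": 2, "size": 2}
assert page_params(1, 10) == {"from": 0, "size": 10}
```

Note that deep paging is expensive: every shard has to produce from+size candidates before the coordinating node merges them.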
Searching across every field, i.e. true full-text search: http://localhost:9200/megacorp/employee/_search?q=John. Behind the scenes this queries all fields through an implicit _all field of type string (the _all field was disabled by default in 6.x and removed in 7.x).
For the full syntax see https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-syntax
The structure of a type/mapping (the schema definition)
GET search_doc_new_ic/_mapping /* before ES 7 this was my_index/_mapping/my_type; 7 removed types */
{
"search_doc_new_ic" : {
"mappings" : {
"properties" : {
"authors" : {
"type" : "keyword"
},
"content" : {
"properties" : {
"page_no" : {
"type" : "integer"
},
"paragraphs" : {
"type" : "text",
"index_options" : "offsets"
}
}
},
"doc_id" : {
"type" : "keyword"
},
"doc_source" : {
"type" : "keyword"
},
"file_id" : {
"type" : "keyword"
},
"file_name" : {
"type" : "keyword"
},
"fstore_group" : {
"type" : "keyword"
},
"fstore_path" : {
"type" : "keyword"
},
"industry_chain_nodes" : {
"properties" : {
"code" : {
"type" : "keyword"
},
"name" : {
"type" : "keyword"
}
}
},
"industry_chains" : {
"properties" : {
"code" : {
"type" : "keyword"
},
"name" : {
"type" : "keyword"
}
}
},
"industry_code" : {
"type" : "keyword"
},
"industry_name" : {
"type" : "keyword"
},
"invest_ranking" : {
"type" : "keyword"
},
"local_path" : {
"type" : "keyword"
},
"org_name" : {
"type" : "keyword"
},
"page_count" : {
"type" : "integer"
},
"publish_date" : {
"type" : "date"
},
"pv" : {
"type" : "integer"
},
"report_type" : {
"type" : "keyword"
},
"risk_ranking" : {
"type" : "keyword"
},
"secu_code" : {
"type" : "keyword"
},
"secu_name" : {
"type" : "keyword"
},
"sentiment" : {
"type" : "integer"
},
"summary" : {
"type" : "text",
"index_options" : "offsets"
},
"title" : {
"type" : "text",
"index_options" : "offsets"
}
}
}
}
}
ES infers the most suitable type automatically, e.g. text/long/date. ES is in fact strongly typed: if a long is wrongly mapped as a string, full-text queries will return unexpected results. Beyond the defaults, a field's mapping usually customizes two attributes: index (whether the field supports exact matching, full-text matching, or no search at all) and analyzer (which analyzer to use). A mapping cannot be modified; it can only be set at index creation or when adding new fields.
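Dynamic type inference can be imitated in a few lines. This is a rough sketch only, assuming a single date pattern; real ES applies configurable date_detection formats and richer numeric rules:

```python
from datetime import datetime

def guess_field_type(value):
    """Toy version of ES dynamic mapping inference for a JSON value."""
    if isinstance(value, bool):   # must test before int: bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        try:
            datetime.strptime(value, "%Y-%m-%d")  # assumed date pattern
            return "date"
        except ValueError:
            return "text"
    return "object"

assert guess_field_type(25) == "long"
assert guess_field_type("2021-09-18") == "date"
assert guess_field_type("rock climbing") == "text"
```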
Lucene does not support storing null values.
ajax support
In config/elasticsearch.yml, append:
http.cors.enabled: true
http.cors.allow-origin: "*"
so that web pages can reach ES via ajax.
query DSL vs filter DSL: queries perform full-text matching and produce a _score; filters perform exact matching.
text fields distinguish exact matching from full-text search; long/date and _id do not.
Elasticsearch builds an inverted index over every word of every text field.
Out of the box, exact matching is case-sensitive and does not fold plurals into singulars, which is usually not what we want, and Chinese word matching raises similar issues. This is what analyzers are for. The default is the standard analyzer, based on Unicode text segmentation. The language analyzers ES ships with are listed at https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html; Chinese is not among them, so by default every Chinese character becomes its own term. To move a field off the default analyzer, you must configure it by hand via a mapping (a.k.a. the schema definition — ES's DDL).
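The two points above — an inverted index per text field, plus an analyzer that normalizes terms — can be sketched together. This is a toy analyzer (lowercase + regex split); the real standard analyzer performs Unicode text segmentation:

```python
import re
from collections import defaultdict

def analyze(text):
    """Toy analyzer: lowercase, then split on non-alphanumerics."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def build_inverted_index(docs):
    """Map each term to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

idx = build_inverted_index({
    1: "I love to go Rock climbing",
    2: "I like to collect rock albums",
})
assert idx["rock"] == {1, 2}   # lowercasing makes "Rock" match "rock"
assert idx["climbing"] == {1}
```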
Query conditions are written in the DSL, i.e. in JSON. Every search result carries a _score indicating how well it matched.
Problems
Fielddata is disabled on text fields by default. Set fielddata=true on [interests] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.
Fix: http://blog.csdn.net/u011403655/article/details/71107415
https://www.elastic.co/guide/en/elasticsearch/reference/current/fielddata.html
Within a cluster, one node is elected master. It handles cluster-wide management — adding/removing indices and nodes — but not document-level operations.
Check the cluster health:
GET http://localhost:9200/_cluster/health
{
  "cluster_name": "elasticsearch",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 5,
  "active_shards": 5,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 5,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 50
}
The most important field is status. Its values are:
- green:All primary and replica shards are active.
- yellow: All primary shards are active, but not all replica shards are active. (On a single-node cluster, replica shards serve no purpose, so yellow is expected.)
- red:Not all primary shards are active.
When a second node starts, it automatically joins the cluster with the same cluster.name. After a node failure, Elasticsearch automatically promotes replica shards to primaries, so the cluster can resume serving.
Check the ES and Lucene versions (GET /):
{
"name" : "t2ztM-f",
"cluster_name" : "docker-cluster",
"cluster_uuid" : "DTTrGi_UR12p8Vbc9MTNAQ",
"version" : {
"number" : "6.3.2",
"build_flavor" : "oss",
"build_type" : "tar",
"build_hash" : "053779d",
"build_date" : "2018-07-20T05:20:23.451332Z",
"build_snapshot" : false,
"lucene_version" : "7.3.1",
"minimum_wire_compatibility_version" : "5.6.0",
"minimum_index_compatibility_version" : "5.0.0"
},
"tagline" : "You Know, for Search"
}
In Elasticsearch, every field of a document is indexed and can be used in a query.
Metadata includes:
- _id: uniquely identifies a document within a type
By default, ES sorts by relevance. To sort on a field instead, specify it as follows (note: the filtered query shown below is pre-5.x syntax; on 5.x+ use a bool query with a filter clause):
GET /_search
{
"query": {
"filtered": {
"filter": {
"term": {
"user_id": 1
}
}
}
},
"sort": {
"date": {
"order": "desc"
}
}
}
If the sort is not relevance-based, _score is not computed. Computing _score is expensive, so once sort is specified it is skipped by default; set track_scores=true to force it.
Multi-key ordering: first by date, then by relevance.
GET /_search
{
"query": {
"filtered": {
"query": {
"match": {
"tweet": "manage text search"
}
},
"filter": {
"term": {
"user_id": 2
}
}
}
},
"sort": [{
"date": {
"order": "desc"
}
},
{
"_score": {
"order": "desc"
}
}
]
}
Sorting on a full-text field is rarely meaningful; relevance is normally used instead.
ES keeps as much data as possible in memory to improve performance.
An ES search is a distributed search with two phases: query and fetch. In the query phase, the request is broadcast to (a copy of) every shard, and each shard returns its top N according to the sort condition.
Health at index and shard level
GET _cluster/health?level=indices
GET _cluster/health?level=shards
Node stats:
http://localhost:9200/_nodes/stats
Delete every document in an index without deleting the index itself
POST http://10.20.30.193:9200/search_doc_new_ic/_delete_by_query?refresh
{ "query": { "match_all": {} } }
Response:
{ "took": 147849, "timed_out": false, "total": 3789150, "deleted": 3789150, "batches": 3790, "version_conflicts": 0, "noops": 0, "retries": { "bulk": 0, "search": 0 }, "throttled_millis": 0, "requests_per_second": -1, "throttled_until_millis": 0, "failures": [] }
Note that deleting documents does not free disk space; the space is reclaimed later when segments merge.
How match, match_phrase and query_string differ in queries
match analyzes the query text and matches documents containing any of the resulting terms, regardless of order — like xxx::tsquery in PostgreSQL.
query_string takes raw text and, unlike match, first parses Lucene query syntax (AND/OR, wildcards, field prefixes) before analyzing — like to_tsquery(xxx, xxx).
match_phrase differs from match in that it is a phrase query: the terms must appear in order (and adjacent, unless slop is set), whereas match only requires containment — like phraseto_tsquery(xxx, xxx).
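The match vs match_phrase distinction can be demonstrated on token lists. This is a toy model over already-analyzed terms; real ES also applies analyzers, scoring, and a slop parameter for match_phrase:

```python
def match(doc_terms, query_terms):
    """match: a document hits if it contains any query term (default OR semantics)."""
    return any(t in doc_terms for t in query_terms)

def match_phrase(doc_terms, query_terms):
    """match_phrase: the query terms must occur adjacent and in order."""
    n = len(query_terms)
    return any(doc_terms[i:i + n] == query_terms
               for i in range(len(doc_terms) - n + 1))

doc = "i like to collect rock albums".split()
assert match(doc, ["rock", "climbing"])             # "rock" alone is enough for match
assert not match_phrase(doc, ["rock", "climbing"])  # no contiguous "rock climbing"
assert match_phrase(doc, ["collect", "rock"])       # order + adjacency satisfied
```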
Notes
Keep each JVM heap at no more than 32GB, ideally under 30GB (PostgreSQL has no such problem); beyond that, compressed oops are lost, and GC on a huge Java heap becomes a serious problem anyway. Elasticsearch and Lucene should each get roughly half the machine's memory: the former uses the JVM heap, the latter the OS filesystem cache. With that layout, for HA, set cluster.routing.allocation.same_shard.host: true to prevent a primary and its replica from being allocated to the same machine.
Aggregations are served by a data structure called fielddata. Fielddata is the biggest memory consumer in an ES cluster, so it must be fully understood.
Fielddata is a bit like an RDBMS data block, except row-oriented, loaded into memory on demand. It exists because inverted indices are not a silver bullet: they excel at finding the documents that contain a given term, but for the reverse — listing which terms a given document contains — they are useless, and aggregations need exactly that access pattern.
Installing ES on Linux
vi elasticsearch.yml
network.host: 0.0.0.0   (otherwise only the local machine can connect)
It cannot be run as root — the same as databases such as postgresql and oracle.
groupadd es
useradd -g es es
[2016-12-20T22:37:28,552][ERROR][o.e.b.Bootstrap ] [elk-node1] node validation exception
bootstrap checks failed
Fix: use CentOS 7 and this class of problem does not occur.
system call filters failed to install; check the logs and fix your configuration or disable system call filters at your own risk
Cause:
CentOS 6 does not support SecComp, while ES 5.2.0 defaults bootstrap.system_call_filter to true and runs the check; the failed check prevents ES from starting.
Fix:
In elasticsearch.yml set bootstrap.system_call_filter to false, below the Memory section:
bootstrap.memory_lock: false
bootstrap.system_call_filter: false
ES 7 errors
Starting Elasticsearch fails with:
ERROR: [1] bootstrap checks failed
[1]: the default discovery settings are unsuitable for production use; at least one of [discovery.seed_hosts, discovery.seed_providers, cluster.initial_master_nodes] must be configured
Fix: in
elasticsearch.yml
uncomment and keep a single node:
cluster.initial_master_nodes: ["node-1"]
Another error
[2021-09-18T22:15:24,063][ERROR][o.e.i.g.GeoIpDownloader ] [node-1] exception during geoip databases update java.net.ConnectException: Connection refused at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?] at sun.nio.ch.Net.pollConnectNow(Net.java:669) ~[?:?] at sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:549) ~[?:?] at sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:597) ~[?:?] at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:333) ~[?:?] at java.net.Socket.connect(Socket.java:645) ~[?:?] at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:300) ~[?:?] at sun.net.NetworkClient.doConnect(NetworkClient.java:177) ~[?:?] at sun.net.www.http.HttpClient.openServer(HttpClient.java:497) ~[?:?] at sun.net.www.http.HttpClient.openServer(HttpClient.java:600) ~[?:?] at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:265) ~[?:?] at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:379) ~[?:?] at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:189) ~[?:?] at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1232) ~[?:?] at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1120) ~[?:?] at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:175) ~[?:?] at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1653) ~[?:?] at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1577) ~[?:?] at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:527) ~[?:?] at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:308) ~[?:?] at org.elasticsearch.ingest.geoip.HttpClient.lambda$get$0(HttpClient.java:55) ~[ingest-geoip-7.14.1.jar:7.14.1] at java.security.AccessController.doPrivileged(AccessController.java:554) ~[?:?] 
at org.elasticsearch.ingest.geoip.HttpClient.doPrivileged(HttpClient.java:97) ~[ingest-geoip-7.14.1.jar:7.14.1] at org.elasticsearch.ingest.geoip.HttpClient.get(HttpClient.java:49) ~[ingest-geoip-7.14.1.jar:7.14.1] at org.elasticsearch.ingest.geoip.HttpClient.getBytes(HttpClient.java:40) ~[ingest-geoip-7.14.1.jar:7.14.1] at org.elasticsearch.ingest.geoip.GeoIpDownloader.fetchDatabasesOverview(GeoIpDownloader.java:115) ~[ingest-geoip-7.14.1.jar:7.14.1] at org.elasticsearch.ingest.geoip.GeoIpDownloader.updateDatabases(GeoIpDownloader.java:103) ~[ingest-geoip-7.14.1.jar:7.14.1] at org.elasticsearch.ingest.geoip.GeoIpDownloader.runDownloader(GeoIpDownloader.java:235) [ingest-geoip-7.14.1.jar:7.14.1] at org.elasticsearch.ingest.geoip.GeoIpDownloaderTaskExecutor.nodeOperation(GeoIpDownloaderTaskExecutor.java:94) [ingest-geoip-7.14.1.jar:7.14.1] at org.elasticsearch.ingest.geoip.GeoIpDownloaderTaskExecutor.nodeOperation(GeoIpDownloaderTaskExecutor.java:43) [ingest-geoip-7.14.1.jar:7.14.1] at org.elasticsearch.persistent.NodePersistentTasksExecutor$1.doRun(NodePersistentTasksExecutor.java:40) [elasticsearch-7.14.1.jar:7.14.1] at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732) [elasticsearch-7.14.1.jar:7.14.1] at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.14.1.jar:7.14.1] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?] at java.lang.Thread.run(Thread.java:831) [?:?]
Cause: this version enables GeoIP downloading by default; on startup it tries to fetch the latest GeoIP data from the official endpoint.
Official docs: geoip-processor
Adding the setting ingest.geoip.downloader.enabled: false fixes it.
vi /etc/security/limits.conf
Append:
* soft nofile 65536
* hard nofile 131072
* soft nproc 2048
* hard nproc 4096
vi /etc/sysctl.conf
Append:
vm.max_map_count=655360
Then run:
sysctl -p
Restart elasticsearch, and it now starts successfully.
Installing Chinese search
elasticsearch-analysis-ik: copy it into ES_HOME/plugins and name the directory ik. The minor version must match exactly, otherwise startup fails.
elasticsearch-analysis-pinyin: copy it into ES_HOME/plugins and name it pinyin. Custom dictionaries are supported: https://blog.csdn.net/mingover/article/details/79166375
For installing elasticsearch-head see http://mobz.github.io/elasticsearch-head/. On RHEL 7/Windows it just works: npm start. On RHEL 6 it is painful, especially the nodejs and npm install: gcc must be upgraded to 4.8 or nodejs v6+ will not build, and with 0.6.x npmjs causes endless trouble. In practice it adds little — the CLI can retrieve all the necessary information.
Writing to ES from Java
https://www.cnblogs.com/chenyuanbo/p/10296827.html
https://www.cnblogs.com/cjsblog/p/10232581.html
Java errors
Caused by: java.lang.ClassNotFoundException: org.elasticsearch.client.Cancellable
Cause and fix: a version conflict; pin the version in the POM file:
<properties>
  <java.version>1.8</java.version>
  <elasticsearch.version>7.14.1</elasticsearch.version>
</properties>
Error: node settings must not contain any index level settings
Index-level settings must be set through the REST API. For example, to change the translog parameters:
PUT http://10.20.30.193:9200/_all/_settings?preserve_existing=true
{
  "index.translog.durability": "async",
  "index.translog.sync_interval": "30s",
  "index.translog.flush_threshold_size": "1024mb"
}
{ "error": { "root_cause": [ { "type": "resource_already_exists_exception", "reason": "index [search_doc_new_ic/JQR491ldTDKpNum4pWkl7g] already exists", "index_uuid": "JQR491ldTDKpNum4pWkl7g", "index": "search_doc_new_ic" } ], "type": "resource_already_exists_exception", "reason": "index [search_doc_new_ic/JQR491ldTDKpNum4pWkl7g] already exists", "index_uuid": "JQR491ldTDKpNum4pWkl7g", "index": "search_doc_new_ic" }, "status": 400 }
One suggested approach is to close the index first, modify it, then reopen it. But that should not be the cause of the error above. After closing, the index state becomes unknown (the same transient state as right after deletion). Querying the index then returns:
{"error":{"root_cause":[{"type":"index_closed_exception","reason":"closed","index_uuid":"ZGJeKccHTiyitcdgqvkVqQ","index":"search_doc_new_ic"}],"type":"index_closed_exception","reason":"closed","index_uuid":"ZGJeKccHTiyitcdgqvkVqQ","index":"search_doc_new_ic"},"status":400}
When do you need to close an index?
Some settings can only be changed on a closed index — for example the index's default analyzer:
{ "error": { "root_cause": [ { "type": "illegal_argument_exception", "reason": "Can't update non dynamic settings [[index.analysis.analyzer.default.type]] for open indices [[search_doc_new_ic/ga3Y8cBgR8iBbyZsgqltMw]]" } ], "type": "illegal_argument_exception", "reason": "Can't update non dynamic settings [[index.analysis.analyzer.default.type]] for open indices [[search_doc_new_ic/ga3Y8cBgR8iBbyZsgqltMw]]" }, "status": 400 }
POST http://10.20.30.193:9200/search_doc_new_ic/_close
XXX
POST http://10.20.30.193:9200/search_doc_new_ic/_open
Enabling elasticsearch slow logs
# check whether slow logging is enabled
GET /test/_settings
# enable the search slow log
PUT /test/_settings
{
"index.search.slowlog.threshold.query.warn": "1000ms",
"index.search.slowlog.threshold.query.info": "500ms",
"index.search.slowlog.threshold.query.debug": "800ms",
"index.search.slowlog.threshold.query.trace": "200ms",
"index.search.slowlog.threshold.fetch.warn": "1000ms",
"index.search.slowlog.threshold.fetch.info": "500ms",
"index.search.slowlog.threshold.fetch.debug": "800ms",
"index.search.slowlog.threshold.fetch.trace": "200ms",
"index.search.slowlog.level": "debug"
}
# enable the indexing slow log
PUT /test/_settings
{
"index.indexing.slowlog.threshold.index.warn": "1000ms",
"index.indexing.slowlog.threshold.index.info": "500ms",
"index.indexing.slowlog.threshold.index.debug": "500ms",
"index.indexing.slowlog.threshold.index.trace": "500ms",
"index.indexing.slowlog.level": "debug",
"index.indexing.slowlog.source": 1000
}
Disable slow logs (set the settings back to null)
PUT /test/_settings
{
"index.indexing.slowlog.threshold.index.warn": null,
"index.indexing.slowlog.threshold.index.info": null,
"index.indexing.slowlog.threshold.index.debug": null,
"index.indexing.slowlog.threshold.index.trace": null,
"index.indexing.slowlog.level": null,
"index.indexing.slowlog.source": null
}
GET shopping/_search
{
  "explain": true,
  "query": { "match": { "goodsInfoName": "苏泊尔" } }
}
The output:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : { "total" : 3, "successful" : 3, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 24, "relation" : "eq" },
    "max_score" : 5.3067513,
    "hits" : [ {
      "_shard" : "[shopping][1]",
      "_node" : "h665-yAdSzGgjxamBh5CjA",
      "_index" : "shopping",
      "_type" : "_doc",
      "_id" : "10976",
      "_score" : 5.3067513,
      "_source" : {
        "goodsInfoName" : "苏泊尔不锈钢压力锅高压锅YS22ED+苏泊尔保鲜盒饭盒便当盒330mlKB033AE1(银色)",
        "... other fields omitted"
      },
      "_explanation" : {
        "value" : 5.3067513,
        "description" : "weight(goodsInfoName:苏泊尔 in 328) [PerFieldSimilarity], result of:",
        "details" : [ {
          "value" : 5.3067513,
          "description" : "score(freq=2.0), computed as boost * idf * tf from:",
          "details" : [ {
            "value" : 2.2,
            "description" : "boost",
            "details" : [ ]
          }, {
            "value" : 3.6549778,
            "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
            "details" : [ {
              "value" : 10,
              "description" : "n, number of documents containing term",
              "details" : [ ]
            }, {
              "value" : 405,
              "description" : "N, total number of documents with field",
              "details" : [ ]
            } ]
          }, {
            "value" : 0.65996563,
            "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
            "details" : [ {
              "value" : 2.0,
              "description" : "freq, occurrences of term within document",
              "details" : [ ]
            }, {
              "value" : 1.2,
              "description" : "k1, term saturation parameter",
              "details" : [ ]
            }, {
              "value" : 0.75,
              "description" : "b, length normalization parameter",
              "details" : [ ]
            }, {
              "value" : 11.0,
              "description" : "dl, length of field",
              "details" : [ ]
            }, {
              "value" : 13.553086,
              "description" : "avgdl, average length of field",
              "details" : [ ]
            } ]
          } ]
        } ]
      }
    } ]
  }
}
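The _explanation can be checked by hand: ES 7 scores text matches with BM25, score = boost * idf * tf. Plugging the numbers reported in the explain output into its own formulas reproduces the _score:

```python
import math

# Values copied from the _explanation block above
n, N = 10, 405                # docs containing the term / docs with the field
freq, k1, b = 2.0, 1.2, 0.75  # term frequency and BM25 parameters
dl, avgdl = 11.0, 13.553086   # field length / average field length
boost = 2.2

idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
score = boost * idf * tf

# Matches the reported values within float rounding
assert abs(idf - 3.6549778) < 1e-5
assert abs(tf - 0.65996563) < 1e-5
assert abs(score - 5.3067513) < 1e-5
```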
Performance tuning
Disable indexing for fields that do not need it. Set the attribute "index": "not_analyzed" (exact match only, suitable for date and numeric fields; from 5.x you can instead use type keyword, which means no analysis). This matters most.
Disable the _all field.
Not analyzing, or not indexing, a field saves a lot of computation and lowers CPU usage — especially for binary fields, which by default burn a lot of CPU even though that type never needs analyzed indexing. The indexing cost of a single doc is driven less by its byte size or the length of any one field's value than by the number of fields. For example, in a saturated write benchmark with identical mappings, growing some field values of a 10-field, 200-byte doc to 500 bytes barely slowed writes to ES, while growing the field count to 20 halved the write speed even though the total doc size hardly changed.
Indexing
By default, ES refreshes the index once per second. Data inserted into ES first sits in the in-memory buffer and is invisible to search; it becomes visible only after it is indexed (analyzed and added to the inverted index). The interval is commonly raised to 30s or more; the right value depends on the target machine's indexing speed and the insert TPS. Setting it to -1 does not mean "never index" — indexing just becomes passive: when the translog fills up, a refresh still happens. See https://stackoverflow.com/questions/36449506/what-exactly-does-1-refresh-interval-in-elasticsearch-mean.
indices.memory.index_buffer_size: 10%* -Xmx
Translog tuning
index.translog.durability: async
index.translog.sync_interval: 120s
index.translog.flush_threshold_size: 1024mb
Controlling shard count
The larger a shard, the slower indexing becomes, especially once a single shard exceeds a few tens of GB.