Regarding whether to use a NoSQL store such as MongoDB, or ES, as the storage layer, there is some discussion online; I recommend https://blog.csdn.net/awdac/article/details/78117393. In short, ES can be thought of as a smarter acceleration layer than Redis, but it should not be the system of record. This is similar to the caching mechanisms of many databases — Oracle's result-set cache and TimesTen, or MySQL's query cache — just aimed at different scenarios (e.g. it can be combined with semantic search). Consequently its write throughput is relatively low, and it is much heavier than Redis.
- Wikipedia uses Elasticsearch for full-text search
- GitHub uses Elasticsearch to search code
- Built on Lucene: Elasticsearch is to Lucene roughly what a SQL layer is to an RDBMS engine
- Written in Java
Start: ./bin/elasticsearch -d runs it in daemon mode.
http://localhost:9200/?pretty shows the version and other basic information.
Config file: config/elasticsearch.yml
Clustered by design, like RocketMQ and Kafka.
Nodes communicate with each other over port 9300.
Request format: '<PROTOCOL>://<HOST>/<PATH>?<QUERY_STRING>' -d '<BODY>', where BODY is a JSON-encoded request body.
Elasticsearch uses JSON as its serialization format.
The database-to-ES correspondence is:
Relational DB ⇒ Databases ⇒ Tables ⇒ Rows ⇒ Columns
Elasticsearch ⇒ Indices ⇒ Types ⇒ Documents ⇒ Fields
An ES cluster contains multiple indices. An index is a logical namespace that points to one or more shards, comparable to an Oracle segment. A shard is a Lucene instance. Shards are the unit by which Elasticsearch distributes data across the cluster; it migrates shards between nodes automatically as the cluster grows or shrinks. A shard is either a primary or a replica. This is the same cluster-management model as Couchbase. By default an index has 5 primary shards (reduced to 1 from ES 7.x).
Three client tools cover day-to-day ES work: Postman (pick the REST verb from the dropdown; use POST for searches, since passing a JSON body with GET is awkward), curl, and the bundled Dev Tools console (Dev Tools is a bit special and has better compatibility for some commands, e.g. reindex). All three speak the same REST API. head works too, but is too simplistic.
List all indices
GET _cat/indices
(columns: health status index uuid pri rep docs.count docs.deleted store.size pri.store.size)
yellow open wordbaseinfo_new KFKrcmJoQqWP9kyLzokLQw 1 1 18990 999 174.9mb 174.9mb
yellow open search_doc_new_test RjfMfH5-Sdmh7rIgNoWRfw 1 1 2261 0 83.9mb 83.9mb
yellow open testsearch 3nFp58OXSCCDCZKNBSr8yg 1 1 0 0 208b 208b
green open .kibana-event-log-7.9.0-000004 zrGu0cA0Sle1GHIV2w-szQ 1 0 0 0 208b 208b
green open .kibana-event-log-7.9.0-000005 8r7NEIxHSeGt1qCX98TFlg 1 0 0 0 208b 208b
green open .kibana-event-log-7.9.0-000006 KaC-CnfhTDC81EZMUd6XeQ 1 0 0 0 208b 208b
green open .kibana-event-log-7.9.0-000007 nE_Wv1ibQIW9cRGSt_IZfg 1 0 0 0 208b 208b
green open .apm-custom-link CksbWamWQvaafywczHUbwA 1 0 0 0 208b 208b
yellow open fais_search OEjrM5YwSJulOhD3T2y7Ig 1 1 64 15 2.6mb 2.6mb
green open .kibana_task_manager_1 dhMlGVLjQ7Kq-VdrtI6RMg 1 0 6 20650 14.2mb 14.2mb
yellow open inrulebaseinfo_new om0AqwSPRVqVq6GClq42zQ 1 1 8 10 325kb 325kb
yellow open fais_test nDG2Ou9MSyKaShcB4kLzBA 1 1 0 0 208b 208b
yellow open fail_search_test Gfe4cbi9RX-Dk9fxoAQH3g 1 1 51339 42 907.7mb 907.7mb
yellow open word_item W9m8FuFRTzaagZU29y78mw 1 1 0 0 208b 208b
yellow open search_doc_new_ic tCZigJFUTn6OWEQ3dH013A 1 1 75783 0 2.9gb 2.9gb
yellow open wordbaseinfo_new_for_test fN12XUf6ScCdkIcI01IhfQ 1 1 18854 8440 139.4mb 139.4mb
yellow open worditem uxkzSZToTp6cVkXdwsXSDg 1 1 0 0 208b 208b
green open .apm-agent-configuration zaONhEkUTKqnAZbbTzCs0Q 1 0 0 0 208b 208b
yellow open inrulebaseinfo_new_for_test SWj5BfMWTRyJH8WX7aXCKQ 1 1 0 0 208b 208b
yellow open casebaseinfo rfqoCTfGQqOaCNRtbbkS_Q 1 1 17843 0 55.2mb 55.2mb
yellow open time_test lW9FMLz1TuKzy6inK-gG0A 1 1 0 0 208b 208b
green open .kibana_1 -vM1KSWdQG2zshWD4K0PPg 1 0 615 7 10.4mb 10.4mb
yellow open article IpktM1wTSPO6B1Tp-eEiXA 1 1 1056 0 6.1mb 6.1mb
green open .tasks 1DlF3FRSSvq2sB4ikpydCw 1 0 5 0 20.2kb 20.2kb
yellow open search_doc_new_ic1 EUNqO51GTTGSycoHYhfZoA 1 1 0 0 208b 208b
yellow open search_doc_new_ic_zjhua mARAxLD5QBGQFC6VcCdVVA 1 1 75783 0 3.5gb 3.5gb
yellow open casebaseinfo_for_test Tt9EX2yYSHGDxeunzM4D5g 1 1 16981 0 50mb 50mb
Create an index
PUT http://localhost:9200/blogs
{
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1
}
}
Reindexing
PUT search_doc_new_ic_zjhua   (create the target index first)
POST _reindex
{
  "source": { "index": "search_doc_new_ic" },
  "dest": { "index": "search_doc_new_ic_zjhua" }
}
It completed successfully, yet on 7.14.1 GET _tasks?actions=indices:data/write/reindex came back empty.
The client easily times out; monitor progress with GET _tasks?actions=indices:data/write/reindex.
Caveats: https://www.dazhuanlan.com/dolores63134/topics/1364488
Reindexing can lose data; see https://segmentfault.com/q/1010000019003891.
Another approach is to rebuild the index directly (it must be a rebuild); see https://blog.csdn.net/yexiaomodemo/article/details/97979376.
Create a document
Format: PUT {index}/{type}/{id}; from 7.x this must be changed to PUT {index}/_doc/{id}
With postman: PUT http://localhost:9200/megacorp/employee/1 -d '{
"first_name" : "John",
"last_name" : "Smith",
"age" : 25,
"about" : "I love to go rock climbing",
"interests": [ "sports", "music" ]
}'
Returns: {"_index":"megacorp","_type":"employee","_id":"2","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"created":true}
ES 7.x removed types.
We have long equated an ES "index" with a relational "database" and a "type" with a table. The ES developers consider this a bad analogy: in a relational database two tables are independent, and same-named columns in different tables do not interfere with each other, but that is not true in ES.
Elasticsearch is built on Lucene, and fields with the same name under different types of one index are handled identically in Lucene. Two user_name fields under two different types of the same index are effectively the same field, so you must define identical field mappings in both types; otherwise same-named fields across types conflict and Lucene's processing efficiency drops.
Removing types lets data live in separate indices, so identical field names no longer collide. As the ES tagline has said from the start — "You know, for search" — dropping types is about making ES process data more efficiently.
Beyond that, storing entities with different field sets under different types of one index produces sparse data, which hurts Lucene's ability to compress documents and thus lowers ES query efficiency.
If no ID is set, ES auto-generates one. For example:
{
"_index": "megacorp",
"_type": "employee",
"_id": "1",
"_version": 13,
"result": "updated",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"created": false
}
_version counts the number of changes. As a rule, ids should not be auto-generated.
The shard a document is stored in is determined by: shard = hash(routing) % number_of_primary_shards
routing defaults to the document's _id.
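The formula can be sketched in Python. This is a simplified stand-in (ES actually hashes routing with murmur3; md5 below is only a stable substitute), but it shows why number_of_primary_shards cannot be changed after index creation: a different shard count would re-route most documents.

```python
import hashlib

def pick_shard(routing: str, number_of_primary_shards: int) -> int:
    # shard = hash(routing) % number_of_primary_shards
    # md5 stands in for ES's murmur3: any stable hash illustrates the idea.
    h = int(hashlib.md5(routing.encode("utf-8")).hexdigest(), 16)
    return h % number_of_primary_shards

# The same routing value (by default the _id) always maps to the same shard.
assert pick_shard("doc-42", 5) == pick_shard("doc-42", 5)
assert 0 <= pick_shard("doc-42", 5) < 5
```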
By default replication=sync, and number_of_replicas defaults to 1.
The PUT above auto-creates the index megacorp, with type employee and id 1.
Fetch a document
GET http://localhost:9200/megacorp/employee/1
When the document exists:
{"_index":"megacorp","_type":"employee","_id":"1","_version":1,"found":true,"_source":{
"first_name" : "John",
"last_name" : "Smith",
"age" : 25,
"about" : "I love to go rock climbing",
"interests": [ "sports", "music" ]
}}
_source contains the original JSON document.
http://localhost:9200/megacorp/employee/111
When the document does not exist:
{"_index":"megacorp","_type":"employee","_id":"111","found":false}
and the HTTP status code is 404 (a HEAD request returns 404 as well).
Fetch specific fields
GET http://localhost:9200/megacorp/employee/1?_source=first_name
{
"_index": "megacorp",
"_type": "employee",
"_id": "1",
"_version": 13,
"found": true,
"_source": {
"first_name": "John"
}
}
Delete
DELETE http://localhost:9200/megacorp/employee/111
{
"found": true,
"_index": "megacorp",
"_type": "employee",
"_id": "1",
"_version": 2,
"result": "deleted",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
}
}
Fuzzy search
Exact-match lookups hardly justify ES; fuzzy, full-text search is the point.
GET /_search
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 5,
"max_score": 1,
"hits": [{
"_index": "megacorp",
"_type": "employee",
"_id": "AV3Kp7BqVnBASvmzDScd",
"_score": 1,
"_source": {
"first_name": "John",
"last_name": "Smith",
"age": 25,
"about": "I love to go rock climbing",
"interests": [
"sports",
"music"
]
}
},
{
"_index": "megacorp",
"_type": "employee",
"_id": "2",
"_score": 1,
"_source": {
"first_name": "Jane",
"last_name": "Smith",
"age": 32,
"about": "I like to collect rock albums",
"interests": [
"music"
]
}
},
{
"_index": "megacorp",
"_type": "employee",
"_id": "AV3Kp5hsVnBASvmzDScc",
"_score": 1,
"_source": {
"first_name": "John",
"last_name": "Smith",
"age": 25,
"about": "I love to go rock climbing",
"interests": [
"sports",
"music"
]
}
},
{
"_index": "megacorp",
"_type": "employee",
"_id": "1",
"_score": 1,
"_source": {
"first_name": "John",
"last_name": "Smith",
"age": 25,
"about": "I love to go rock climbing",
"interests": [
"sports",
"music"
]
}
},
{
"_index": "megacorp",
"_type": "employee",
"_id": "3",
"_score": 1,
"_source": {
"first_name": "Douglas",
"last_name": "Fir",
"age": 35,
"about": "I like to build cabinets",
"interests": [
"forestry"
]
}
}
]
}
}
By default hits returns the top 10 matches, ordered by _score descending. For paging, add size and from: http://localhost:9200/megacorp/employee/_search?size=2&from=2
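The from/size arithmetic behind that URL is just an offset computation; a tiny helper (page_params is a hypothetical name, not an ES API) makes it explicit:

```python
def page_params(page: int, size: int) -> dict:
    """1-based page number -> ES pagination params (from = hits to skip)."""
    return {"from": (page - 1) * size, "size": size}

# size=2&from=2 in the URL above is exactly page 2 with 2 hits per page:
assert page_params(2, 2) == {"from": 2, "size": 2}
assert page_params(1, 10) == {"from": 0, "size": 10}
```

Note that deep paging is expensive: every shard has to produce from+size candidates before the coordinating node merges them.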
Searching across every field, i.e. true full-text search: http://localhost:9200/megacorp/employee/_search?q=John. Behind the scenes this queries all fields through an implicit _all field of type string (the _all field was disabled by default in 6.x and removed in 7.x).
For the full syntax see https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-syntax
The structure of a type/mapping (the schema definition)
GET search_doc_new_ic/_mapping /* before ES 7 this was my_index/_mapping/my_type; 7 removed types */
{
"search_doc_new_ic" : {
"mappings" : {
"properties" : {
"authors" : {
"type" : "keyword"
},
"content" : {
"properties" : {
"page_no" : {
"type" : "integer"
},
"paragraphs" : {
"type" : "text",
"index_options" : "offsets"
}
}
},
"doc_id" : {
"type" : "keyword"
},
"doc_source" : {
"type" : "keyword"
},
"file_id" : {
"type" : "keyword"
},
"file_name" : {
"type" : "keyword"
},
"fstore_group" : {
"type" : "keyword"
},
"fstore_path" : {
"type" : "keyword"
},
"industry_chain_nodes" : {
"properties" : {
"code" : {
"type" : "keyword"
},
"name" : {
"type" : "keyword"
}
}
},
"industry_chains" : {
"properties" : {
"code" : {
"type" : "keyword"
},
"name" : {
"type" : "keyword"
}
}
},
"industry_code" : {
"type" : "keyword"
},
"industry_name" : {
"type" : "keyword"
},
"invest_ranking" : {
"type" : "keyword"
},
"local_path" : {
"type" : "keyword"
},
"org_name" : {
"type" : "keyword"
},
"page_count" : {
"type" : "integer"
},
"publish_date" : {
"type" : "date"
},
"pv" : {
"type" : "integer"
},
"report_type" : {
"type" : "keyword"
},
"risk_ranking" : {
"type" : "keyword"
},
"secu_code" : {
"type" : "keyword"
},
"secu_name" : {
"type" : "keyword"
},
"sentiment" : {
"type" : "integer"
},
"summary" : {
"type" : "text",
"index_options" : "offsets"
},
"title" : {
"type" : "text",
"index_options" : "offsets"
}
}
}
}
}
ES infers the most suitable type automatically, e.g. text/long/date. ES is in fact strongly typed: if a long is wrongly mapped as a string, full-text queries will return unexpected results. Beyond the defaults, a field's mapping usually customizes two attributes: index (whether the field supports exact matching, full-text matching, or no search at all) and analyzer (which analyzer to use). A mapping cannot be modified; it can only be set at index creation or when adding new fields.
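Dynamic type inference can be imitated in a few lines. This is a rough sketch only, assuming a single date pattern; real ES applies configurable date_detection formats and richer numeric rules:

```python
from datetime import datetime

def guess_field_type(value):
    """Toy version of ES dynamic mapping inference for a JSON value."""
    if isinstance(value, bool):   # must test before int: bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        try:
            datetime.strptime(value, "%Y-%m-%d")  # assumed date pattern
            return "date"
        except ValueError:
            return "text"
    return "object"

assert guess_field_type(25) == "long"
assert guess_field_type("2021-09-18") == "date"
assert guess_field_type("rock climbing") == "text"
```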
Lucene does not support storing null values.
ajax support
In config/elasticsearch.yml, append:
http.cors.enabled: true
http.cors.allow-origin: "*"
so that web pages can reach ES via ajax.
query DSL vs filter DSL: queries perform full-text matching and produce a _score; filters perform exact matching.
text fields distinguish exact matching from full-text search; long/date and _id do not.
Elasticsearch builds an inverted index over every word of every text field.
Out of the box, exact matching is case-sensitive and does not fold plurals into singulars, which is usually not what we want, and Chinese word matching raises similar issues. This is what analyzers are for. The default is the standard analyzer, based on Unicode text segmentation. The language analyzers ES ships with are listed at https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html; Chinese is not among them, so by default every Chinese character becomes its own term. To move a field off the default analyzer, you must configure it by hand via a mapping (a.k.a. the schema definition — ES's DDL).
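The two points above — an inverted index per text field, plus an analyzer that normalizes terms — can be sketched together. This is a toy analyzer (lowercase + regex split); the real standard analyzer performs Unicode text segmentation:

```python
import re
from collections import defaultdict

def analyze(text):
    """Toy analyzer: lowercase, then split on non-alphanumerics."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def build_inverted_index(docs):
    """Map each term to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

idx = build_inverted_index({
    1: "I love to go Rock climbing",
    2: "I like to collect rock albums",
})
assert idx["rock"] == {1, 2}   # lowercasing makes "Rock" match "rock"
assert idx["climbing"] == {1}
```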
Query conditions are written in the DSL, i.e. in JSON. Every search result carries a _score indicating how well it matched.
Problems
Fielddata is disabled on text fields by default. Set fielddata=true on [interests] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.
Fix: http://blog.csdn.net/u011403655/article/details/71107415
https://www.elastic.co/guide/en/elasticsearch/reference/current/fielddata.html
Within a cluster, one node is elected master. It handles cluster-wide management — adding/removing indices and nodes — but not document-level operations.
Check the cluster health:
GET http://localhost:9200/_cluster/health
{
  "cluster_name": "elasticsearch",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 5,
  "active_shards": 5,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 5,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 50
}
The most important field is status. Its values are:
- green:All primary and replica shards are active.
- yellow: All primary shards are active, but not all replica shards are active. (On a single-node cluster, replica shards serve no purpose, so yellow is expected.)
- red:Not all primary shards are active.
When a second node starts, it automatically joins the cluster with the same cluster.name. After a node failure, Elasticsearch automatically promotes replica shards to primaries, so the cluster can resume serving.
Check the ES and Lucene versions (GET /):
{
"name" : "t2ztM-f",
"cluster_name" : "docker-cluster",
"cluster_uuid" : "DTTrGi_UR12p8Vbc9MTNAQ",
"version" : {
"number" : "6.3.2",
"build_flavor" : "oss",
"build_type" : "tar",
"build_hash" : "053779d",
"build_date" : "2018-07-20T05:20:23.451332Z",
"build_snapshot" : false,
"lucene_version" : "7.3.1",
"minimum_wire_compatibility_version" : "5.6.0",
"minimum_index_compatibility_version" : "5.0.0"
},
"tagline" : "You Know, for Search"
}
In Elasticsearch, every field of a document is indexed and can be used in a query.
Metadata includes:
- _id: uniquely identifies a document within a type
By default, ES sorts by relevance. To sort on a field instead, specify it as follows (note: the filtered query shown below is pre-5.x syntax; on 5.x+ use a bool query with a filter clause):
GET /_search
{
"query": {
"filtered": {
"filter": {
"term": {
"user_id": 1
}
}
}
},
"sort": {
"date": {
"order": "desc"
}
}
}
If the sort is not relevance-based, _score is not computed. Computing _score is expensive, so once sort is specified it is skipped by default; set track_scores=true to force it.
Multi-key ordering: first by date, then by relevance.
GET /_search
{
"query": {
"filtered": {
"query": {
"match": {
"tweet": "manage text search"
}
},
"filter": {
"term": {
"user_id": 2
}
}
}
},
"sort": [{
"date": {
"order": "desc"
}
},
{
"_score": {
"order": "desc"
}
}
]
}
Sorting on a full-text field is rarely meaningful; relevance is normally used instead.
ES keeps as much data as possible in memory to improve performance.
An ES search is a distributed search with two phases: query and fetch. In the query phase, the request is broadcast to (a copy of) every shard, and each shard returns its top N according to the sort condition.
Health at index and shard level
GET _cluster/health?level=indices
GET _cluster/health?level=shards
Node stats:
http://localhost:9200/_nodes/stats
Delete every document in an index without deleting the index itself
POST http://10.20.30.193:9200/search_doc_new_ic/_delete_by_query?refresh
{ "query": { "match_all": {} } }
Response:
{ "took": 147849, "timed_out": false, "total": 3789150, "deleted": 3789150, "batches": 3790, "version_conflicts": 0, "noops": 0, "retries": { "bulk": 0, "search": 0 }, "throttled_millis": 0, "requests_per_second": -1, "throttled_until_millis": 0, "failures": [] }
Note that deleting documents does not free disk space; the space is reclaimed later when segments merge.
How match, match_phrase and query_string differ in queries
match analyzes the query text and matches documents containing any of the resulting terms, regardless of order — like xxx::tsquery in PostgreSQL.
query_string takes raw text and, unlike match, first parses Lucene query syntax (AND/OR, wildcards, field prefixes) before analyzing — like to_tsquery(xxx, xxx).
match_phrase differs from match in that it is a phrase query: the terms must appear in order (and adjacent, unless slop is set), whereas match only requires containment — like phraseto_tsquery(xxx, xxx).
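The match vs match_phrase distinction can be demonstrated on token lists. This is a toy model over already-analyzed terms; real ES also applies analyzers, scoring, and a slop parameter for match_phrase:

```python
def match(doc_terms, query_terms):
    """match: a document hits if it contains any query term (default OR semantics)."""
    return any(t in doc_terms for t in query_terms)

def match_phrase(doc_terms, query_terms):
    """match_phrase: the query terms must occur adjacent and in order."""
    n = len(query_terms)
    return any(doc_terms[i:i + n] == query_terms
               for i in range(len(doc_terms) - n + 1))

doc = "i like to collect rock albums".split()
assert match(doc, ["rock", "climbing"])             # "rock" alone is enough for match
assert not match_phrase(doc, ["rock", "climbing"])  # no contiguous "rock climbing"
assert match_phrase(doc, ["collect", "rock"])       # order + adjacency satisfied
```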
Notes
Keep each JVM heap at no more than 32GB, ideally under 30GB (PostgreSQL has no such problem); beyond that, compressed oops are lost, and GC on a huge Java heap becomes a serious problem anyway. Elasticsearch and Lucene should each get roughly half the machine's memory: the former uses the JVM heap, the latter the OS filesystem cache. With that layout, for HA, set cluster.routing.allocation.same_shard.host: true to prevent a primary and its replica from being allocated to the same machine.
Aggregations are served by a data structure called fielddata. Fielddata is the biggest memory consumer in an ES cluster, so it must be fully understood.
Fielddata is a bit like an RDBMS data block, except row-oriented, loaded into memory on demand. It exists because inverted indices are not a silver bullet: they excel at finding the documents that contain a given term, but for the reverse — listing which terms a given document contains — they are useless, and aggregations need exactly that access pattern.
Installing ES on Linux
vi elasticsearch.yml
network.host: 0.0.0.0   (otherwise only the local machine can connect)
It cannot be run as root — the same as databases such as postgresql and oracle.
groupadd es
useradd -g es es
[2016-12-20T22:37:28,552][ERROR][o.e.b.Bootstrap ] [elk-node1] node validation exception
bootstrap checks failed
Fix: use CentOS 7 and this class of problem does not occur.
system call filters failed to install; check the logs and fix your configuration or disable system call filters at your own risk
Cause:
CentOS 6 does not support SecComp, while ES 5.2.0 defaults bootstrap.system_call_filter to true and runs the check; the failed check prevents ES from starting.
Fix:
In elasticsearch.yml set bootstrap.system_call_filter to false, below the Memory section:
bootstrap.memory_lock: false
bootstrap.system_call_filter: false
ES 7 errors
Starting Elasticsearch fails with:
ERROR: [1] bootstrap checks failed
[1]: the default discovery settings are unsuitable for production use; at least one of [discovery.seed_hosts, discovery.seed_providers, cluster.initial_master_nodes] must be configured
Fix: in
elasticsearch.yml
uncomment and keep a single node:
cluster.initial_master_nodes: ["node-1"]
Another error
[2021-09-18T22:15:24,063][ERROR][o.e.i.g.GeoIpDownloader ] [node-1] exception during geoip databases update java.net.ConnectException: Connection refused at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?] at sun.nio.ch.Net.pollConnectNow(Net.java:669) ~[?:?] at sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:549) ~[?:?] at sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:597) ~[?:?] at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:333) ~[?:?] at java.net.Socket.connect(Socket.java:645) ~[?:?] at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:300) ~[?:?] at sun.net.NetworkClient.doConnect(NetworkClient.java:177) ~[?:?] at sun.net.www.http.HttpClient.openServer(HttpClient.java:497) ~[?:?] at sun.net.www.http.HttpClient.openServer(HttpClient.java:600) ~[?:?] at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:265) ~[?:?] at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:379) ~[?:?] at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:189) ~[?:?] at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1232) ~[?:?] at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1120) ~[?:?] at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:175) ~[?:?] at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1653) ~[?:?] at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1577) ~[?:?] at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:527) ~[?:?] at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:308) ~[?:?] at org.elasticsearch.ingest.geoip.HttpClient.lambda$get$0(HttpClient.java:55) ~[ingest-geoip-7.14.1.jar:7.14.1] at java.security.AccessController.doPrivileged(AccessController.java:554) ~[?:?] 
at org.elasticsearch.ingest.geoip.HttpClient.doPrivileged(HttpClient.java:97) ~[ingest-geoip-7.14.1.jar:7.14.1] at org.elasticsearch.ingest.geoip.HttpClient.get(HttpClient.java:49) ~[ingest-geoip-7.14.1.jar:7.14.1] at org.elasticsearch.ingest.geoip.HttpClient.getBytes(HttpClient.java:40) ~[ingest-geoip-7.14.1.jar:7.14.1] at org.elasticsearch.ingest.geoip.GeoIpDownloader.fetchDatabasesOverview(GeoIpDownloader.java:115) ~[ingest-geoip-7.14.1.jar:7.14.1] at org.elasticsearch.ingest.geoip.GeoIpDownloader.updateDatabases(GeoIpDownloader.java:103) ~[ingest-geoip-7.14.1.jar:7.14.1] at org.elasticsearch.ingest.geoip.GeoIpDownloader.runDownloader(GeoIpDownloader.java:235) [ingest-geoip-7.14.1.jar:7.14.1] at org.elasticsearch.ingest.geoip.GeoIpDownloaderTaskExecutor.nodeOperation(GeoIpDownloaderTaskExecutor.java:94) [ingest-geoip-7.14.1.jar:7.14.1] at org.elasticsearch.ingest.geoip.GeoIpDownloaderTaskExecutor.nodeOperation(GeoIpDownloaderTaskExecutor.java:43) [ingest-geoip-7.14.1.jar:7.14.1] at org.elasticsearch.persistent.NodePersistentTasksExecutor$1.doRun(NodePersistentTasksExecutor.java:40) [elasticsearch-7.14.1.jar:7.14.1] at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732) [elasticsearch-7.14.1.jar:7.14.1] at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.14.1.jar:7.14.1] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?] at java.lang.Thread.run(Thread.java:831) [?:?]
Cause: this version enables GeoIP downloading by default; on startup it tries to fetch the latest GeoIP data from the official endpoint.
Official docs: geoip-processor
Adding the setting ingest.geoip.downloader.enabled: false fixes it.
vi /etc/security/limits.conf
Append:
* soft nofile 65536
* hard nofile 131072
* soft nproc 2048
* hard nproc 4096
vi /etc/sysctl.conf
Append:
vm.max_map_count=655360
Then run:
sysctl -p
Restart elasticsearch, and it now starts successfully.
Installing Chinese search
elasticsearch-analysis-ik: copy it into ES_HOME/plugins and name the directory ik. The minor version must match exactly, otherwise startup fails.
elasticsearch-analysis-pinyin: copy it into ES_HOME/plugins and name it pinyin. Custom dictionaries are supported: https://blog.csdn.net/mingover/article/details/79166375
For installing elasticsearch-head see http://mobz.github.io/elasticsearch-head/. On RHEL 7/Windows it just works: npm start. On RHEL 6 it is painful, especially the nodejs and npm install: gcc must be upgraded to 4.8 or nodejs v6+ will not build, and with 0.6.x npmjs causes endless trouble. In practice it adds little — the CLI can retrieve all the necessary information.
Writing to ES from Java
https://www.cnblogs.com/chenyuanbo/p/10296827.html
https://www.cnblogs.com/cjsblog/p/10232581.html
Java errors
Caused by: java.lang.ClassNotFoundException: org.elasticsearch.client.Cancellable
Cause and fix: a version conflict; pin the version in the POM file:
<properties>
  <java.version>1.8</java.version>
  <elasticsearch.version>7.14.1</elasticsearch.version>
</properties>
Error: node settings must not contain any index level settings
Index-level settings must be set through the REST API. For example, to change the translog parameters:
PUT http://10.20.30.193:9200/_all/_settings?preserve_existing=true
{
  "index.translog.durability": "async",
  "index.translog.sync_interval": "30s",
  "index.translog.flush_threshold_size": "1024mb"
}
{ "error": { "root_cause": [ { "type": "resource_already_exists_exception", "reason": "index [search_doc_new_ic/JQR491ldTDKpNum4pWkl7g] already exists", "index_uuid": "JQR491ldTDKpNum4pWkl7g", "index": "search_doc_new_ic" } ], "type": "resource_already_exists_exception", "reason": "index [search_doc_new_ic/JQR491ldTDKpNum4pWkl7g] already exists", "index_uuid": "JQR491ldTDKpNum4pWkl7g", "index": "search_doc_new_ic" }, "status": 400 }
One suggested approach is to close the index first, modify it, then reopen it. But that should not be the cause of the error above. After closing, the index state becomes unknown (the same transient state as right after deletion). Querying the index then returns:
{"error":{"root_cause":[{"type":"index_closed_exception","reason":"closed","index_uuid":"ZGJeKccHTiyitcdgqvkVqQ","index":"search_doc_new_ic"}],"type":"index_closed_exception","reason":"closed","index_uuid":"ZGJeKccHTiyitcdgqvkVqQ","index":"search_doc_new_ic"},"status":400}
When do you need to close an index?
Some settings can only be changed on a closed index — for example the index's default analyzer:
{ "error": { "root_cause": [ { "type": "illegal_argument_exception", "reason": "Can't update non dynamic settings [[index.analysis.analyzer.default.type]] for open indices [[search_doc_new_ic/ga3Y8cBgR8iBbyZsgqltMw]]" } ], "type": "illegal_argument_exception", "reason": "Can't update non dynamic settings [[index.analysis.analyzer.default.type]] for open indices [[search_doc_new_ic/ga3Y8cBgR8iBbyZsgqltMw]]" }, "status": 400 }
POST http://10.20.30.193:9200/search_doc_new_ic/_close
XXX
POST http://10.20.30.193:9200/search_doc_new_ic/_open
Enabling elasticsearch slow logs
# check whether slow logging is enabled
GET /test/_settings
# enable the search slow log
PUT /test/_settings
{
"index.search.slowlog.threshold.query.warn": "1000ms",
"index.search.slowlog.threshold.query.info": "500ms",
"index.search.slowlog.threshold.query.debug": "800ms",
"index.search.slowlog.threshold.query.trace": "200ms",
"index.search.slowlog.threshold.fetch.warn": "1000ms",
"index.search.slowlog.threshold.fetch.info": "500ms",
"index.search.slowlog.threshold.fetch.debug": "800ms",
"index.search.slowlog.threshold.fetch.trace": "200ms",
"index.search.slowlog.level": "debug"
}
# enable the indexing slow log
PUT /test/_settings
{
"index.indexing.slowlog.threshold.index.warn": "1000ms",
"index.indexing.slowlog.threshold.index.info": "500ms",
"index.indexing.slowlog.threshold.index.debug": "500ms",
"index.indexing.slowlog.threshold.index.trace": "500ms",
"index.indexing.slowlog.level": "debug",
"index.indexing.slowlog.source": 1000
}
Disable slow logs (set the settings back to null)
PUT /test/_settings
{
"index.indexing.slowlog.threshold.index.warn": null,
"index.indexing.slowlog.threshold.index.info": null,
"index.indexing.slowlog.threshold.index.debug": null,
"index.indexing.slowlog.threshold.index.trace": null,
"index.indexing.slowlog.level": null,
"index.indexing.slowlog.source": null
}
GET shopping/_search
{
  "explain": true,
  "query": { "match": { "goodsInfoName": "苏泊尔" } }
}
The output:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : { "total" : 3, "successful" : 3, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 24, "relation" : "eq" },
    "max_score" : 5.3067513,
    "hits" : [ {
      "_shard" : "[shopping][1]",
      "_node" : "h665-yAdSzGgjxamBh5CjA",
      "_index" : "shopping",
      "_type" : "_doc",
      "_id" : "10976",
      "_score" : 5.3067513,
      "_source" : {
        "goodsInfoName" : "苏泊尔不锈钢压力锅高压锅YS22ED+苏泊尔保鲜盒饭盒便当盒330mlKB033AE1(银色)",
        "... other fields omitted"
      },
      "_explanation" : {
        "value" : 5.3067513,
        "description" : "weight(goodsInfoName:苏泊尔 in 328) [PerFieldSimilarity], result of:",
        "details" : [ {
          "value" : 5.3067513,
          "description" : "score(freq=2.0), computed as boost * idf * tf from:",
          "details" : [ {
            "value" : 2.2,
            "description" : "boost",
            "details" : [ ]
          }, {
            "value" : 3.6549778,
            "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
            "details" : [ {
              "value" : 10,
              "description" : "n, number of documents containing term",
              "details" : [ ]
            }, {
              "value" : 405,
              "description" : "N, total number of documents with field",
              "details" : [ ]
            } ]
          }, {
            "value" : 0.65996563,
            "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
            "details" : [ {
              "value" : 2.0,
              "description" : "freq, occurrences of term within document",
              "details" : [ ]
            }, {
              "value" : 1.2,
              "description" : "k1, term saturation parameter",
              "details" : [ ]
            }, {
              "value" : 0.75,
              "description" : "b, length normalization parameter",
              "details" : [ ]
            }, {
              "value" : 11.0,
              "description" : "dl, length of field",
              "details" : [ ]
            }, {
              "value" : 13.553086,
              "description" : "avgdl, average length of field",
              "details" : [ ]
            } ]
          } ]
        } ]
      }
    } ]
  }
}
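The _explanation can be checked by hand: ES 7 scores text matches with BM25, score = boost * idf * tf. Plugging the numbers reported in the explain output into its own formulas reproduces the _score:

```python
import math

# Values copied from the _explanation block above
n, N = 10, 405                # docs containing the term / docs with the field
freq, k1, b = 2.0, 1.2, 0.75  # term frequency and BM25 parameters
dl, avgdl = 11.0, 13.553086   # field length / average field length
boost = 2.2

idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
score = boost * idf * tf

# Matches the reported values within float rounding
assert abs(idf - 3.6549778) < 1e-5
assert abs(tf - 0.65996563) < 1e-5
assert abs(score - 5.3067513) < 1e-5
```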
Performance tuning
Disable indexing for fields that do not need it. Set the attribute "index": "not_analyzed" (exact match only, suitable for date and numeric fields; from 5.x you can instead use type keyword, which means no analysis). This matters most.
Disable the _all field.
Not analyzing, or not indexing, a field saves a lot of computation and lowers CPU usage — especially for binary fields, which by default burn a lot of CPU even though that type never needs analyzed indexing. The indexing cost of a single doc is driven less by its byte size or the length of any one field's value than by the number of fields. For example, in a saturated write benchmark with identical mappings, growing some field values of a 10-field, 200-byte doc to 500 bytes barely slowed writes to ES, while growing the field count to 20 halved the write speed even though the total doc size hardly changed.
Indexing
By default, ES refreshes the index once per second. Data inserted into ES first sits in the in-memory buffer and is invisible to search; it becomes visible only after it is indexed (analyzed and added to the inverted index). The interval is commonly raised to 30s or more; the right value depends on the target machine's indexing speed and the insert TPS. Setting it to -1 does not mean "never index" — indexing just becomes passive: when the translog fills up, a refresh still happens. See https://stackoverflow.com/questions/36449506/what-exactly-does-1-refresh-interval-in-elasticsearch-mean.
indices.memory.index_buffer_size: 10%* -Xmx
Translog tuning
index.translog.durability: async
index.translog.sync_interval: 120s
index.translog.flush_threshold_size: 1024mb
Controlling shard count
The larger a shard, the slower indexing becomes, especially once a single shard exceeds a few tens of GB.