On whether to use a NoSQL store such as MongoDB, or Elasticsearch, as the storage mechanism, there is some discussion online; I recommend https://blog.csdn.net/awdac/article/details/78117393 as a reference. In short, ES can be thought of as a smarter acceleration layer than Redis, but it should not serve as the primary storage mechanism. This is similar to the caching mechanisms of many databases — Oracle's result-set cache and TimesTen, or MySQL's query cache — only targeting different scenarios, for example combining with semantic search. Consequently its write throughput is relatively low, and compared with Redis it is much heavier.

  • Wikipedia uses Elasticsearch for full-text search
  • GitHub uses Elasticsearch to search code
  • Built on Lucene; Elasticsearch is to Lucene what SQL is to an RDBMS engine
  • Written in Java

  Start with ./bin/elasticsearch -d for background (daemon) mode.
  http://localhost:9200/?pretty shows the version and other basic information.
  Configuration file: config/elasticsearch.yml
  Clustered by design, similar to RocketMQ and Kafka.
  Nodes communicate with each other over port 9300.
  Request format: '<PROTOCOL>://<HOST>/<PATH>?<QUERY_STRING>' -d '<BODY>', where BODY is a JSON-encoded request body.
  Elasticsearch uses JSON as its serialization format.

  The correspondence between a relational database and ES is as follows:
  Relational DB ⇒ Databases ⇒ Tables ⇒ Rows ⇒ Columns
  Elasticsearch ⇒ Indices ⇒ Types ⇒ Documents ⇒ Fields
  An ES cluster contains multiple indices. An index is a logical namespace that points to one or more shards, comparable to an Oracle segment. A shard is an instance of Lucene. Shards are the unit by which Elasticsearch distributes data within the cluster, and Elasticsearch automatically migrates shards between nodes as the cluster grows or shrinks. A shard is either a primary or a replica. This is the same cluster-management model as Couchbase. By default, an index has 5 primary shards.

  There are three common client tools for day-to-day ES work: Postman (pick the REST verb from the dropdown; use POST for searches, since passing a JSON body with GET is awkward), curl, and ES's bundled Dev Tools client (Dev Tools is a bit special and has better compatibility for some commands, e.g. reindex). All three issue the same REST API commands. head also works, but it is too simplistic.

Elasticsearch Study Notes

List all indices

GET _cat/indices
yellow open wordbaseinfo_new               KFKrcmJoQqWP9kyLzokLQw 1 1 18990   999 174.9mb 174.9mb
yellow open search_doc_new_test            RjfMfH5-Sdmh7rIgNoWRfw 1 1  2261     0  83.9mb  83.9mb
yellow open testsearch                     3nFp58OXSCCDCZKNBSr8yg 1 1     0     0    208b    208b
green  open .kibana-event-log-7.9.0-000004 zrGu0cA0Sle1GHIV2w-szQ 1 0     0     0    208b    208b
green  open .kibana-event-log-7.9.0-000005 8r7NEIxHSeGt1qCX98TFlg 1 0     0     0    208b    208b
green  open .kibana-event-log-7.9.0-000006 KaC-CnfhTDC81EZMUd6XeQ 1 0     0     0    208b    208b
green  open .kibana-event-log-7.9.0-000007 nE_Wv1ibQIW9cRGSt_IZfg 1 0     0     0    208b    208b
green  open .apm-custom-link               CksbWamWQvaafywczHUbwA 1 0     0     0    208b    208b
yellow open fais_search                    OEjrM5YwSJulOhD3T2y7Ig 1 1    64    15   2.6mb   2.6mb
green  open .kibana_task_manager_1         dhMlGVLjQ7Kq-VdrtI6RMg 1 0     6 20650  14.2mb  14.2mb
yellow open inrulebaseinfo_new             om0AqwSPRVqVq6GClq42zQ 1 1     8    10   325kb   325kb
yellow open fais_test                      nDG2Ou9MSyKaShcB4kLzBA 1 1     0     0    208b    208b
yellow open fail_search_test               Gfe4cbi9RX-Dk9fxoAQH3g 1 1 51339    42 907.7mb 907.7mb
yellow open word_item                      W9m8FuFRTzaagZU29y78mw 1 1     0     0    208b    208b
yellow open search_doc_new_ic              tCZigJFUTn6OWEQ3dH013A 1 1 75783     0   2.9gb   2.9gb
yellow open wordbaseinfo_new_for_test      fN12XUf6ScCdkIcI01IhfQ 1 1 18854  8440 139.4mb 139.4mb
yellow open worditem                       uxkzSZToTp6cVkXdwsXSDg 1 1     0     0    208b    208b
green  open .apm-agent-configuration       zaONhEkUTKqnAZbbTzCs0Q 1 0     0     0    208b    208b
yellow open inrulebaseinfo_new_for_test    SWj5BfMWTRyJH8WX7aXCKQ 1 1     0     0    208b    208b
yellow open casebaseinfo                   rfqoCTfGQqOaCNRtbbkS_Q 1 1 17843     0  55.2mb  55.2mb
yellow open time_test                      lW9FMLz1TuKzy6inK-gG0A 1 1     0     0    208b    208b
green  open .kibana_1                      -vM1KSWdQG2zshWD4K0PPg 1 0   615     7  10.4mb  10.4mb
yellow open article                        IpktM1wTSPO6B1Tp-eEiXA 1 1  1056     0   6.1mb   6.1mb
green  open .tasks                         1DlF3FRSSvq2sB4ikpydCw 1 0     5     0  20.2kb  20.2kb
yellow open search_doc_new_ic1             EUNqO51GTTGSycoHYhfZoA 1 1     0     0    208b    208b
yellow open search_doc_new_ic_zjhua        mARAxLD5QBGQFC6VcCdVVA 1 1 75783     0   3.5gb   3.5gb
yellow open casebaseinfo_for_test          Tt9EX2yYSHGDxeunzM4D5g 1 1 16981     0    50mb    50mb

Create an index

PUT http://localhost:9200/blogs

{
	"settings": {
		"number_of_shards": 3,
		"number_of_replicas": 1
	}
}

Reindexing

  PUT search_doc_new_ic_zjhua

POST _reindex
{
  "source": {
    "index": "search_doc_new_ic"
  },
  "dest": {
    "index": "search_doc_new_ic_zjhua"
  }
}


   The reindex executed successfully, but on 7.14.1 GET _tasks?actions=indices:data/write/reindex returned an empty result.

  Clients time out easily; progress can be monitored with GET _tasks?actions=indices:data/write/reindex.

  Caveats: https://www.dazhuanlan.com/dolores63134/topics/1364488

  Reindexing can lose data; see https://segmentfault.com/q/1010000019003891.

  Another approach is rebuilding in place (it has to be a rebuild), as well as rebuilding into a new index; see https://blog.csdn.net/yexiaomodemo/article/details/97979376

Create a document

  Format: PUT {index}/{type}/{id} must be changed to PUT {index}/_doc/{id} (types were removed).

  With Postman: PUT http://localhost:9200/megacorp/employee/1 -d '{
"first_name" : "John",
"last_name" : "Smith",
"age" : 25,
"about" : "I love to go rock climbing",
"interests": [ "sports", "music" ]
}'
  Response: {"_index":"megacorp","_type":"employee","_id":"2","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"created":true}

ES 7.x removed types

We have long treated an ES "index" as analogous to a relational "database" and a "type" as a table. The ES developers consider this a poor mental model. For example, two tables in a relational database are independent: columns with the same name in different tables do not interfere with each other. That is not the case in ES.

As we know, Elasticsearch is a search engine built on Lucene, and fields with the same name under different types of one ES index are handled identically in Lucene. For example, two user_name fields under two different types are effectively considered the same field within one ES index, and you must define the same field mapping in both types. Otherwise, same-named fields across types conflict during processing and degrade Lucene's efficiency.

Removing types lets data live in separate indices, so same-named fields no longer conflict. As Elasticsearch's opening tagline puts it — "You Know, for Search" — dropping types is about making ES process data more efficiently.

In addition, storing entities with different field counts under different types of the same index produces sparse data, which hurts Lucene's ability to compress documents and lowers ES query efficiency.

https://blog.csdn.net/can_do_it/article/details/84884757

https://blog.csdn.net/zjx546391707/article/details/78631394

  If no ID is specified, ES generates one automatically. For example:
{
"_index": "megacorp",
"_type": "employee",
"_id": "1",
"_version": 13,
"result": "updated",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"created": false
}
  _version is the number of times the document has changed. Generally, IDs should not be auto-generated.

  The shard a document is stored in is determined by: shard = hash(routing) % number_of_primary_shards

  routing defaults to _id.
  By default, replication=sync, and replica=1.
  The index megacorp is created automatically, declaring type employee and ID 1.
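The routing formula above can be sketched in a few lines (ES actually hashes the routing value with Murmur3; the CRC32 stand-in and the route_to_shard name below are assumptions of this sketch):

```python
import zlib

def route_to_shard(routing: str, number_of_primary_shards: int = 5) -> int:
    # shard = hash(routing) % number_of_primary_shards
    # (ES uses Murmur3 on the routing value, which defaults to _id;
    #  CRC32 stands in for it here.)
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards

# The same _id always lands on the same shard, which is exactly why
# number_of_primary_shards cannot be changed after index creation.
assert route_to_shard("1") == route_to_shard("1")
```

Changing number_of_primary_shards would change the modulo result for existing documents and break routing, which is why resizing requires a reindex.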

Retrieve a document

  GET http://localhost:9200/megacorp/employee/1
When the document exists:
{"_index":"megacorp","_type":"employee","_id":"1","_version":1,"found":true,"_source":{
"first_name" : "John",
"last_name" : "Smith",
"age" : 25,
"about" : "I love to go rock climbing",
"interests": [ "sports", "music" ]
}}
  _source contains the original JSON document.
  http://localhost:9200/megacorp/employee/111
  When it does not exist:
  {"_index":"megacorp","_type":"employee","_id":"111","found":false}
  and the HTTP status (also for HEAD) is 404.

Retrieve specific fields

GET http://localhost:9200/megacorp/employee/1?_source=first_name
	{
		"_index": "megacorp",
		"_type": "employee",
		"_id": "1",
		"_version": 13,
		"found": true,
		"_source": {
			"first_name": "John"
		}
	}

Delete

DELETE http://localhost:9200/megacorp/employee/111
	{
		"found": true,
		"_index": "megacorp",
		"_type": "employee",
		"_id": "1",
		"_version": 2,
		"result": "deleted",
		"_shards": {
			"total": 2,
			"successful": 1,
			"failed": 0
		}
	}

Fuzzy (full-text) search

  There is little point in using ES for exact lookups alone; fuzzy, full-text search is the key feature.

/_search {
	"took": 5,
	"timed_out": false,
	"_shards": {
		"total": 5,
		"successful": 5,
		"failed": 0
	},
	"hits": {
		"total": 5,
		"max_score": 1,
		"hits": [{
				"_index": "megacorp",
				"_type": "employee",
				"_id": "AV3Kp7BqVnBASvmzDScd",
				"_score": 1,
				"_source": {
					"first_name": "John",
					"last_name": "Smith",
					"age": 25,
					"about": "I love to go rock climbing",
					"interests": [
						"sports",
						"music"
					]
				}
			},
			{
				"_index": "megacorp",
				"_type": "employee",
				"_id": "2",
				"_score": 1,
				"_source": {
					"first_name": "Jane",
					"last_name": "Smith",
					"age": 32,
					"about": "I like to collect rock albums",
					"interests": [
						"music"
					]
				}
			},
			{
				"_index": "megacorp",
				"_type": "employee",
				"_id": "AV3Kp5hsVnBASvmzDScc",
				"_score": 1,
				"_source": {
					"first_name": "John",
					"last_name": "Smith",
					"age": 25,
					"about": "I love to go rock climbing",
					"interests": [
						"sports",
						"music"
					]
				}
			},
			{
				"_index": "megacorp",
				"_type": "employee",
				"_id": "1",
				"_score": 1,
				"_source": {
					"first_name": "John",
					"last_name": "Smith",
					"age": 25,
					"about": "I love to go rock climbing",
					"interests": [
						"sports",
						"music"
					]
				}
			},
			{
				"_index": "megacorp",
				"_type": "employee",
				"_id": "3",
				"_score": 1,
				"_source": {
					"first_name": "Douglas",
					"last_name": "Fir",
					"age": 35,
					"about": "I like to build cabinets",
					"interests": [
						"forestry"
					]
				}
			}
		]
	}
}

  By default, hits returns the top 10 matching documents, ordered by _score descending. For pagination, add parameters: http://localhost:9200/megacorp/employee/_search?size=2&from=2

  Searching across all fields — true full-text search: http://localhost:9200/megacorp/employee/_search?q=John. Under the hood this queries every field via an implicit _all field of type string.
  For the full syntax see https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-syntax
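The size/from parameters map onto page numbers as follows (page_params is a hypothetical helper, not an ES API):

```python
def page_params(page: int, page_size: int) -> dict:
    # ES skips `from` hits and returns the next `size` hits.
    return {"from": (page - 1) * page_size, "size": page_size}

# Page 2 with 2 hits per page corresponds to ?size=2&from=2 above.
assert page_params(2, 2) == {"from": 2, "size": 2}
```

Note that each shard must still collect and sort from+size candidates, which is why deep pagination gets expensive.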

Type/mapping structure (mapping = schema definition)

GET search_doc_new_ic/_mapping   /* before ES 7 use my_index/_mapping/my_type; ES 7 removed types */
{
  "search_doc_new_ic" : {
    "mappings" : {
      "properties" : {
        "authors" : {
          "type" : "keyword"
        },
        "content" : {
          "properties" : {
            "page_no" : {
              "type" : "integer"
            },
            "paragraphs" : {
              "type" : "text",
              "index_options" : "offsets"
            }
          }
        },
        "doc_id" : {
          "type" : "keyword"
        },
        "doc_source" : {
          "type" : "keyword"
        },
        "file_id" : {
          "type" : "keyword"
        },
        "file_name" : {
          "type" : "keyword"
        },
        "fstore_group" : {
          "type" : "keyword"
        },
        "fstore_path" : {
          "type" : "keyword"
        },
        "industry_chain_nodes" : {
          "properties" : {
            "code" : {
              "type" : "keyword"
            },
            "name" : {
              "type" : "keyword"
            }
          }
        },
        "industry_chains" : {
          "properties" : {
            "code" : {
              "type" : "keyword"
            },
            "name" : {
              "type" : "keyword"
            }
          }
        },
        "industry_code" : {
          "type" : "keyword"
        },
        "industry_name" : {
          "type" : "keyword"
        },
        "invest_ranking" : {
          "type" : "keyword"
        },
        "local_path" : {
          "type" : "keyword"
        },
        "org_name" : {
          "type" : "keyword"
        },
        "page_count" : {
          "type" : "integer"
        },
        "publish_date" : {
          "type" : "date"
        },
        "pv" : {
          "type" : "integer"
        },
        "report_type" : {
          "type" : "keyword"
        },
        "risk_ranking" : {
          "type" : "keyword"
        },
        "secu_code" : {
          "type" : "keyword"
        },
        "secu_name" : {
          "type" : "keyword"
        },
        "sentiment" : {
          "type" : "integer"
        },
        "summary" : {
          "type" : "text",
          "index_options" : "offsets"
        },
        "title" : {
          "type" : "text",
          "index_options" : "offsets"
        }
      }
    }
  }
}

 


  ES automatically infers the most suitable type, e.g. text/long/date. ES is in fact strongly typed: if a long is incorrectly mapped as string, full-text search produces unexpected results. Beyond the defaults, a field's mapping can be customized, usually via the index attribute (controls whether the field supports exact match, full-text match, or no search at all) and the analyzer attribute (declares the analyzer). A mapping cannot be modified afterwards; it can only be specified at index creation or when adding new fields.

  Lucene cannot store null values.

AJAX support

  In elasticsearch.yml under the config directory, append the following:

http.cors.enabled: true
http.cors.allow-origin: "*"

  so that the service can be accessed via AJAX from web pages.


  Difference between the query DSL and the filter DSL: query is for full-text search and produces a _score; filter is for exact matching.
  text fields distinguish between exact match and full-text search; long/date and _id do not.
  Elasticsearch builds an inverted index over every word of every text field.
  By default, ES is sensitive to case and to singular vs. plural forms, whereas in practice we usually want it to be insensitive; Chinese word matching poses a similar problem. For these cases we need an analyzer. The default is the standard analyzer, which works by Unicode text segmentation. The language analyzers ES ships with are listed at https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html; Chinese is not among them, so by default every Chinese character becomes its own term. To stop a field from using the default analyzer, it must be configured manually by declaring a mapping (also called a schema definition — essentially the DDL) on that field.
  Queries are expressed in the DSL, which is JSON. Every query result carries a _score indicating how well it matched.
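The per-word inverted index that ES builds for a text field can be sketched with a toy analyzer (lowercase plus whitespace split; real analyzers also handle stemming, stop words, token filters, etc.):

```python
from collections import defaultdict

docs = {
    1: "I love to go rock climbing",
    2: "I like to collect rock albums",
}

inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():  # the "analyzer" of this sketch
        inverted[term].add(doc_id)

assert inverted["rock"] == {1, 2}      # term -> set of matching doc ids
assert inverted["climbing"] == {1}
```

With no Chinese analyzer installed, the analogous index would contain one term per Chinese character, which is why plugins such as ik are needed.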

Issues

  Fielddata is disabled on text fields by default. Set fielddata=true on [interests] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.
  Solution: http://blog.csdn.net/u011403655/article/details/71107415
https://www.elastic.co/guide/en/elasticsearch/reference/current/fielddata.html

  Within a cluster, one node is elected master. It handles cluster-wide management, such as adding/removing indices and nodes, but does not get involved in document-level operations.

Check ES cluster health:

GET http://localhost:9200/_cluster/health
{
    "cluster_name": "elasticsearch",
    "status": "yellow",
    "timed_out": false,
    "number_of_nodes": 1,
    "number_of_data_nodes": 1,
    "active_primary_shards": 5,
    "active_shards": 5,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 5,
    "delayed_unassigned_shards": 0,
    "number_of_pending_tasks": 0,
    "number_of_in_flight_fetch": 0,
    "task_max_waiting_in_queue_millis": 0,
    "active_shards_percent_as_number": 50
}

  The most important field is status. Its values are:

  • green: All primary and replica shards are active.
  • yellow: All primary shards are active, but not all replica shards are active. (For a single-node setup, replica shards serve no purpose.)
  • red: Not all primary shards are active.

  When a second node starts, it automatically joins the cluster whose cluster.name matches its own. After a node goes down, Elasticsearch automatically re-elects primaries — promoting replica shards — so that service can resume.

Check the ES and Lucene versions (GET /)

{
  "name" : "t2ztM-f",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "DTTrGi_UR12p8Vbc9MTNAQ",
  "version" : {
    "number" : "6.3.2",
    "build_flavor" : "oss",
    "build_type" : "tar",
    "build_hash" : "053779d",
    "build_date" : "2018-07-20T05:20:23.451332Z",
    "build_snapshot" : false,
    "lucene_version" : "7.3.1",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}

In Elasticsearch, every field of a document is indexed and can be used in a query.

Metadata includes:

  • _id: uniquely identifies a document within a type

  By default, ES sorts results by relevance. To sort on a field instead, specify it as follows (note: the filtered query used in these examples was replaced by the bool query's filter clause in ES 5+):

GET /_search
{
	"query": {
		"filtered": {
			"filter": {
				"term": {
					"user_id": 1
				}
			}
		}
	},
	"sort": {
		"date": {
			"order": "desc"
		}
	}
}

  If sorting is not relevance-based, _score is not computed. Computing _score is expensive, so when sort is specified it is skipped by default; set track_scores=true to force it.

  Multi-criteria sorting — first by date, then by relevance:

GET /_search
{
	"query": {
		"filtered": {
			"query": {
				"match": {
					"tweet": "manage text search"
				}
			},
			"filter": {
				"term": {
					"user_id": 2
				}
			}
		}
	},
	"sort": [{
			"date": {
				"order": "desc"
			}
		},
		{
			"_score": {
				"order": "desc"
			}
		}
	]
}

  For full-text fields, sorting on the field value makes little sense; relevance is normally used.

  ES keeps as much data as possible in memory to improve performance.
  An ES query is a distributed search, split into a query phase and a fetch phase. In the query phase, the request is broadcast to all shards, and each returns its top N matches according to the order-by criteria.
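The query phase can be sketched as the coordinating node merging per-shard top-N lists (the scores and doc ids below are made up for illustration):

```python
import heapq

# Each shard returns its local top hits as (score, doc_id).
shard_results = [
    [(5.3, "doc1"), (2.1, "doc4")],
    [(4.8, "doc2"), (1.0, "doc5")],
    [(3.9, "doc3")],
]

# Query phase: merge everything and keep the global top N.
top_n = heapq.nlargest(3, (hit for shard in shard_results for hit in shard))
assert [doc for _, doc in top_n] == ["doc1", "doc2", "doc3"]
# Fetch phase: _source is then retrieved only for these winners.
```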

Check status at the index level

GET _cluster/health?level=indices
GET _cluster/health?level=shards
Check node stats:
http://localhost:9200/_nodes/stats

Delete all documents from an index without deleting the index itself

POST http://10.20.30.193:9200/search_doc_new_ic/_delete_by_query?refresh
{ "query": { "match_all": {} } }
{
    "took": 147849,
    "timed_out": false,
    "total": 3789150,
    "deleted": 3789150,
    "batches": 3790,
    "version_conflicts": 0,
    "noops": 0,
    "retries": {
        "bulk": 0,
        "search": 0
    },
    "throttled_millis": 0,
    "requests_per_second": -1,
    "throttled_until_millis": 0,
    "failures": []
}

   Note that deleting documents does not reclaim disk space.


Differences among match, match_phrase, and query_string

  match assumes the query has already been analyzed into terms and looks them up directly. It corresponds to xxx::tsquery in PostgreSQL.

  query_string takes raw, un-analyzed text; it is analyzed first and then behaves like match. Corresponds to to_tsquery(xxx,xxx).

  The difference between match_phrase and match: match is not a phrase query and matches any contained term, while match_phrase also requires term order. Corresponds to phraseto_tsquery(xxx,xxx).
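The distinction can be modeled over already-analyzed terms (toy functions that ignore scoring, slop, and position increments):

```python
def match(doc_terms, query_terms):
    # match: any analyzed query term present (default OR semantics)
    return any(t in doc_terms for t in query_terms)

def match_phrase(doc_terms, query_terms):
    # match_phrase: all terms present, adjacent, and in order
    n = len(query_terms)
    return any(doc_terms[i:i + n] == query_terms
               for i in range(len(doc_terms) - n + 1))

doc = "i love to go rock climbing".split()
assert match(doc, ["climbing", "rock"])             # order irrelevant
assert not match_phrase(doc, ["climbing", "rock"])  # wrong order
assert match_phrase(doc, ["rock", "climbing"])
```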

Notes

  Keep each JVM heap under 32 GB, ideally under 30 GB (PostgreSQL has no such issue); with large Java heaps, GC also becomes a serious problem. Give Elasticsearch and Lucene half of the machine's memory each: the former uses the JVM heap, the latter the OS filesystem cache. With this configuration, to preserve HA you should also set cluster.routing.allocation.same_shard.host: true, so that a primary and its replica shard are never allocated to the same machine.
  Aggregations are served by a data structure called fielddata. Fielddata is the single largest consumer of memory in an Elasticsearch cluster, so it must be thoroughly understood.
  Fielddata is a bit like an RDBMS data block, except organized per row, loaded into memory on demand. Fielddata exists because inverted indices are not a silver bullet: an inverted index excels at finding the documents that contain a given term, but going the other way — listing which terms occur in a given document — it is clueless, and aggregations need exactly that second access pattern.
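The "uninverting" that fielddata performs can be sketched as rebuilding the doc → terms view out of the term → docs index (toy data):

```python
# term -> docs: what the inverted index answers quickly.
inverted = {"rock": {1, 2}, "climbing": {1}, "albums": {2}}

# doc -> terms: what an aggregation needs; fielddata builds this in memory.
fielddata = {}
for term, doc_ids in inverted.items():
    for doc_id in doc_ids:
        fielddata.setdefault(doc_id, set()).add(term)

assert fielddata[1] == {"rock", "climbing"}
assert fielddata[2] == {"rock", "albums"}
```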

Installing ES on Linux

vi elasticsearch.yml
network.host: 0.0.0.0   # otherwise only the local machine can access it
Do not run ES as root; databases such as postgresql and oracle behave the same way.
groupadd es
useradd -g es es
[2016-12-20T22:37:28,552][ERROR][o.e.b.Bootstrap ] [elk-node1] node validation exception
bootstrap checks failed
Fix: use CentOS 7 and this class of problem disappears.
system call filters failed to install; check the logs and fix your configuration or disable system call filters at your own risk
Cause:
  CentOS 6 does not support seccomp, while ES 5.2.0 defaults bootstrap.system_call_filter to true and performs the check; the check fails, which prevents ES from starting.
Fix:
  Set bootstrap.system_call_filter to false in elasticsearch.yml, placed below the Memory settings:
  bootstrap.memory_lock: false
  bootstrap.system_call_filter: false

ES 7 startup errors

The following error was thrown while starting ElasticSearch:

ERROR: [1] bootstrap checks failed
[1]: the default discovery settings are unsuitable for production use; at least one of [discovery.seed_hosts, discovery.seed_providers, cluster.initial_master_nodes] must be configured
Fix: in elasticsearch.yml, uncomment the setting and keep a single node:
cluster.initial_master_nodes: ["node-1"]

Another error

[2021-09-18T22:15:24,063][ERROR][o.e.i.g.GeoIpDownloader  ] [node-1] exception during geoip databases update
java.net.ConnectException: Connection refused
    at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
    at sun.nio.ch.Net.pollConnectNow(Net.java:669) ~[?:?]
    at sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:549) ~[?:?]
    at sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:597) ~[?:?]
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:333) ~[?:?]
    at java.net.Socket.connect(Socket.java:645) ~[?:?]
    at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:300) ~[?:?]
    at sun.net.NetworkClient.doConnect(NetworkClient.java:177) ~[?:?]
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:497) ~[?:?]
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:600) ~[?:?]
    at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:265) ~[?:?]
    at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:379) ~[?:?]
    at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:189) ~[?:?]
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1232) ~[?:?]
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1120) ~[?:?]
    at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:175) ~[?:?]
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1653) ~[?:?]
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1577) ~[?:?]
    at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:527) ~[?:?]
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:308) ~[?:?]
    at org.elasticsearch.ingest.geoip.HttpClient.lambda$get$0(HttpClient.java:55) ~[ingest-geoip-7.14.1.jar:7.14.1]
    at java.security.AccessController.doPrivileged(AccessController.java:554) ~[?:?]
    at org.elasticsearch.ingest.geoip.HttpClient.doPrivileged(HttpClient.java:97) ~[ingest-geoip-7.14.1.jar:7.14.1]
    at org.elasticsearch.ingest.geoip.HttpClient.get(HttpClient.java:49) ~[ingest-geoip-7.14.1.jar:7.14.1]
    at org.elasticsearch.ingest.geoip.HttpClient.getBytes(HttpClient.java:40) ~[ingest-geoip-7.14.1.jar:7.14.1]
    at org.elasticsearch.ingest.geoip.GeoIpDownloader.fetchDatabasesOverview(GeoIpDownloader.java:115) ~[ingest-geoip-7.14.1.jar:7.14.1]
    at org.elasticsearch.ingest.geoip.GeoIpDownloader.updateDatabases(GeoIpDownloader.java:103) ~[ingest-geoip-7.14.1.jar:7.14.1]
    at org.elasticsearch.ingest.geoip.GeoIpDownloader.runDownloader(GeoIpDownloader.java:235) [ingest-geoip-7.14.1.jar:7.14.1]
    at org.elasticsearch.ingest.geoip.GeoIpDownloaderTaskExecutor.nodeOperation(GeoIpDownloaderTaskExecutor.java:94) [ingest-geoip-7.14.1.jar:7.14.1]
    at org.elasticsearch.ingest.geoip.GeoIpDownloaderTaskExecutor.nodeOperation(GeoIpDownloaderTaskExecutor.java:43) [ingest-geoip-7.14.1.jar:7.14.1]
    at org.elasticsearch.persistent.NodePersistentTasksExecutor$1.doRun(NodePersistentTasksExecutor.java:40) [elasticsearch-7.14.1.jar:7.14.1]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732) [elasticsearch-7.14.1.jar:7.14.1]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.14.1.jar:7.14.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
    at java.lang.Thread.run(Thread.java:831) [?:?]

Cause: this version enables GeoIP data collection by default; at startup it tries to fetch the latest GeoIP data from the official default URL.
Official documentation: geoip-processor

Fix: add the setting ingest.geoip.downloader.enabled: false.

 

vi /etc/security/limits.conf
Append the following:

* soft nofile 65536
* hard nofile 131072
* soft nproc 2048
* hard nproc 4096

vi /etc/sysctl.conf
Append the following:
vm.max_map_count=655360
and run:
sysctl -p
Then restart elasticsearch; it will now start successfully.

Installing Chinese search support

  To install elasticsearch-analysis-ik, copy it into ES_HOME/plugins and name the directory ik. Note that even the minor version must match exactly, or startup fails.
  To install elasticsearch-analysis-pinyin, copy it into ES_HOME/plugins and name it pinyin. It supports custom dictionaries: https://blog.csdn.net/mingover/article/details/79166375
  For installing elasticsearch-head see http://mobz.github.io/elasticsearch-head/. On RHEL 7/Windows there is no problem — just npm start. On RHEL 6, installation is painful, especially for nodejs and npm: you must upgrade gcc to 4.8 or nodejs v6+ will not install, and with 0.6.x npmjs causes all sorts of trouble. In practice it is not that useful anyway; the CLI exposes all the necessary information.

Writing to ES from Java

https://www.cnblogs.com/chenyuanbo/p/10296827.html

https://www.cnblogs.com/cjsblog/p/10232581.html

Java error

Caused by: java.lang.ClassNotFoundException: org.elasticsearch.client.Cancellable

Cause and fix: caused by a version conflict; pin the version in the POM file:

<properties>
  <java.version>1.8</java.version>
  <elasticsearch.version>7.14.1</elasticsearch.version>
</properties>

 

node settings must not contain any index level settings

 Index-level settings can only be changed through the REST API. For example, to modify the translog parameters:

http://10.20.30.193:9200/_all/_settings?preserve_existing=true
{
  "index.translog.durability": "async",
  "index.translog.sync_interval": "30s",
  "index.translog.flush_threshold_size": "1024mb"
}

 

{
    "error": {
        "root_cause": [
            {
                "type": "resource_already_exists_exception",
                "reason": "index [search_doc_new_ic/JQR491ldTDKpNum4pWkl7g] already exists",
                "index_uuid": "JQR491ldTDKpNum4pWkl7g",
                "index": "search_doc_new_ic"
            }
        ],
        "type": "resource_already_exists_exception",
        "reason": "index [search_doc_new_ic/JQR491ldTDKpNum4pWkl7g] already exists",
        "index_uuid": "JQR491ldTDKpNum4pWkl7g",
        "index": "search_doc_new_ic"
    },
    "status": 400
}

 

One suggestion is to close the index first, change the setting, then reopen it; but that should not be the root cause here. After closing, the index state becomes unknown (the same transient state as right after a deletion). Querying the index state then returns:

{"error":{"root_cause":[{"type":"index_closed_exception","reason":"closed","index_uuid":"ZGJeKccHTiyitcdgqvkVqQ","index":"search_doc_new_ic"}],"type":"index_closed_exception","reason":"closed","index_uuid":"ZGJeKccHTiyitcdgqvkVqQ","index":"search_doc_new_ic"},"status":400}


When does an index need to be closed?

  Some settings can only be changed after closing the index, for example the index's default analyzer.

{
    "error": {
        "root_cause": [
            {
                "type": "illegal_argument_exception",
                "reason": "Can't update non dynamic settings [[index.analysis.analyzer.default.type]] for open indices [[search_doc_new_ic/ga3Y8cBgR8iBbyZsgqltMw]]"
            }
        ],
        "type": "illegal_argument_exception",
        "reason": "Can't update non dynamic settings [[index.analysis.analyzer.default.type]] for open indices [[search_doc_new_ic/ga3Y8cBgR8iBbyZsgqltMw]]"
    },
    "status": 400
}

 POST http://10.20.30.193:9200/search_doc_new_ic/_close

 (apply the non-dynamic settings change here)

 POST http://10.20.30.193:9200/search_doc_new_ic/_open

 

 Enabling elasticsearch slow logs

# Check whether slow logging is enabled
GET /test/_settings


# Enable search (query) slow logs
PUT /test/_settings
{
    "index.search.slowlog.threshold.query.warn": "1000ms",
    "index.search.slowlog.threshold.query.info": "500ms",
    "index.search.slowlog.threshold.query.debug": "800ms",
    "index.search.slowlog.threshold.query.trace": "200ms",
    "index.search.slowlog.threshold.fetch.warn": "1000ms",
    "index.search.slowlog.threshold.fetch.info": "500ms",
    "index.search.slowlog.threshold.fetch.debug": "800ms",
    "index.search.slowlog.threshold.fetch.trace": "200ms",
    "index.search.slowlog.level": "debug"
}

# Enable indexing slow logs
PUT /test/_settings
{
    "index.indexing.slowlog.threshold.index.warn": "1000ms",
    "index.indexing.slowlog.threshold.index.info": "500ms",
    "index.indexing.slowlog.threshold.index.debug": "500ms",
    "index.indexing.slowlog.threshold.index.trace": "500ms",
    "index.indexing.slowlog.level": "debug",
    "index.indexing.slowlog.source": 1000
}

Disable slow logs

PUT /test/_settings
{
    "index.indexing.slowlog.threshold.index.warn": null,
    "index.indexing.slowlog.threshold.index.info": null,
    "index.indexing.slowlog.threshold.index.debug": null,
    "index.indexing.slowlog.threshold.index.trace": null,
    "index.indexing.slowlog.level": null,
    "index.indexing.slowlog.source": null
}

View the scoring explanation (explain)

GET shopping/_search
{
  "explain": true, 
  "query": {
    "match": {
        "goodsInfoName": "苏泊尔"
    }
  }
}

Output:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 24,
      "relation" : "eq"
    },
    "max_score" : 5.3067513,
    "hits" : [
      {
        "_shard" : "[shopping][1]",
        "_node" : "h665-yAdSzGgjxamBh5CjA",
        "_index" : "shopping",
        "_type" : "_doc",
        "_id" : "10976",
        "_score" : 5.3067513,
        "_source" : {
           "goodsInfoName" : "苏泊尔不锈钢压力锅高压锅YS22ED+苏泊尔保鲜盒饭盒便当盒330mlKB033AE1(银色)",
           "...other fields omitted..."
        },
        "_explanation" : {
          "value" : 5.3067513,
          "description" : "weight(goodsInfoName:苏泊尔 in 328) [PerFieldSimilarity], result of:",
          "details" : [
            {
              "value" : 5.3067513,
              "description" : "score(freq=2.0), computed as boost * idf * tf from:",
              "details" : [
                {
                  "value" : 2.2,
                  "description" : "boost",
                  "details" : [ ]
                },
                {
                  "value" : 3.6549778,
                  "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                  "details" : [
                    {
                      "value" : 10,
                      "description" : "n, number of documents containing term",
                      "details" : [ ]
                    },
                    {
                      "value" : 405,
                      "description" : "N, total number of documents with field",
                      "details" : [ ]
                    }
                  ]
                },
                {
                  "value" : 0.65996563,
                  "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                  "details" : [
                    {
                      "value" : 2.0,
                      "description" : "freq, occurrences of term within document",
                      "details" : [ ]
                    },
                    {
                      "value" : 1.2,
                      "description" : "k1, term saturation parameter",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.75,
                      "description" : "b, length normalization parameter",
                      "details" : [ ]
                    },
                    {
                      "value" : 11.0,
                      "description" : "dl, length of field",
                      "details" : [ ]
                    },
                    {
                      "value" : 13.553086,
                      "description" : "avgdl, average length of field",
                      "details" : [ ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      }

Performance tuning

  Disable indexing for fields that do not need to be searched. Set the attribute "index":"not_analyzed" (exact match only — suitable for date and numeric fields; since 5.x you can instead use the keyword type, meaning no analysis). This is the most important step.

  Disable the _all field.

  Skipping analysis, or indexing altogether, for a field saves a lot of computation and lowers CPU usage. This is especially true for binary fields, which are very CPU-expensive by default yet never need to be analyzed and indexed. The indexing cost of a single doc is driven not primarily by its byte size or the length of any field value, but by the number of fields. For example, in a saturated write benchmark with identical mappings, taking a 10-field, 200-byte doc and lengthening some field values to 500 bytes barely slowed down ES writes, whereas growing the field count to 20 halved write speed even though the total doc size barely changed.
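A mapping fragment illustrating these choices, written as a Python dict for brevity (the field names are borrowed from the mapping shown earlier; which fields to disable is an assumption of this sketch — in the JSON sent to ES, False serializes to false):

```python
mapping = {
    "properties": {
        "title":        {"type": "text"},     # analyzed, full-text searchable
        "doc_id":       {"type": "keyword"},  # exact match only, not analyzed
        "publish_date": {"type": "date"},
        # not searchable at all; kept only in _source:
        "local_path":   {"type": "keyword", "index": False},
    }
}
```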

Indexing (refresh)

  By default, an ES index refreshes once per second. Data inserted into ES first lands in an in-memory buffer, where it is invisible to searches; it only becomes visible after being indexed (tokenized and added to the inverted index). The interval is usually raised to 30s or more; the right value depends on the target machine's indexing speed and the insert TPS. Setting it to -1 does not mean no indexing at all; indexing just becomes passive and still happens once the translog fills up. See https://stackoverflow.com/questions/36449506/what-exactly-does-1-refresh-interval-in-elasticsearch-mean

  indices.memory.index_buffer_size: 10% (of -Xmx)

Translog tuning

  index.translog.durability: async

  index.translog.sync_interval: 120s

  index.translog.flush_threshold_size: 1024mb

Controlling the number of shards

  The larger a shard, the slower indexing becomes, especially once a single shard exceeds a few tens of GB.

 

 
