【Posted】: 2014-09-26 15:33:14
【Question】:
Environment
- MongoDB 2.6.3
- CentOS release 6.5 (Final)
- Java 1.7
- Tomcat 7
- JMeter 2.11
- Amazon EC2
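Our MongoDB deployment is hosted on Amazon EC2. We set up our servers based on the recommended production architecture, as follows:
- 3 config servers
- 2 mongos instances, running alongside the Tomcat servers
- 2 mongod instances forming a primary/secondary replica set (shard 1)
We are currently load testing our application with 3500 concurrent users. Our application is messaging (write) heavy, so we are experimenting with 2 databases, one for users and one for messages. With a single database (users and messages as collections), the average response time was 2.3 seconds, but the error rate was almost 0.00%. With 2 databases, one holding users and the other holding messages, the average response time was 1.1 seconds, but the error rate was higher (0.16%).
When we checked the Tomcat (application server) logs, we found many errors like the following:
~88% of the errors: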
{ "serverUsed" : "localhost:27017" , "ok" : 1 , "n" : 0 , "err" : "write results unavailable from shard01-primary.mycompanys.com:27018 :: caused by :: Location13328 sharded connection pool: connect failed shard01-primary.mycompanys.com:27018 : couldn't connect to server shard01-primary.mycompanys.com:27018 (10.0.1.111), connection attempt failed" , "code" : 83}
~5.5% of the errors:
ReplicaSetMonitor no master found for set: shard01
~2.2% of the errors:
{ "serverUsed" : "localhost:27017" , "ok" : 1 , "n" : 0 , "err" : "could not contact primary for replica set shard01" , "code" : 7}
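A breakdown like the one above can be tallied from the Tomcat log by matching each line against a distinguishing substring of the three error messages. The following is only a minimal sketch in plain Java 7-compatible code: the class name `ErrorTally`, the labels, and the sample lines are hypothetical, and only the matched substrings are copied from the errors quoted above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Tallies log lines by the three error signatures quoted in the question. */
final class ErrorTally {

    // { label, substring }. The labels are made up for this sketch;
    // the substrings are copied from the error messages in the question.
    private static final String[][] SIGNATURES = {
        {"connect_failed", "sharded connection pool: connect failed"},
        {"no_master", "ReplicaSetMonitor no master found for set"},
        {"no_primary", "could not contact primary for replica set"}
    };

    /** Returns a count per label, plus "other" for lines matching no signature. */
    static Map<String, Integer> tally(List<String> logLines) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (String[] sig : SIGNATURES) {
            counts.put(sig[0], 0);
        }
        counts.put("other", 0);

        for (String line : logLines) {
            String label = "other";
            for (String[] sig : SIGNATURES) {
                if (line.contains(sig[1])) {
                    label = sig[0];
                    break; // first matching signature wins
                }
            }
            counts.put(label, counts.get(label) + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Shortened, hypothetical sample lines standing in for a real log file.
        List<String> sample = Arrays.asList(
            "Location13328 sharded connection pool: connect failed shard01-primary.mycompanys.com:27018",
            "ReplicaSetMonitor no master found for set: shard01",
            "could not contact primary for replica set shard01",
            "an unrelated log line");

        Map<String, Integer> counts = tally(sample);
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            double pct = 100.0 * e.getValue() / sample.size();
            System.out.println(e.getKey() + ": " + e.getValue() + " (" + pct + "%)");
        }
    }
}
```

In practice the input would be the lines of the Tomcat log for the test window rather than an in-memory list, and logging the timestamp of each match would make it easier to line the errors up with replica set events.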
However, the replica set's primary (shard01-primary.mycompanys.com) was up and running at the time the errors were thrown:
shard01:PRIMARY> rs.status()
{
"set" : "shard01",
"date" : ISODate("2014-08-04T08:57:59Z"),
"myState" : 1,
"members" : [
{
"_id" : 0,
"name" : "shard01-primary.mycompanys.com:27018",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 1032189,
"optime" : Timestamp(1406913104, 6),
"optimeDate" : ISODate("2014-08-01T17:11:44Z"),
"electionTime" : Timestamp(1406110686, 1),
"electionDate" : ISODate("2014-07-23T10:18:06Z"),
"self" : true
},
{
"_id" : 1,
"name" : "shard01-secondary.mycompanys.com:27018",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 1032005,
"optime" : Timestamp(1406913104, 6),
"optimeDate" : ISODate("2014-08-01T17:11:44Z"),
"lastHeartbeat" : ISODate("2014-08-04T08:57:57Z"),
"lastHeartbeatRecv" : ISODate("2014-08-04T08:57:57Z"),
"pingMs" : 0,
"syncingTo" : "shard01-primary.mycompanys.com:27018"
}
],
"ok" : 1
}
The connection pool settings are as follows:
db.connections.max=5000
db.connections.min=5000
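To make the scale of these settings concrete, here is a rough sketch that parses them and multiplies by the number of application servers. It assumes, hypothetically, that `db.connections.max` is an application-specific key read by each Tomcat instance and that each of the 2 Tomcat instances keeps its own pool; the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Rough arithmetic on the pool settings; not tied to any driver API. */
final class PoolSettings {

    /** Parses db.connections.max from properties-file text. */
    static int maxPerInstance(String propertiesText) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(propertiesText));
        return Integer.parseInt(props.getProperty("db.connections.max").trim());
    }

    /** Upper bound on connections if every instance fills its own pool. */
    static int totalAcrossInstances(int perInstance, int instances) {
        return perInstance * instances;
    }

    public static void main(String[] args) throws IOException {
        String settings = "db.connections.max=5000\ndb.connections.min=5000\n";
        int perInstance = maxPerInstance(settings);

        // Assumption: 2 Tomcat instances, each with its own pool.
        int total = totalAcrossInstances(perInstance, 2);
        System.out.println("per instance: " + perInstance + ", total: " + total);
    }
}
```

Under those assumptions the two application servers together could hold up to 10000 pooled connections, which is useful context next to the 3500 concurrent users in the test. Because min equals max, the pool would be expected to stay at its full size rather than grow on demand.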
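Any pointers on fixing these errors would be appreciated.
Update in response to Markus
Do you have a two-member replica set?
Yes, we have a 2-member replica set (primary, secondary). This is our shard01.
Do you use MMS for monitoring?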
Yes, we do. But we can provide you with sh.status():
mongos> sh.status({verbose:true})
--- Sharding Status ---
sharding version: {
"_id" : 1,
"version" : 4,
"minCompatibleVersion" : 4,
"currentVersion" : 5,
"clusterId" : ObjectId("53cf92e43476cd1989296134")
}
shards:
{ "_id" : "shard01-sh", "host" : "shard01/shard01-primary.mycompanys.com:27018,shard01-secondary.mycompanys.com:27018" }
databases:
{ "_id" : "admin", "partitioned" : false, "primary" : "config" }
{ "_id" : "my_app", "partitioned" : false, "primary" : "shard01-sh" }
{ "_id" : "test", "partitioned" : false, "primary" : "shard01-sh" }
{ "_id" : "my_app_load1", "partitioned" : true, "primary" : "shard01-sh" }
my_app_load1.users
shard key: { "_id" : 1 }
chunks:
shard01-sh 13
{ "_id" : { "$minKey" : 1 } } -->> { "_id" : ObjectId("119de91b3e18488b70e497a0") } on : shard01-sh Timestamp(1, 1)
{ "_id" : ObjectId("119de91b3e18488b70e497a0") } -->> { "_id" : ObjectId("26b5524ea883044d602a56f0") } on : shard01-sh Timestamp(1, 17)
{ "_id" : ObjectId("26b5524ea883044d602a56f0") } -->> { "_id" : ObjectId("3c2659b4eb7ae237566420e4") } on : shard01-sh Timestamp(1, 18)
{ "_id" : ObjectId("3c2659b4eb7ae237566420e4") } -->> { "_id" : ObjectId("5b4be31feb7ae97c1e42e0e4") } on : shard01-sh Timestamp(1, 13)
{ "_id" : ObjectId("5b4be31feb7ae97c1e42e0e4") } -->> { "_id" : ObjectId("6af6d205a883028e0c17d6f0") } on : shard01-sh Timestamp(1, 23)
{ "_id" : ObjectId("6af6d205a883028e0c17d6f0") } -->> { "_id" : ObjectId("7c2752cbeb7aefbc1ff6f0e4") } on : shard01-sh Timestamp(1, 24)
{ "_id" : ObjectId("7c2752cbeb7aefbc1ff6f0e4") } -->> { "_id" : ObjectId("954759cceb7aea12f666f0e4") } on : shard01-sh Timestamp(1, 15)
{ "_id" : ObjectId("954759cceb7aea12f666f0e4") } -->> { "_id" : ObjectId("b1de2d00eb7ae972f93180e4") } on : shard01-sh Timestamp(1, 16)
{ "_id" : ObjectId("b1de2d00eb7ae972f93180e4") } -->> { "_id" : ObjectId("c3d81bbca8830722302a5420") } on : shard01-sh Timestamp(1, 21)
{ "_id" : ObjectId("c3d81bbca8830722302a5420") } -->> { "_id" : ObjectId("d642db1ac29660293b70e497") } on : shard01-sh Timestamp(1, 22)
{ "_id" : ObjectId("d642db1ac29660293b70e497") } -->> { "_id" : ObjectId("e8afdf84a883072ba6e88420") } on : shard01-sh Timestamp(1, 19)
{ "_id" : ObjectId("e8afdf84a883072ba6e88420") } -->> { "_id" : ObjectId("fd1771c93e1847d350e497a0") } on : shard01-sh Timestamp(1, 20)
{ "_id" : ObjectId("fd1771c93e1847d350e497a0") } -->> { "_id" : { "$maxKey" : 1 } } on : shard01-sh Timestamp(1, 4)
{ "_id" : "my_app_inbox_load1", "partitioned" : true, "primary" : "shard01-sh" }
my_app_inbox_load1.inbox
shard key: { "receiver_id" : 1 }
chunks:
shard01-sh 20
{ "receiver_id" : { "$minKey" : 1 } } -->> { "receiver_id" : "0003fd94eb7aed675be420e4" } on : shard01-sh Timestamp(1, 17)
{ "receiver_id" : "0003fd94eb7aed675be420e4" } -->> { "receiver_id" : "154b48b2eb7ae977588b70e4" } on : shard01-sh Timestamp(1, 19)
{ "receiver_id" : "154b48b2eb7ae977588b70e4" } -->> { "receiver_id" : "26022e7eeb7aefb6ea5ac0e4" } on : shard01-sh Timestamp(1, 23)
{ "receiver_id" : "26022e7eeb7aefb6ea5ac0e4" } -->> { "receiver_id" : "37f8d531c296675666f0e497" } on : shard01-sh Timestamp(1, 24)
{ "receiver_id" : "37f8d531c296675666f0e497" } -->> { "receiver_id" : "41bcd983a883072cd2fc96f0" } on : shard01-sh Timestamp(1, 37)
{ "receiver_id" : "41bcd983a883072cd2fc96f0" } -->> { "receiver_id" : "4cfd5606eb7aecd6ed2420e4" } on : shard01-sh Timestamp(1, 38)
{ "receiver_id" : "4cfd5606eb7aecd6ed2420e4" } -->> { "receiver_id" : "622680c0eb7aecd6e88ac0e4" } on : shard01-sh Timestamp(1, 21)
{ "receiver_id" : "622680c0eb7aecd6e88ac0e4" } -->> { "receiver_id" : "6df5ff8aeb7aea143936f0e4" } on : shard01-sh Timestamp(1, 25)
{ "receiver_id" : "6df5ff8aeb7aea143936f0e4" } -->> { "receiver_id" : "80aabb00eb7ae237593590e4" } on : shard01-sh Timestamp(1, 26)
{ "receiver_id" : "80aabb00eb7ae237593590e4" } -->> { "receiver_id" : "8ad740cbeb7aecddaff590e4" } on : shard01-sh Timestamp(1, 33)
{ "receiver_id" : "8ad740cbeb7aecddaff590e4" } -->> { "receiver_id" : "95e04ae3eb7aecd58be6f0e4" } on : shard01-sh Timestamp(1, 34)
{ "receiver_id" : "95e04ae3eb7aecd58be6f0e4" } -->> { "receiver_id" : "9fd32b25eb7aeba6ea5030e4" } on : shard01-sh Timestamp(1, 31)
{ "receiver_id" : "9fd32b25eb7aeba6ea5030e4" } -->> { "receiver_id" : "b05d1766eb7aecd7588590e4" } on : shard01-sh Timestamp(1, 32)
{ "receiver_id" : "b05d1766eb7aecd7588590e4" } -->> { "receiver_id" : "bab06fdfeb7ae8c587dac0e4" } on : shard01-sh Timestamp(1, 29)
{ "receiver_id" : "bab06fdfeb7ae8c587dac0e4" } -->> { "receiver_id" : "c8dbfa5feb7aee075be590e4" } on : shard01-sh Timestamp(1, 30)
{ "receiver_id" : "c8dbfa5feb7aee075be590e4" } -->> { "receiver_id" : "d4471acdeb7ae8c4388420e4" } on : shard01-sh Timestamp(1, 27)
{ "receiver_id" : "d4471acdeb7ae8c4388420e4" } -->> { "receiver_id" : "e53cf32d3e184ff180e497a0" } on : shard01-sh Timestamp(1, 28)
{ "receiver_id" : "e53cf32d3e184ff180e497a0" } -->> { "receiver_id" : "eecfd315a88305f2375ff6f0" } on : shard01-sh Timestamp(1, 35)
{ "receiver_id" : "eecfd315a88305f2375ff6f0" } -->> { "receiver_id" : "ffd9ee77c296619a52e0e497" } on : shard01-sh Timestamp(1, 36)
{ "receiver_id" : "ffd9ee77c296619a52e0e497" } -->> { "receiver_id" : { "$maxKey" : 1 } } on : shard01-sh Timestamp(1, 4)
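Time frame?
It happens somewhere in the middle of the test run. We have run the same test twice, and both runs produced the same error rate. We only cleaned up the data between runs, so the shards and chunks already existed (created by the first run) when the second run took place.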
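【Discussion】:
-
A few questions: Do you have a two-member replica set? Do you use MMS for monitoring? Or can you at least provide us with the output of sh.status()? Please also give the exact time frame of the load test and the times at which these errors occurred.
-
@MarkusWMahlberg I have answered your questions by updating the question. If the information above is not sufficient, I can rerun the test and provide more details about the time frame.
-
I need the exact time frame, that is, the start and end times of the test, so that I can correlate it with your status and elections.
Tags: java mongodb tomcat amazon-ec2 performance-testing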