How to solve the Mnesia inconsistent_database error in a clustered ejabberd environment?

【Question title】: How to solve Mnesia - inconsistent_database error in clustered ejabberd environment?
【Posted】: 2023-04-10 20:48:02
【Question description】:

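We have set up an ejabberd cluster consisting of two hosts, and we run into problems during host restarts, when we see an inconsistent database error. However, we have not been able to determine conclusively what in the configuration or in the module_init execution actually causes this behavior. Deleting the Mnesia database on node1 may help resolve the problem, but that is not desirable from an administration point of view.

We would like to request a review of the data and configuration below, and feedback on what might actually cause this behavior and how to mitigate it.

Thanks in advance.

The environment is configured as follows:

ejabberd version: 16.03
Number of hosts: 2
odbc_type: MySQL

Error log: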

    ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, other_node}

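Steps to reproduce:

Restart node1
Restart node2

Note: if the hosts are restarted in the reverse order, the problem does not reproduce.

Mnesia info:

On both nodes there appear to be two schemas (tables) whose size in memory, and possibly content, differs between the nodes: muc_online_room and our custom schema, renamed below to SCRUBBED_CUSTOM_FEATURE_SCHEMA_NAME:

Node 1: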

---> Processes holding locks <--- 
---> Processes waiting for locks <--- 
---> Participant transactions <--- 
---> Coordinator transactions <---
---> Uncertain transactions <--- 
---> Active tables <--- 
mod_register_ip: with 0        records occupying 299      words of mem
muc_online_room: with 348      records occupying 10757    words of mem
http_bind      : with 0        records occupying 299      words of mem
carboncopy     : with 0        records occupying 299      words of mem
oauth_token    : with 0        records occupying 299      words of mem
session        : with 0        records occupying 299      words of mem
session_counter: with 0        records occupying 299      words of mem
sql_pool       : with 10       records occupying 439      words of mem
route          : with 4        records occupying 405      words of mem
iq_response    : with 0        records occupying 299      words of mem
temporarily_blocked: with 0        records occupying 299      words of mem
s2s            : with 0        records occupying 299      words of mem
route_multicast: with 0        records occupying 299      words of mem
shaper         : with 2        records occupying 321      words of mem
access         : with 28       records occupying 861      words of mem
acl            : with 6        records occupying 459      words of mem
local_config   : with 32       records occupying 1293     words of mem
schema         : with 19       records occupying 2727     words of mem
SCRUBBED_CUSTOM_FEATURE_SCHEMA_NAME     : with 2457     records occupying 49953    words of mem
===> System info in version "4.12.5", debug level = none <===
opt_disc. Directory "SCRUBBED_LOCATION" is used.
use fallback at restart = false
running db nodes   = [SCRUBBED_NODE2,SCRUBBED_NODE1]
stopped db nodes   = [] 
master node tables = []
remote             = []
ram_copies         = [access,acl,carboncopy,http_bind,iq_response,
                      local_config,mod_register_ip,muc_online_room,route,
                      route_multicast,s2s,session,session_counter,shaper,
                      sql_pool,temporarily_blocked,SCRUBBED_CUSTOM_FEATURE_SCHEMA_NAME]
disc_copies        = [oauth_token,schema]
disc_only_copies   = []
[{'SCRUBBED_NODE1',disc_copies},
 {'SCRUBBED_NODE2',disc_copies}] = [schema,
                                                                  oauth_token]
[{'SCRUBBED_NODE1',ram_copies}] = [local_config,
                                                                 acl,access,
                                                                 shaper,
                                                                 sql_pool,
                                                                 mod_register_ip]
[{'SCRUBBED_NODE1',ram_copies},
 {'SCRUBBED_NODE2',ram_copies}] = [route_multicast,
                                                                 s2s,
                                                                 temporarily_blocked,
                                                                 iq_response,
                                                                 route,
                                                                 session_counter,
                                                                 session,
                                                                 carboncopy,
                                                                 http_bind,
                                                                 muc_online_room,
                                                                 SCRUBBED_CUSTOM_FEATURE_SCHEMA_NAME]
2623 transactions committed, 35 aborted, 26 restarted, 60 logged to disc
0 held locks, 0 in queue; 0 local transactions, 0 remote
0 transactions waits for other nodes: []
ok

Node 2:

mnesia:info().
---> Processes holding locks <--- 
---> Processes waiting for locks <--- 
---> Participant transactions <--- 
---> Coordinator transactions <---
---> Uncertain transactions <--- 
---> Active tables <--- 
mod_register_ip: with 0        records occupying 299      words of mem
muc_online_room: with 348      records occupying 8651     words of mem
http_bind      : with 0        records occupying 299      words of mem
carboncopy     : with 0        records occupying 299      words of mem
oauth_token    : with 0        records occupying 299      words of mem
session        : with 0        records occupying 299      words of mem
session_counter: with 0        records occupying 299      words of mem
route          : with 4        records occupying 405      words of mem
sql_pool       : with 10       records occupying 439      words of mem
iq_response    : with 0        records occupying 299      words of mem
temporarily_blocked: with 0        records occupying 299      words of mem
s2s            : with 0        records occupying 299      words of mem
route_multicast: with 0        records occupying 299      words of mem
shaper         : with 2        records occupying 321      words of mem
access         : with 28       records occupying 861      words of mem
acl            : with 6        records occupying 459      words of mem
local_config   : with 32       records occupying 1293     words of mem
schema         : with 19       records occupying 2727     words of mem
SCRUBBED_CUSTOM_FEATURE_SCHEMA_NAME     : with 2457     records occupying 38232    words of mem
===> System info in version "4.12.5", debug level = none <===
opt_disc. Directory "SCRUBBED_LOCATION" is used.
use fallback at restart = false
running db nodes   = ['SCRUBBED_NODE1','SCRUBBED_NODE2']
stopped db nodes   = [] 
master node tables = []
remote             = []
ram_copies         = [access,acl,carboncopy,http_bind,iq_response,
                      local_config,mod_register_ip,muc_online_room,route,
                      route_multicast,s2s,session,session_counter,shaper,
                      sql_pool,temporarily_blocked,SCRUBBED_CUSTOM_FEATURE_SCHEMA_NAME]
disc_copies        = [oauth_token,schema]
disc_only_copies   = []
[{'SCRUBBED_NODE1',disc_copies},
 {'SCRUBBED_NODE2',disc_copies}] = [schema,
                                                                  oauth_token]
[{'SCRUBBED_NODE1',ram_copies},
 {'SCRUBBED_NODE2',ram_copies}] = [route_multicast,
                                                                 s2s,
                                                                 temporarily_blocked,
                                                                 iq_response,
                                                                 route,
                                                                 session_counter,
                                                                 session,
                                                                 carboncopy,
                                                                 http_bind,
                                                                 muc_online_room,
                                                                 SCRUBBED_CUSTOM_FEATURE_SCHEMA_NAME]
[{'SCRUBBED_NODE2',ram_copies}] = [local_config,
                                                                 acl,access,
                                                                 shaper,
                                                                 sql_pool,
                                                                 mod_register_ip]
2998 transactions committed, 18 aborted, 0 restarted, 99 logged to disc
0 held locks, 0 in queue; 0 local transactions, 0 remote
0 transactions waits for other nodes: []
ok

【Question discussion】:

Tags: ejabberd mnesia

【Solution 1】:

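> Note: if the hosts are restarted in the reverse order, the problem does not reproduce.

The inconsistent_database error exists to protect your data. If you stop the cluster in one order, you must restart it in the reverse order. Otherwise, the node that was stopped first will have recorded that other nodes were still running with more up-to-date information; this is what prevents data loss.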
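
As a minimal sketch of that ordering rule for a two-node cluster, the functions below stop node1 and then node2, and start them again in the reverse order. It assumes SSH access to both hosts and `ejabberdctl` on each host's PATH; the host names and the function names `remote_ctl`, `stop_cluster`, and `start_cluster` are hypothetical, and the stop order shown is only one possible choice.

```shell
# Hypothetical host names; override them to match your environment.
NODE1_HOST="${NODE1_HOST:-node1.example.com}"
NODE2_HOST="${NODE2_HOST:-node2.example.com}"

# Run one ejabberdctl command on a remote host (assumes SSH access).
remote_ctl() {
    host="$1"
    shift
    ssh "$host" ejabberdctl "$@"
}

# Stop the whole cluster: node1 first, node2 last.
# "stopped" blocks until the node has actually shut down.
stop_cluster() {
    remote_ctl "$NODE1_HOST" stop && remote_ctl "$NODE1_HOST" stopped &&
    remote_ctl "$NODE2_HOST" stop && remote_ctl "$NODE2_HOST" stopped
}

# Start in the reverse order: the node stopped last (node2) comes up first.
# "started" blocks until the node is fully started.
start_cluster() {
    remote_ctl "$NODE2_HOST" start && remote_ctl "$NODE2_HOST" started &&
    remote_ctl "$NODE1_HOST" start && remote_ctl "$NODE1_HOST" started
}
```

Calling `stop_cluster` and later `start_cluster` applies the rule; each step runs only if the previous one succeeded, so a failed stop or start halts the sequence.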

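【Discussion】:

Thanks for your attention, Mickaël.
When I use the term restart, I mean stopping and starting the same node. In our environment we can restart the second node at any time, but to restart the first node gracefully, the second node needs to be down. stop02, stop01, start01, start02 works just as well as stop01, stop02, start01, start02. However, stop01, stop02, start02, start01 does not work. I am tempted to conclude that node01 is some kind of cluster master that needs to be started first. The reason we want to restart individual nodes is to keep a node instance available and avoid downtime.
The alternative suggested by our systems engineer is to remove the nodes from the cluster, make the changes, and rejoin them, accepting the overhead, because the ordering would otherwise also have to be managed when a host simply fails and becomes unresponsive. I think a better way to phrase the question is: "What is the correct way to restart nodes while keeping the cluster running at all times during major changes that are not at the system level?"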
If you just want to update node by node and keep the cluster up, simply stop and then restart one node. Do not stop the whole cluster.
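
A minimal sketch of such a single-node restart, to be run on the node being updated while the other node keeps serving. It assumes `ejabberdctl` is on the PATH; the function name `restart_local_node` and the `EJABBERDCTL` override variable are hypothetical.

```shell
# Name or path of the control script; override if ejabberdctl is not on PATH.
EJABBERDCTL="${EJABBERDCTL:-ejabberdctl}"

# Restart only the local node; the rest of the cluster is left running.
restart_local_node() {
    # Ask the node to stop, then block until it has fully stopped.
    "$EJABBERDCTL" stop || return 1
    "$EJABBERDCTL" stopped || return 1
    # Start it again and block until it is fully started.
    "$EJABBERDCTL" start || return 1
    "$EJABBERDCTL" started || return 1
    # Print the node status so the operator can confirm it is running.
    "$EJABBERDCTL" status
}
```

Run `restart_local_node` on one host, confirm the node has rejoined, and only then repeat on the other host, so that at least one node is always up.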