Why does a simple ALTER statement crash Galera?

【Question title】: Why does Galera crash on a simple ALTER statement
【Posted】: 2019-01-15 14:16:15
【Question】:

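I have a 5-node MariaDB 10.2.14 Galera cluster. The database is simple and straightforward, about 20 GB. No triggers. Many indexes and foreign keys. When I alter an empty or small table (adding a column) from the mysql command-line client on one of the multi-master nodes, the whole cluster crashes. Why? I have never had this problem on other Galera systems. The operating system is RedHat 6.10.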

Can anyone help? Here is the error log from one of the servers:

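When a simple table is changed with a simple ALTER statement, the 5-node multi-master Galera cluster stops working and the table is corrupted. This has happened several times, with different tables and simple ALTER statements (no triggers).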

The MySQL error log shows the following:

2019-01-15 10:47:19 140487941920512 [Note] WSREP: Member 1.0 (server.company.local) synced with group.
2019-01-15 11:07:45 140487941920512 [Note] WSREP: Member 1.0 (server.company.local) desyncs itself from group
2019-01-15 11:07:46 140487941920512 [Note] WSREP: Member 1.0 (server.company.local) resyncs itself to group
2019-01-15 11:07:46 140487941920512 [Note] WSREP: Member 1.0 (server.company.local) synced with group.
2019-01-15 11:27:40 140487941920512 [Note] WSREP: Member 1.0 (server.company.local) desyncs itself from group
2019-01-15 11:27:41 140487941920512 [Note] WSREP: Member 1.0 (server.company.local) resyncs itself to group
2019-01-15 11:27:41 140487941920512 [Note] WSREP: Member 1.0 (server.company.local) synced with group.
2019-01-15 11:47:23 140487941920512 [Note] WSREP: Member 1.0 (server.company.local) desyncs itself from group
2019-01-15 11:47:24 140487941920512 [Note] WSREP: Member 1.0 (server.company.local) resyncs itself to group
2019-01-15 11:47:24 140487941920512 [Note] WSREP: Member 1.0 (server.company.local) synced with group.
2019-01-15 12:24:39 140452405958400 [Note] WSREP: MDL BF-BF conflict

schema:  databasename
request: (8227134       seqno 46874664  wsrep (2, 1, 0) cmd 3 3         ALTER TABLE `aagenda` ADD `id_subject_cat` int(11) NULL DEFAULT '0' AFTER `id_subject`, ADD INDEX `id_s$
granted: (15    seqno 46874665  wsrep (1, 0, 0) cmd 0 147       (null))
2019-01-15 12:24:40 140452405958400 [Note] WSREP: MDL BF-BF conflict
schema:  databasename
request: (8227134       seqno 46874664  wsrep (2, 1, 0) cmd 3 3         ALTER TABLE `aagenda` ADD `id_subject_cat` int(11) NULL DEFAULT '0' AFTER `id_subject`, ADD INDEX `id_s$
granted: (15    seqno 46874665  wsrep (1, 0, 0) cmd 0 147       (null))
2019-01-15 12:24:40 140452405958400 [Note] WSREP: MDL BF-BF conflict
schema:  databasename
request: (8227134       seqno 46874664  wsrep (2, 1, 0) cmd 3 3         ALTER TABLE `aagenda` ADD `id_subject_cat` int(11) NULL DEFAULT '0' AFTER `id_subject`, ADD INDEX `id_s$
granted: (11    seqno 46874666  wsrep (1, 0, 0) cmd 0 147       (null))
2019-01-15 12:24:40 0x7fbd9fc3d700  InnoDB: Assertion failure in file /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.14/storage/innobase/row/row0merge.cc l$

InnoDB: Failing assertion: table->get_ref_count() == 0
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: https://mariadb.com/kb/en/library/xtradbinnodb-recovery-modes/
InnoDB: about forcing recovery.

190115 12:24:40 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Server version: 10.2.14-MariaDB-log
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=837
max_threads=1502
thread_count=280

It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 3431472 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7fbe2d906c18
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went terribly wrong...
stack_bottom = 0x7fbd9fc3cd80 thread_stack 0x49000
/usr/sbin/mysqld(my_print_stacktrace+0x2b)[0x55f4e00d8fab]
/usr/sbin/mysqld(handle_fatal_signal+0x535)[0x55f4dfbad005]
/lib64/libpthread.so.0(+0xf7e0)[0x7fc5f97f67e0]
/lib64/libc.so.6(gsignal+0x35)[0x7fc5f7e50495]
/lib64/libc.so.6(abort+0x175)[0x7fc5f7e51c75]
/usr/sbin/mysqld(+0x47c4eb)[0x55f4df97a4eb]
/usr/sbin/mysqld(+0x90edcc)[0x55f4dfe0cdcc]
/usr/sbin/mysqld(+0x873236)[0x55f4dfd71236]
/usr/sbin/mysqld(_Z17mysql_alter_tableP3THDPcS1_P14HA_CREATE_INFOP10TABLE_LISTP10Alter_infojP8st_orderb+0x29ed)[0x55f4dfab181d]
/usr/sbin/mysqld(_ZN19Sql_cmd_alter_table7executeEP3THD+0x3ae)[0x55f4dfaf62fe]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0xf81)[0x55f4dfa2b251]
/usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_statebb+0x29a)[0x55f4dfa327ca]
/usr/sbin/mysqld(+0x5348c0)[0x55f4dfa328c0]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0x18cd)[0x55f4dfa346fd]
/usr/sbin/mysqld(_Z10do_commandP3THD+0x16e)[0x55f4dfa350ee]
/usr/sbin/mysqld(_Z24do_handle_one_connectionP7CONNECT+0x16f)[0x55f4dfaf335f]
/usr/sbin/mysqld(handle_one_connection+0x44)[0x55f4dfaf3484]
/lib64/libpthread.so.0(+0x7aa1)[0x7fc5f97eeaa1]
/lib64/libc.so.6(clone+0x6d)[0x7fc5f7f06bdd]

Trying to get some variables.

Some pointers may be invalid and cause the dump to abort.

Query (0x7fbe2d9141f0): is an invalid pointer

Connection ID (thread ID): 8227134
Status: NOT_KILLED

Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_push$

The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
We think the query pointer is invalid, but we will try to print it anyway.

Query: ALTER TABLE `aagenda` ADD `id_subject_cat` int(11) NULL DEFAULT '0' AFTER `id_subject`, ADD INDEX `id_subject_cat` (`id_subject_cat`)

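【Comments】:

This question might be better suited to dba.stackexchange.com
Are you using TOI (preferred in this case) or RSU?
File a bug with MariaDB or Codership.
Judging by the assertion failure message, this appears to have already been reported to MariaDB.
@Rick James: I used TOI, but it made no difference; two of the five nodes crashed and the altered table was corrupted. The bug is closed and will probably be fixed in the next release. The advice I got: if you want to alter a table in a Galera production environment without downtime, do this on each server: SET GLOBAL wsrep_desync = TRUE; SET SESSION wsrep_on = FALSE; --- ALTER statement --- SET SESSION wsrep_on = TRUE; SET GLOBAL wsrep_desync = FALSE; But the table structure must be backward compatible, i.e. still usable by the application.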
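
For readers unfamiliar with the TOI/RSU distinction raised in the comments, here is a minimal sketch of how the schema upgrade method can be inspected and switched on a node. It is not taken from the thread; it assumes the `wsrep_OSU_method` variable is settable at session scope on your version (if it is not, `SET GLOBAL` would be the fallback), and it is not a confirmed workaround for the crash described above.

```sql
-- Show which Online Schema Upgrade method is active (TOI is the default).
SHOW VARIABLES LIKE 'wsrep_OSU_method';

-- Assumption: session scope is available. With RSU, DDL issued in this
-- session runs on the local node only and is not replicated to the cluster.
SET SESSION wsrep_OSU_method = 'RSU';

-- ... run the DDL here, then repeat the same steps on every other node ...

-- Return the session to Total Order Isolation afterwards.
SET SESSION wsrep_OSU_method = 'TOI';
```

Under TOI the statement is replicated and executed on all nodes in the same total order, which is why a failure during the ALTER can affect several nodes at once; under RSU each node is changed separately.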

Tags: mariadb galera

【Answer 1】:

The advice I got from MariaDB:

If you want to alter a table in a Galera production environment without downtime, do this node by node:

SET GLOBAL wsrep_desync = TRUE;   -- or: SET GLOBAL wsrep_desync = ON;
SET SESSION wsrep_on = FALSE;     -- or: SET GLOBAL wsrep_on = OFF;

-- run the ALTER statement here

SET SESSION wsrep_on = TRUE;      -- or: SET GLOBAL wsrep_on = ON;
SET GLOBAL wsrep_desync = FALSE;  -- or: SET GLOBAL wsrep_desync = OFF;

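But the table structure must be backward compatible, i.e. still usable by the application; otherwise you have to stop the cluster, alter the table on one node, and restart the cluster.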
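
Because the procedure is applied one node at a time, it may help to confirm that each node has rejoined the cluster before moving on to the next. The checks below are a sketch and not part of the advice quoted above; the table name `aagenda` and the cluster size of 5 come from the question, and the values mentioned in the comments are what one would typically hope to see, not guaranteed output.

```sql
-- After setting wsrep_desync back to FALSE, check the node state.
-- A node that has caught up should normally report 'Synced' here.
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';

-- The node should still see every member (5 in the cluster from the question).
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';

-- A receive queue that keeps growing suggests the node is still catching up.
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue';

-- Confirm that the new column and index exist on this node.
SHOW CREATE TABLE `aagenda`;
```

Only when these look healthy on the current node would it make sense to repeat the procedure on the next one.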

【Comments】: