pcp_recovery_node command hangs while recovering standby

[Question title]: pcp_recovery_node command hangs while recovering standby
[Posted]: 2019-03-19 08:59:27
[Question description]:

This is one part of a cluster I am building. I run pcp_recovery_node on the master to build the standby from scratch, using the command

pcp_recovery_node -h 193.185.83.119 -p 9898 -U postgres -n 1

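Here, 193.185.83.119 is the VIP. The command successfully builds and starts the standby server on node-b (say the nodes are node-a and node-b), but it never returns and just hangs in the shell, as shown below: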

[postgres@rollc-filesrvr1 data]$ pcp_recovery_node -h 193.185.83.119 -p 9898 -U postgres -n 1
Password:

I have to press Ctrl+C to exit this session. Later, when I try to create a test database on node-a (master), I get the following error:

    postgres=# create database test;
    ERROR:  source database "template1" is being accessed by other users
    DETAIL:  There is 1 other session using the database.

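I confirmed that pgpool.service was running on node-a when I ran this command, and I have tried it with pgpool.service both on and off on node-b (standby) before issuing the pcp command. The result is the same.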

I also searched Google and adjusted the following settings in pgpool.conf. I am not sure whether the problem is related to these parameters:

wd_lifecheck_dbname in pgpool.conf

Initially, the related settings were as follows (and I got the same result):

wd_lifecheck_dbname = 'template1'
wd_lifecheck_user = 'nobody'
wd_lifecheck_password = ''

Later, I found different settings here and here, and a suggestion here, and tried different combinations, as follows:

wd_lifecheck_dbname = 'template1'
wd_lifecheck_user = 'postgres'
wd_lifecheck_password = ''

or

wd_lifecheck_dbname = 'postgres'
wd_lifecheck_user = 'postgres'
wd_lifecheck_password = ''

But none of them changed the behaviour in the shell, nor did they let me create a test database on the master. I feel I have hit a dead end.

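I still do not fully understand the purpose and meaning of the three parameters above in pgpool, and I somehow suspect that these are the ones I have configured incorrectly, although there could be other causes as well.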

Just to help, here are the environment details again.

node-a and node-b environment: RHEL 7.6
Postgres version: 10.7
pgpool-II version: 4.0.3
Replication slots + WAL archiving

Here is the log from pgpool.service on node-a:

Mar 18 21:10:17 node-a pgpool[16583]: 2019-03-18 21:10:17: pid 16642: LOG:  forked new pcp worker, pid=8534 socket=7
Mar 18 21:10:17 node-a pgpool[16583]: 2019-03-18 21:10:17: pid 8534: LOG:  starting recovering node 1
Mar 18 21:10:17 node-a pgpool[16583]: 2019-03-18 21:10:17: pid 8534: LOG:  executing recovery
Mar 18 21:10:17 node-a pgpool[16583]: 2019-03-18 21:10:17: pid 8534: DETAIL:  starting recovery command: "SELECT pgpool_recovery('recovery_1st_stage', 'node-a-ip', '/data/test/data', '5438', 1)"
Mar 18 21:10:17 node-a pgpool[16583]: 2019-03-18 21:10:17: pid 8534: LOG:  executing recovery
Mar 18 21:10:17 node-a pgpool[16583]: 2019-03-18 21:10:17: pid 8534: DETAIL:  disabling statement_timeout
Mar 18 21:10:18 node-a pgpool[16583]: 2019-03-18 21:10:18: pid 8534: LOG:  node recovery, 1st stage is done
Mar 18 21:11:37 node-a pgpool[16583]: 2019-03-18 21:11:37: pid 8534: LOG:  checking if postmaster is started
Mar 18 21:11:37 node-a pgpool[16583]: 2019-03-18 21:11:37: pid 8534: DETAIL:  trying to connect to postmaster on hostname:node-b-ip database:postgres user:postgres (retry 0 times)
...
...2 more times 
Mar 18 21:11:49 node-a pgpool[16583]: 2019-03-18 21:11:49: pid 8534: LOG:  checking if postmaster is started
Mar 18 21:11:49 node-a pgpool[16583]: 2019-03-18 21:11:49: pid 8534: DETAIL:  trying to connect to postmaster on hostname:node-a-ip database:template1 user:postgres (retry 0 times)
...it keeps retrying until I press Ctrl+C in the pcp command window. I have seen the retry count go up to 30 or more.

When checked through pgpool, node-b also never shows as up.

postgres=> show pool_nodes;
 node_id | hostname  | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | last_status_change
---------+-----------+------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
 0       | node-a-ip | 5438 | up     | 0.500000  | primary | 0          | true              | 0                 | 2019-03-18 22:59:19
 1       | node-b-ip | 5438 | down   | 0.500000  | standby | 0          | false             | 0                 | 2019-03-18 22:59:19
(2 rows)

EDIT: I was at least able to correct the last part of this question, namely attaching the standby node to the cluster:

[postgres@node-a-hostname]$ pcp_attach_node -n 1
Password:
pcp_attach_node -- Command Successful

Now the last part at least shows the correct state:

postgres=> show pool_nodes;
 node_id | hostname  | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | last_status_change
---------+-----------+------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
 0       | node-a-ip | 5438 | up     | 0.500000  | primary | 0          | false             | 0                 | 2019-03-18 22:59:19
 1       | node-b-ip | 5438 | up     | 0.500000  | standby | 0          | true              | 0                 | 2019-03-19 11:38:38
(2 rows)

But the underlying problem of not being able to create a database on node1 still remains.

EDIT2: I tried inserts and updates on the master, and they were replicated correctly to node2, but create database still does not work.

[Question discussion]:

Tags: postgresql database-replication high-availability postgresql-10 pgpool

[Solution 1]:

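First, a correction to EDIT1: pcp_attach_node did help correct the output of show pool_nodes, but it made the problem more complicated, because other commands such as

pcp_watchdog_info -h 193.185.83.119 -p 9898 -U postgres

started to hang as well. I later learned that

pcp_attach_node -n 1

is not needed at all to attach the standby or to correct the output of show pool_nodes, provided the original pcp_recovery_node completes properly on the master.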

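The root cause of the original problem, and of the watchdog command hanging later, was that the pgpool_remote_start script did not finish properly even after it had started the standby. I could see it still running in the output of

ps -ef | grep pgpool

on the master.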
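The check above can be narrowed to just the leftover script. This is a minimal sketch, not part of the original answer: the helper name leftover_remote_start is hypothetical, and it only filters whatever ps -ef output it is given.

```shell
#!/bin/bash
# Hypothetical helper: reads `ps -ef` output on stdin and prints only the
# lines that mention pgpool_remote_start.
# The [p] bracket keeps this grep from matching its own entry in ps output.
# Exit status is grep's: 0 if a leftover script was found, 1 otherwise.
leftover_remote_start() {
    grep '[p]gpool_remote_start'
}

# Usage on the master:
#   ps -ef | leftover_remote_start
```

If this still prints a line after the standby is up, the start script has not exited, which matches the symptom described here.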

I contacted the pgpool bug tracking system here, and they helped me fix it. An incorrect postgres start command in pgpool_remote_start caused the problem, so pcp_recovery_node never completed.

The correct command in pgpool_remote_start should look like this (it is the one I used):

ssh -T postgres@$REMOTE_HOST /usr/pgsql-10/bin/pg_ctl -w start -D /data/test/data 2>/dev/null 1>/dev/null </dev/null &

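whereas I had been using

ssh -T postgres@$REMOTE_HOST /usr/pgsql-10/bin/pg_ctl start -D /data/test/data

so I was missing the -w flag. I was also not redirecting stdout and stderr to /dev/null, and stdin was not redirected from /dev/null, which gives the command an immediate EOF on its input.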
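Put together, a pgpool_remote_start script built around the corrected command might look like the sketch below. It is only an illustration under stated assumptions: it assumes Pgpool-II passes the target host as $1 and the target data directory as $2, the wrapper name remote_start is hypothetical, and the pg_ctl path is the one from this answer.

```shell
#!/bin/bash
# Sketch of a pgpool_remote_start script (assumptions are marked below).
# Assumption: Pgpool-II calls this script with the host to start as $1
# and that host's data directory as $2.

# Path taken from the answer above; adjust it for your installation.
PGCTL=/usr/pgsql-10/bin/pg_ctl

# remote_start is a hypothetical wrapper name used only in this sketch.
remote_start() {
    local remote_host=$1
    local remote_pgdata=$2
    # -w makes pg_ctl wait until the server has started.
    # The three redirections apply to the local ssh process, and the
    # trailing & runs it in the background so this script can return.
    ssh -T "postgres@${remote_host}" "$PGCTL" -w start -D "$remote_pgdata" \
        2>/dev/null 1>/dev/null </dev/null &
}

# Run only when both arguments are supplied.
if [ "$#" -ge 2 ]; then
    remote_start "$1" "$2"
fi
```

Because ssh is detached from the script's stdin, stdout and stderr and runs in the background, the script itself can exit promptly, which is presumably what lets pcp_recovery_node return.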

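One point I am still not clear on, but which may help anyone facing a similar problem: start pgpool.service on the standby first, or keep it running there, before issuing the pcp command on the master.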

[Discussion]:

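To go a little deeper: I did not change any of the wd_lifecheck settings during this process. They are still as in my last attempt: wd_lifecheck_dbname = 'postgres', wd_lifecheck_user = 'postgres' and wd_lifecheck_password = 'postgres'. Only changing the pgpool_remote_start command helped solve my problem.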