[Title]: pgpool-II in master/slave mode: how can I most easily trigger a failover?
[Posted]: 2016-03-01 23:02:31
[Question]:

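I am testing a toy PostgreSQL infrastructure on some local VMs to see how pgpool behaves on failover. I have configured a basic setup with two database machines (192.168.0.2 and 192.168.0.3) and one pgpool machine (192.168.0.4). 192.168.0.3 is set up as a slave of 192.168.0.2 using streaming replication. pgpool-II is configured as follows: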

listen_addresses = '*'
backend_hostname0 = '192.168.0.2'
backend_port0 = 5432
backend_weight0 = 1
backend_data_directory0 = '/var/lib/postgresql/9.4/main/'
backend_flag0 = 'ALLOW_TO_FAILOVER'
backend_hostname1 = '192.168.0.3'
backend_port1 = 5432
backend_weight1 = 1
backend_data_directory1 = '/var/lib/postgresql/9.4/main/'
backend_flag1 = 'ALLOW_TO_FAILOVER'
enable_pool_hba = on
replication_mode = false
master_slave_mode = on
master_slave_sub_mode = 'stream'
fail_over_on_backend_error = true
failover_command = '/root/pgpool_failover_stream.sh %d %H /tmp/postgresql.trigger.5432'
load_balance_mode = false

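I have confirmed that all of this works. That is, replication happens when I change the master database, and I can connect to the master, the slave, and pgpool-II from a sample application and get the results I expect.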

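Now I start a long-running application that connects to pgpool, then try to force a failover by SSHing into the master database server and stopping the postgres service (service postgresql stop as root). My application continues to execute queries correctly, but no failover occurs (the script is never run). I even tested connecting directly to the master database, and in that case stopping the postgres service did eventually crash the application.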

Am I doing this wrong? Is my pgpool misconfigured? Or is there a better way to trigger a failover?

Edit: as requested, here is the part of the log where the first errors occur:

...
2016-03-15 18:47:15: pid 1232: DEBUG:  initializing backend status
2016-03-15 18:47:15: pid 1231: DEBUG:  initializing backend status
2016-03-15 18:47:15: pid 1230: DEBUG:  initializing backend status
2016-03-15 18:47:15: pid 1209: ERROR:  failed to authenticate
2016-03-15 18:47:15: pid 1209: DETAIL:  invalid authentication message response type, Expecting 'R' and received 'E'
2016-03-15 18:47:15: pid 1209: LOG:  find_primary_node: checking backend no 1
2016-03-15 18:47:15: pid 1209: ERROR:  failed to authenticate
2016-03-15 18:47:15: pid 1209: DETAIL:  invalid authentication message response type, Expecting 'R' and received 'E'
2016-03-15 18:47:15: pid 1209: DEBUG:  find_primary_node: no primary node found
...

Strangely, I can still connect to pgpool and execute queries, so there is clearly something here I do not understand.

Edit 2: these are the errors I get after running service postgresql shutdown on the master. I am showing everything up to the point where pgpool starts shutting down.

...
2016-03-16 17:24:57: pid 1012: DEBUG:  session context: clearing doing extended query messaging. DONE
2016-03-16 17:24:57: pid 1012: DEBUG:  session context: setting doing extended query messaging. DONE
2016-03-16 17:24:57: pid 1012: DEBUG:  session context: setting query in progress. DONE
2016-03-16 17:24:57: pid 1012: DEBUG:  reading backend data packet kind
2016-03-16 17:24:57: pid 1012: DETAIL:  backend:0 of 2 kind = 'E'
2016-03-16 17:24:57: pid 1012: DEBUG:  processing backend response
2016-03-16 17:24:57: pid 1012: DETAIL:  received kind 'E'(45) from backend
2016-03-16 17:24:57: pid 1012: ERROR:  unable to forward message to frontend
2016-03-16 17:24:57: pid 1012: DETAIL:  FATAL error occured on backend
2016-03-16 17:24:57: pid 1012: DEBUG:  session context: setting query in progress. DONE
2016-03-16 17:24:57: pid 1012: DEBUG:  decide where to send the queries
2016-03-16 17:24:57: pid 1012: DETAIL:  destination = 3 for query= "DISCARD ALL"
2016-03-16 17:24:57: pid 1012: DEBUG:  waiting for query response
2016-03-16 17:24:57: pid 1012: DETAIL:  waiting for backend:0 to complete the query
2016-03-16 17:24:57: pid 1012: FATAL:  unable to read data from DB node 0
2016-03-16 17:24:57: pid 1012: DETAIL:  EOF encountered with backend
2016-03-16 17:24:57: pid 998: DEBUG:  reaper handler
2016-03-16 17:24:57: pid 998: LOG:  child process with pid: 1012 exits with status 256
2016-03-16 17:24:57: pid 998: LOG:  fork a new child process with pid: 1033
2016-03-16 17:24:57: pid 998: DEBUG:  reaper handler: exiting normally
2016-03-16 17:24:57: pid 1033: DEBUG:  initializing backend status
2016-03-16 17:25:02: pid 1031: DEBUG:  PCP child receives shutdown request signal 2
2016-03-16 17:25:02: pid 1029: LOG:  child process received shutdown request signal 2
...

Note that my sample application does in fact die when the master is shut down.

Edit 3: after setting sr_check_period, sr_check_user, and sr_check_password correctly, all of the previous errors are gone, but the new log shows these errors instead:

2016-03-31 17:45:00: pid 18363: DEBUG:  detect error: kind: 1
2016-03-31 17:45:00: pid 18363: DEBUG:  reading backend data packet kind
2016-03-31 17:45:00: pid 18363: DETAIL:  backend:0 of 2 kind = '1'
...
2016-03-31 17:45:00: pid 18363: DEBUG:  detect error: kind: S

    Tags: postgresql failover postgresql-9.4 pgpool


    [Answer 1]:

    There can be several reasons why the failover script is not executed. The first step is to set log_destination to syslog and enable debug mode (debug_level = 1).

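    I have seen cases where the failover script did not receive the %d and %H arguments (the special placeholders), and as a result it could not SSH to the slave and touch the trigger file.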
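A minimal sketch of what such a failover script might look like, so the failure mode above is easier to spot. It assumes the argument order from the failover_command in the question (%d, %H, trigger file path), that the pgpool user can SSH to the standby without a password, and that the remote account is postgres. The run helper and the RUN variable are hypothetical conveniences added here, not part of pgpool.

```shell
#!/bin/sh
# Hypothetical failover script sketch.
#   $1 = %d  (ID of the node that failed)
#   $2 = %H  (hostname of the new master)
#   $3 = path of the trigger file on the new master
# Assumption: passwordless SSH as 'postgres' to the new master.

# Dry run by default: print the command. Set RUN=1 to execute it.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

failover() {
    failed_node_id="$1"
    new_master_host="$2"
    trigger_file="$3"

    # Fail loudly if pgpool did not substitute the placeholders.
    if [ -z "$failed_node_id" ] || [ -z "$new_master_host" ] || [ -z "$trigger_file" ]; then
        echo "usage: failover <failed_node_id> <new_master_host> <trigger_file>" >&2
        return 1
    fi

    echo "node $failed_node_id failed; creating $trigger_file on $new_master_host" >&2
    # BatchMode makes ssh fail instead of prompting for a password.
    run ssh -T -o BatchMode=yes "postgres@$new_master_host" "touch '$trigger_file'"
}

# Dry run with the addresses from the question.
failover 0 192.168.0.3 /tmp/postgresql.trigger.5432
```

Running the printed ssh command by hand, as the same OS user that runs pgpool, is a quick way to check whether the SSH step itself is what fails.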

    If you post the log file, I can give more details.

    Based on the new logs: I can see an ERROR: failed to authenticate. Could you check whether the following pgpool parameters are configured correctly?

    health_check_user
    health_check_password
    recovery_user
    recovery_password
    wd_lifecheck_user
    wd_lifecheck_password
    sr_check_user
    sr_check_password
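As a rough aid for the checklist above, the following sketch scans a pgpool.conf for those parameters and reports any that are commented out or have an empty value. The config path is an assumption (it varies by distribution), and an empty password can be legitimate if the backends use trust authentication, so treat MISSING as a hint rather than a verdict.

```shell
#!/bin/sh
# Sketch: report which credential parameters are set in a pgpool.conf.
check_params() {
    conf="$1"
    missing=0
    for param in health_check_user health_check_password \
                 recovery_user recovery_password \
                 wd_lifecheck_user wd_lifecheck_password \
                 sr_check_user sr_check_password; do
        # Accept only uncommented lines with a non-empty quoted value.
        if grep -Eq "^[[:space:]]*${param}[[:space:]]*=[[:space:]]*'[^']+'" "$conf"; then
            echo "ok      $param"
        else
            echo "MISSING $param"
            missing=1
        fi
    done
    return "$missing"
}

# Hypothetical location; adjust to where your pgpool.conf lives.
CONF=/etc/pgpool2/pgpool.conf
if [ -f "$CONF" ]; then
    check_params "$CONF" || echo "some parameters are unset in $CONF" >&2
fi
```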

    I hope you have followed the steps to change the postgres user's password:

    alter user postgres password 'yourpassword'
    

    and make sure the same password is supplied in every one of those settings.
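One way to confirm that a single set of credentials works on both backends is to log in to each one directly from the pgpool machine and ask whether it is in recovery. The sketch below assumes psql is installed there, that the check user is postgres, and that the password comes from PGPASSWORD or ~/.pgpass. The run helper and RUN variable are hypothetical; by default the commands are only printed.

```shell
#!/bin/sh
# Sketch: test the same credentials against every backend.

# Dry run by default: print the command. Set RUN=1 to execute it.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

check_backend() {
    # $1 = host, $2 = port, $3 = user
    # -A unaligned output, -t tuples only, -c run a single command.
    run psql -h "$1" -p "$2" -U "$3" -d postgres -Atc 'SELECT pg_is_in_recovery()'
}

# Backends from the question.
for host in 192.168.0.2 192.168.0.3; do
    check_backend "$host" 5432 postgres
done
```

With RUN=1, a healthy master should answer f and a streaming standby t. An authentication error here points to pg_hba.conf or the password rather than to pgpool.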

    From the logs, this looks like an authentication problem. Can you tell me which pgpool version you are using?

    This is the configuration we use for a 3-machine setup (1 master, 1 slave, and 1 machine for pgpool). I have modified it to match your IP addresses.

      listen_addresses = '*'
      port = 5433
      socket_dir = '/var/run/postgresql'
      pcp_port = 9898
      pcp_socket_dir = '/var/run/postgresql'
    
      backend_hostname0 = '192.168.0.2'
      backend_port0 = 5432
      backend_weight0 = 1
      backend_data_directory0 = '/var/lib/postgresql/9.4/main'
      backend_flag0 = 'ALLOW_TO_FAILOVER'
    
      backend_hostname1 = '192.168.0.3'
      backend_port1 = 5432
      backend_weight1 = 1
      backend_data_directory1 = '/var/lib/postgresql/9.4/main'
      backend_flag1 = 'ALLOW_TO_FAILOVER'
    
      enable_pool_hba = on
      pool_passwd = ''
      authentication_timeout = 60
      ssl = off
      num_init_children = 4
      max_pool = 2
      child_life_time = 300 
      child_max_connections = 0
      connection_life_time = 0
      client_idle_limit = 0
      log_destination = 'stderr,syslog'
      print_timestamp = on
      log_connections = on
      log_hostname = on
      log_statement = on
      log_per_node_statement = on
      log_standby_delay = 'none'
      syslog_facility = 'LOCAL0'
      syslog_ident = 'pgpool'
      debug_level = 1
      pid_file_name = '/var/run/postgresql/pgpool.pid'
      logdir = '/var/log/postgresql'
      connection_cache = on
      reset_query_list = 'ABORT; DISCARD ALL'
    
      replication_mode = off
      replicate_select = off
      insert_lock = on
      lobj_lock_table = ''
      replication_stop_on_mismatch = off
      failover_if_affected_tuples_mismatch = off
    
      load_balance_mode = off
      ignore_leading_white_space = on
      white_function_list = ''
      black_function_list = 'nextval,setval'
    
      master_slave_mode = on
      master_slave_sub_mode = 'stream'
      sr_check_period = 10
      sr_check_user = 'postgres'
      sr_check_password = 'postgres123'
      delay_threshold = 0
      follow_master_command = ''
      parallel_mode = off
      pgpool2_hostname = 'pgmaster'
    
      system_db_hostname  = 'localhost'
      system_db_port = 5432
      system_db_dbname = 'pgpool'
      system_db_schema = 'pgpool_catalog'
      system_db_user = 'pgpool'
      system_db_password = ''
    
      health_check_period = 5
      health_check_timeout = 20
      health_check_user = 'postgres'
      health_check_password = 'postgres123'
      health_check_max_retries = 2
      health_check_retry_delay = 1
    
      failover_command = '/usr/sbin/failover_modified.sh %d "%H" %P /var/lib/postgresql/9.4/main/pgsql.trigger.5432'
      failback_command = ''
      fail_over_on_backend_error = on
      search_primary_node_timeout = 10
    
      recovery_user = 'postgres'
      recovery_password = 'postgres123'
      recovery_1st_stage_command = ''
      recovery_2nd_stage_command = ''
      recovery_timeout = 90
      client_idle_limit_in_recovery = 0
    
      use_watchdog = off
      trusted_servers = ''
      ping_path = '/bin'
      wd_hostname = ''
      wd_port = 9000
      wd_authkey = ''
      delegate_IP = ''
      ifconfig_path = '/sbin'
      if_up_cmd = 'ifconfig eth0:0 inet $_IP_$ netmask 255.255.255.0'
      if_down_cmd = 'ifconfig eth0:0 down'
      arping_path = '/usr/sbin'  
      arping_cmd = 'arping -U $_IP_$ -w 1'
    
      clear_memqcache_on_escalation = on
      wd_escalation_command = ''
    
      wd_lifecheck_method = 'heartbeat'
      wd_interval = 10
      wd_heartbeat_port = 9694
      wd_heartbeat_keepalive = 2
      wd_heartbeat_deadtime = 30
      heartbeat_destination0 = '192.168.0.2'
      heartbeat_destination_port0 = 9694
      heartbeat_device0 = ''
    
      heartbeat_destination1 = '192.168.0.3'
      wd_life_point = 3
      wd_lifecheck_query = 'SELECT 1'
      wd_lifecheck_dbname = 'postgres'
      wd_lifecheck_user = 'postgres'
      wd_lifecheck_password = 'postgres123'
    
      relcache_expire = 0
      relcache_size = 256
      check_temp_table = on
    
      memory_cache_enabled = off
      memqcache_method = 'shmem'
      memqcache_memcached_host = 'localhost'
      memqcache_memcached_port = 11211
      memqcache_total_size = 67108864
      memqcache_max_num_cache = 1000000
      memqcache_expire = 0
      memqcache_auto_cache_invalidation = on
      memqcache_maxcache = 409600
      memqcache_cache_block_size = 1048576
      memqcache_oiddir = '/var/log/pgpool/oiddir'
      white_memqcache_table_list = ''
      black_memqcache_table_list = ''
    

    Also, I hope you have modified pool_hba.conf to allow access to the master and the slave.

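    [Comments]:

    • Hi Raveesh, thanks for the reply! I have enabled logging, and even at startup I noticed some errors that look like they could be related. I have edited my question to include the necessary information.
    • Can you provide the logs from after the master is shut down? I do not think these logs point to the real reason why the failover script is not executed.
    • Updated again with the requested log information.
    • Sorry, it has been a while. Setting sr_check_period and friends removed the errors I had before. Now I have new errors! Edit: that did not help; I have updated the question.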