Both FEs in a two-node FE cluster went down. Restarting the master node first produced this error:
2022-11-25 11:27:25,797 INFO (UNKNOWN 192.168.21.5_9010_1657260956519(-1)|1) [Catalog.waitForReady():876] wait catalog to be ready. FE type: UNKNOWN. is ready: false
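According to the operations guide on the official site, this problem has two possible causes:
1. The local IP that the FE picked up at this startup differs from the one used at the previous startup, usually because priority_networks is not set correctly.
2. Most of the follower FE nodes in the cluster are not running, so no master can be elected.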
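priority_networks has always been configured and the machine's network configuration has never changed, so it had to be the second cause, and I tried starting the other FE. Note that at this point 192.168.21.5 above had not fully started: its MySQL query port (9030) never came up.
The other FE then reported a different error: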
2022-11-25 13:49:29,753 ERROR (main|1) [BDBEnvironment.setup():198] error to open replicated environment. will exit.
com.sleepycat.je.EnvironmentFailureException:
(JE 18.3.12) 192.168.21.4_9010_1657261636912(-1):/home/ctgcdt/data/doris-meta/bdb recoveryTracker should overlap or follow on disk last VLSN of 40,001,327 recoveryFirst= 40,001,329 UNEXPECTED_STATE_FATAL: Unexpected internal state, unable to continue. Environment is invalid and must be closed.
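This is a bdbje bug that has not been fixed yet. When it happens, the only way to restore the metadata is to follow the failure recovery procedure in the official operations documentation.
Start the master in recovery mode
Find the FE node with the newest metadata (192.168.21.5) and start it with metadata_failure_recovery=true. It will become the master. Then remove that setting and restart it, drop the other followers, and add them back.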
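The two manual parts of this step, picking the FE with the newest metadata and switching the recovery setting on and off, can be sketched as small shell helpers. This is only a sketch under assumptions: latest_image_id and set_recovery_flag are hypothetical names, checkpoint files are assumed to be named image.<number> under the image directory of the metadata directory, and the setting is assumed to live in fe.conf.

```shell
#!/usr/bin/env bash

# Print the largest numeric suffix among <meta_dir>/image/image.<N> files.
# Assumption: a larger suffix means newer metadata on that FE.
latest_image_id() {
    local meta_dir="$1" best=-1 f id
    for f in "$meta_dir"/image/image.*; do
        [ -e "$f" ] || continue            # the glob matched nothing
        id="${f##*.}"                      # text after the last dot
        case "$id" in
            ''|*[!0-9]*) continue ;;       # skip non-numeric suffixes
        esac
        if [ "$id" -gt "$best" ]; then
            best="$id"
        fi
    done
    if [ "$best" -ge 0 ]; then
        echo "$best"
    else
        return 1                           # no image file found
    fi
}

# Turn metadata_failure_recovery on or off in a config file.
# Usage: set_recovery_flag <path-to-fe.conf> on|off
# Assumption: the setting is a plain key=value line in fe.conf.
set_recovery_flag() {
    local conf="$1" mode="$2" tmp
    tmp="$(mktemp)"
    # Drop any existing line for the setting so it never appears twice.
    grep -v '^[[:space:]]*metadata_failure_recovery[[:space:]]*=' "$conf" > "$tmp" || true
    if [ "$mode" = "on" ]; then
        echo 'metadata_failure_recovery=true' >> "$tmp"
    fi
    # Write back in place so the file keeps its owner and permissions.
    cat "$tmp" > "$conf" && rm -f "$tmp"
}
```

On each FE host, something like latest_image_id /home/ctgcdt/data/doris-meta prints a number to compare across nodes. On the chosen node, set_recovery_flag with on goes before the first start, and with off before the following restart.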
List all FEs
show frontends;
ALTER SYSTEM DROP FOLLOWER "192.168.21.4:9010";
sh start_fe.sh --helper 192.168.21.5:9010 --daemon
ALTER SYSTEM ADD FOLLOWER "192.168.21.4:9010";