Restoring a Failed Member in InnoDB Cluster

For a MySQL DBA it is a common scenario that one of the MySQL InnoDB Cluster nodes gets corrupted. In that case we need to restore the cluster node from a mysqlbackup backup. Usually the backup in a MySQL InnoDB Cluster is taken from one of the nodes in the cluster; if the corrupted node and the backup node are the same, you can simply restore the backup and start group replication.

But if the backup was not taken from the corrupted node, we need to take some additional steps, and that is the scenario discussed in this blog. Assume that we have three MySQL nodes: mysqlhost01, mysqlhost02 and mysqlhost03. The node that got corrupted is mysqlhost03, and the backup was taken from mysqlhost02. We will follow the steps below to restore the failed node mysqlhost03:

1) Move the MySQL backup from mysqlhost02 to mysqlhost03.
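For example, a minimal sketch of copying the backup image over SSH, assuming the image is /backups/my.mbi_2206_1429 on mysqlhost02 and the same directory exists on mysqlhost03 (paths are illustrative):

# On mysqlhost02: copy the backup image to the failed node
scp /backups/my.mbi_2206_1429 mysqlhost03:/backups/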

2) Change the parameter group_replication_start_on_boot to false.
Set group_replication_start_on_boot to false so that Group Replication does not start automatically after the MySQL restart.

Connect to the mysqlhost03 server:
mysql>select * from performance_schema.persisted_variables where variable_name like 'group_replication_start_on_boot%';
mysql>set persist group_replication_start_on_boot=false;

3) Stop MySQL on node mysqlhost03 and take a backup of the existing auto.cnf and mysqld-auto.cnf files. This is required to preserve the UUID of mysqlhost03. A sketch of this step follows.
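A minimal sketch of this step, assuming the data directory is /var/lib/mysql and using /root/restore_backup as a scratch location (both paths are assumptions, adjust to your environment):

# On mysqlhost03: stop MySQL and preserve the server UUID and persisted variables
systemctl stop mysqld
mkdir -p /root/restore_backup
cp -p /var/lib/mysql/auto.cnf /var/lib/mysql/mysqld-auto.cnf /root/restore_backup/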

4) Remove all the data files and other database-related files.
Check the variables innodb_data_home_dir, innodb_log_group_home_dir, and innodb_undo_directory and make sure there are no files left in these directories, otherwise the backup restoration might fail. A sketch of this step follows.
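A minimal sketch of this check and cleanup, assuming the default /var/lib/mysql layout (the query can be run before MySQL is stopped, or the same values can be read from /etc/my.cnf):

mysql> SHOW VARIABLES WHERE Variable_name IN ('innodb_data_home_dir','innodb_log_group_home_dir','innodb_undo_directory');

# With MySQL stopped, clear the old files so the restore starts from empty directories
rm -rf /var/lib/mysql/*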

5) Restore the backup taken from mysqlhost02 onto mysqlhost03 (using Restore MySQL InnoDB Cluster from mysqlbackup):

mysqlbackup --defaults-file=/etc/my.cnf \
--datadir=/var/lib/mysql \
--backup-image=/backups/my.mbi_2206_1429  \
--backup-dir=/tmp/restore_`date +%d%m_%H%M` copy-back-and-apply-log

The command above assumes that the binary logs and relay logs on mysqlhost02 and mysqlhost03 have the same base name and are at the same location on the two servers. If these conditions are not met, you should use the --log-bin and --relay-log options to restore the binary log and relay log to their original file paths on mysqlhost03. For example, if you know that on mysqlhost03 the binary log's base name is mysqlhost03-bin and the relay log's base name is mysqlhost03-relay-bin, your restore command should look like:
mysqlbackup --defaults-file=/etc/my.cnf \
  --datadir=/var/lib/mysql \
  --backup-image=/backups/my.mbi_***_****  \
  --log-bin=mysqlhost03-bin --relay-log=mysqlhost03-relay-bin \
  --backup-dir=/tmp/restore_`date +%d%m_%H%M` copy-back-and-apply-log


6) Restore the auto.cnf and mysqld-auto.cnf files backed up in step 3.
For the restored node mysqlhost03 to rejoin the group automatically, its UUID and other system variables must be the same as they were before the restore. To achieve this, copy the auto.cnf and mysqld-auto.cnf files back into the data directory, as sketched below.
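A minimal sketch, assuming the files were saved to /root/restore_backup in step 3 and the data directory is /var/lib/mysql:

cp -p /root/restore_backup/auto.cnf /root/restore_backup/mysqld-auto.cnf /var/lib/mysql/
chown mysql:mysql /var/lib/mysql/auto.cnf /var/lib/mysql/mysqld-auto.cnf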

7) Start MySQL on mysqlhost03 and verify the gtid_executed value.
# systemctl start mysqld

Check the value of gtid_executed on the restored server and the GTID set recorded in the backup_gtid_executed.sql file. These two values must be the same in order to start replication from the correct GTID. An example check is shown below.
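A minimal sketch of the comparison; backup_gtid_executed.sql is written by mysqlbackup into the meta directory of the backup, and the restore directory shown here is illustrative:

mysql> SELECT @@GLOBAL.gtid_executed;

# Compare with the GTID set recorded at backup time
cat /tmp/restore_*/meta/backup_gtid_executed.sql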

If these values are not the same, then change the gtid_executed value to match the value in backup_gtid_executed.sql:

mysql> SET SQL_LOG_BIN=OFF;
mysql> SOURCE backup_gtid_executed.sql
mysql> SET SQL_LOG_BIN=ON;


8) Change the value of the parameter group_replication_start_on_boot to true and restart MySQL.

Set group_replication_start_on_boot to true so that Group Replication starts automatically after the MySQL restart.

Connect to the mysqlhost03 server:
mysql>select * from performance_schema.persisted_variables where variable_name like 'group_replication_start_on_boot%';

mysql>set persist group_replication_start_on_boot=true;

Restart MySQL on mysqlhost03:

# systemctl restart mysqld

9) Check the status of group and group replication. 

Initially, mysqlhost03 will show as RECOVERING in both the Group Replication status and the cluster status, because it is applying all the changes from the gtid_executed value onwards.

In order to check member status of the restored instance, issue:

mysql> SELECT member_host, member_port, member_state FROM performance_schema.replication_group_members;
+-------------+-------------+--------------+
| member_host | member_port | member_state |
+-------------+-------------+--------------+
| mysqlhost03 |        3306 | RECOVERING   |
| mysqlhost02 |        3306 | ONLINE       |
| mysqlhost01 |        3306 | ONLINE       |
+-------------+-------------+--------------+

This shows that mysqlhost03 is applying transactions to catch up with the group. Once it has caught up with the rest of the group, its member_state changes to ONLINE:

mysql> SELECT member_host, member_port, member_state FROM performance_schema.replication_group_members;
+-------------+-------------+--------------+
| member_host | member_port | member_state |
+-------------+-------------+--------------+
| mysqlhost03 |        3306 | ONLINE       |
| mysqlhost02 |        3306 | ONLINE       |
| mysqlhost01 |        3306 | ONLINE       |
+-------------+-------------+--------------+

Check cluster status:
mysqlsh>var cluster = dba.getCluster()
mysqlsh>cluster.status()



