# DRBD: Synchronization of sites after the failed DR (disaster recovery) site is recovered

December 22, 2016

This is a follow-up to the earlier DRBD replication posts, covering how to resynchronize the sites after a failed DR site is recovered.

More than a year ago I described in a long tutorial how to use DRBD in a real production environment and the challenges associated with it. See: DRBD based disk replication of a production cluster to a remote cluster on RHEL6

Sometimes the DR site, where the slave DRBD service receives the replication data, becomes unavailable. This can happen because of a fatal failure of the DR site or because of a communication failure between the sites.

In both cases the DR site is considered failed, and after a time-out the DRBD master node will mark the slave node as unavailable.

Once the split-brain is detected on the PR (production) node, /proc/drbd will show something like this:
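
An illustrative `cat /proc/drbd` status line; the resource number and the trailing flags are placeholders, only the cs/ro/ds fields matter here:

```
 0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown   r-----
```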

Where:
– cs:StandAlone = the PR node detected the split-brain and is standing alone
– ro:Primary/Unknown = the PR node knows it is the primary
– ds:UpToDate/DUnknown = the PR node is up to date and has no information about the DR node

On the DR site we have something like this:
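
Again an illustrative `cat /proc/drbd` status line, with a placeholder resource number and flags:

```
 0: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown C r-----
```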

Where:
– cs:WFConnection = the DR node was not able to detect the split-brain before the connection to the PR was severed, so it keeps waiting for a connection
– ro:Secondary/Unknown = the DR node knows it is the secondary
– ds:UpToDate/DUnknown = the DR node is up to date (in its own opinion) and has no information about the PR node

To resynchronize the sites once the DR site is recovered, the following steps must be performed.

Make sure that the cluster manager is stopped on the DR site by executing:
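
A minimal sketch, assuming the cman/rgmanager stack from the original RHEL6 tutorial (run on both DR nodes):

```
# stop the resource group manager so it does not interfere with the manual steps below
service rgmanager stop
service rgmanager status   # should report that rgmanager is stopped
```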

Add the DR database cluster IP as a resource on the first cluster node on DR:
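
With the cluster manager stopped, the IP can be brought up by hand; the address, prefix and interface below are placeholders, not values from the original setup:

```
# add the DR database cluster IP manually (replace address, prefix and interface with your own)
ip addr add 192.168.10.50/24 dev eth0
```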

Activate the volume group vg_data on the first cluster node on DR:
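
For example (vg_data is the volume group name used throughout the original tutorial):

```
# activate the volume group on the first DR cluster node
vgchange -a y vg_data
```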

Activate the logical volume lv_data on the first cluster node on DR:
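
Similarly for the logical volume:

```
# activate the logical volume used by the DRBD resource
lvchange -a y vg_data/lv_data
```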

Note: In case we cannot see or activate vg_data/lv_data (error: “Not activating vg_data/lv_data since it does not pass activation filter.”) we can force the rediscovery of the VG resource.
Edit lvm.conf and, under the volume_list directive, add the entry “vg_data”, then re-execute the above commands.
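
A sketch of the lvm.conf change; the other volume_list entries below are only examples of what may already be present on an HA-LVM node, so keep whatever your node already has and simply append vg_data:

```
# /etc/lvm/lvm.conf (the existing entries are placeholders)
volume_list = [ "vg_root", "@node1-dr", "vg_data" ]
```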

Start the drbd service on the first cluster node on DR. DRBD will start as a slave node, the same way it was running before the incident:
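
On RHEL6 the init script can be used; checking /proc/drbd right after should show this node coming up as Secondary:

```
service drbd start
cat /proc/drbd   # should report ro:Secondary/... on this node
```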

Wait for the two sites to synchronize. The synchronization process can be followed by monitoring the drbd proc device:
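
For example, refreshing the status every second:

```
# the resync is finished when ds: shows UpToDate/UpToDate and cs: shows Connected
watch -n1 cat /proc/drbd
```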

When the sites are synchronized, stop the drbd service:
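
Again via the init script on the same DR node:

```
service drbd stop
```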

Deactivate the logical volume lv_data on the first cluster node on DR:
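
For example:

```
lvchange -a n vg_data/lv_data
```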

Deactivate the volume group vg_data on the first cluster node on DR:
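
And then:

```
vgchange -a n vg_data
```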

Remove the DR database cluster IP resource from the first cluster node on DR:
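
Using the same placeholder address and interface as when the IP was added:

```
ip addr del 192.168.10.50/24 dev eth0
```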

Start the cluster manager on both cluster nodes of the DR site:
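
Again assuming rgmanager as the cluster manager, on both DR nodes:

```
service rgmanager start
```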

Check if the DRBD_Slave_Service is started on the DR cluster.
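
For example with clustat (the service name comes from the original tutorial):

```
clustat | grep DRBD_Slave_Service
```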

If the DRBD_Slave_Service is not started on the DR cluster, force-start it:
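
A sketch using rgmanager's clusvcadm:

```
# enable (start) the service; add "-m <node>" to start it on a specific member
clusvcadm -e DRBD_Slave_Service
```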

Note:
In this case we changed the lvm.conf file, so it no longer matches the copy embedded in the initrd.
If the node is rebooted we are going to hit the “HA LVM: Improper setup detected” issue when we want to start any cluster service on this node.

This issue is described by Red Hat Solution 21622.
Basically the solution is to regenerate the initrd image for the current kernel on this node.

Make a backup of the image:
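
For example:

```
cp /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
```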

Now rebuild the initramfs for the current kernel version:
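
On RHEL6 this is done with dracut:

```
dracut -f /boot/initramfs-$(uname -r).img $(uname -r)
```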

Update:
There is a case where, during the split-brain recovery, the connection between the sites dies again.
In that situation the PR site will look like this:
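
An illustrative /proc/drbd status line (resource number and flags are placeholders):

```
 0: cs:Timeout ro:Primary/Unknown ds:UpToDate/DUnknown   r-----
```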

Where:
– cs:Timeout = the PR node tried to resynchronize the DR site but the operation timed out
– ro:Primary/Unknown = the PR node knows it is the primary
– ds:UpToDate/DUnknown = the PR node is up to date and has no information about the DR node

In this case we first have to make sure that the connection between the sites is OK, and after that simply restart the drbd service on the PR site:
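
Once the link is confirmed to be up again:

```
service drbd restart
```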

After the restart the PR site will automatically try to resynchronize:
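
An illustrative /proc/drbd status on the PR node during the resync (the percentage is a placeholder):

```
 0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
    [=====>..............] sync'ed: 28.6%
```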

On the DR site we can also see the progress:
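
Illustrative output on the DR node, which is the target of the resync (again, the percentage is a placeholder):

```
 0: cs:SyncTarget ro:Secondary/Primary ds:Inconsistent/UpToDate C r-----
    [=====>..............] sync'ed: 28.6%
```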
