# DRBD: Synchronization of sites after the failed DR (disaster recovery) site is recovered

December 22, 2016

This is a follow-up to the earlier DRBD replication posts, covering how to resynchronize the sites after a failed DR site is recovered.

More than a year ago I described in a long tutorial how to use DRBD in a real production environment and the challenges associated with it. See: DRBD based disk replication of a production cluster to a remote cluster on RHEL6

Sometimes the DR site, where the slave DRBD service receives the replication data, becomes unavailable. This can happen because of a fatal failure of the DR site or because of a communication failure between the sites.

In both cases the DR site is considered failed, and after a time-out the DRBD master node will mark the slave node as unavailable.

Once the split-brain is detected on the PR (production) node, /proc/drbd will show something like this:
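
An illustrative `cat /proc/drbd` status line; the resource number and the trailing flags are placeholders, only the cs/ro/ds fields matter here:

```
 0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown   r-----
```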

Where:
– cs:StandAlone = the PR node detected the split-brain and is standing alone
– ro:Primary/Unknown = the PR node knows it is the primary
– ds:UpToDate/DUnknown = the PR node is up to date and has no information about the DR node

On the DR site we have something like this:
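
Again an illustrative `cat /proc/drbd` status line, with a placeholder resource number and flags:

```
 0: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown C r-----
```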

Where:
– cs:WFConnection = the DR node was not able to detect the split-brain before the connection to the PR was severed, so it keeps waiting for a connection
– ro:Secondary/Unknown = the DR node knows it is the secondary
– ds:UpToDate/DUnknown = the DR node is up to date (in its own opinion) and has no information about the PR node

To resynchronize the sites once the DR site is recovered, the following steps must be performed.

Make sure that the cluster manager is stopped on the DR site by executing:
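
A minimal sketch, assuming the cman/rgmanager stack from the original RHEL6 tutorial (run on both DR nodes):

```
# stop the resource group manager so it does not interfere with the manual steps below
service rgmanager stop
service rgmanager status   # should report that rgmanager is stopped
```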

Add the DR database cluster IP as a resource on the first cluster node on DR:
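
With the cluster manager stopped, the IP can be brought up by hand; the address, prefix and interface below are placeholders, not values from the original setup:

```
# add the DR database cluster IP manually (replace address, prefix and interface with your own)
ip addr add 192.168.10.50/24 dev eth0
```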

Activate the volume group vg_data on the first cluster node on DR:
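
For example (vg_data is the volume group name used throughout the original tutorial):

```
# activate the volume group on the first DR cluster node
vgchange -a y vg_data
```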

Activate the logical volume lv_data on the first cluster node on DR:
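
Similarly for the logical volume:

```
# activate the logical volume used by the DRBD resource
lvchange -a y vg_data/lv_data
```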

Note: In case we cannot see or activate vg_data/lv_data (error: “Not activating vg_data/lv_data since it does not pass activation filter.”) we can force the rediscovery of the VG resource.
Edit lvm.conf and, under the volume_list directive, add the entry “vg_data”, then re-execute the above commands.
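
A sketch of the lvm.conf change; the other volume_list entries below are only examples of what may already be present on an HA-LVM node, so keep whatever your node already has and simply append vg_data:

```
# /etc/lvm/lvm.conf (the existing entries are placeholders)
volume_list = [ "vg_root", "@node1-dr", "vg_data" ]
```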

Start the drbd service on the first cluster node on DR. DRBD will start as a slave node, the same way it was running before the incident:
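
On RHEL6 the init script can be used; checking /proc/drbd right after should show this node coming up as Secondary:

```
service drbd start
cat /proc/drbd   # should report ro:Secondary/... on this node
```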

Wait for the two sites to synchronize. The synchronization process can be followed by monitoring the drbd proc device:
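
For example, refreshing the status every second:

```
# the resync is finished when ds: shows UpToDate/UpToDate and cs: shows Connected
watch -n1 cat /proc/drbd
```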

When the sites are synchronized, stop the drbd service:
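
Again via the init script on the same DR node:

```
service drbd stop
```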

Deactivate the logical volume lv_data on the first cluster node on DR:
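
For example:

```
lvchange -a n vg_data/lv_data
```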

Deactivate the volume group vg_data on the first cluster node on DR:
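
And then:

```
vgchange -a n vg_data
```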

Remove the DR database cluster IP resource from the first cluster node on DR:
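
Using the same placeholder address and interface as when the IP was added:

```
ip addr del 192.168.10.50/24 dev eth0
```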

Start the cluster manager on both cluster nodes of the DR site:
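
Again assuming rgmanager as the cluster manager, on both DR nodes:

```
service rgmanager start
```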

Check if the DRBD_Slave_Service is started on the DR cluster.
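
For example with clustat (the service name comes from the original tutorial):

```
clustat | grep DRBD_Slave_Service
```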

If the DRBD_Slave_Service is not started on the DR cluster, force-start it:
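
A sketch using rgmanager's clusvcadm:

```
# enable (start) the service; add "-m <node>" to start it on a specific member
clusvcadm -e DRBD_Slave_Service
```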

Note:
In this case we changed the lvm.conf file, so it no longer matches the copy embedded in the initrd.
If the node is rebooted we are going to hit the “HA LVM: Improper setup detected” issue when we want to start any cluster service on this node.

This issue is described by Red Hat Solution 21622.
Basically the solution is to regenerate the initrd image for the current kernel on this node.

Make a backup of the image:
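
For example:

```
cp /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
```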

Now rebuild the initramfs for the current kernel version:
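
On RHEL6 this is done with dracut:

```
dracut -f /boot/initramfs-$(uname -r).img $(uname -r)
```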

Update:
There is a case where, during the split-brain recovery, the connection between the sites dies again.
In that situation the PR site will look like this:
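
An illustrative /proc/drbd status line (resource number and flags are placeholders):

```
 0: cs:Timeout ro:Primary/Unknown ds:UpToDate/DUnknown   r-----
```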

Where:
– cs:Timeout = the PR node tried to resynchronize the DR site but the operation timed out
– ro:Primary/Unknown = the PR node knows it is the primary
– ds:UpToDate/DUnknown = the PR node is up to date and has no information about the DR node

In this case we first have to make sure that the connection between the sites is OK, and after that simply restart the drbd service on the PR site:
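
Once the link is confirmed to be up again:

```
service drbd restart
```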

After the restart the PR site will automatically try to resynchronize:
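
An illustrative /proc/drbd status on the PR node during the resync (the percentage is a placeholder):

```
 0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
    [=====>..............] sync'ed: 28.6%
```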

On the DR site we can also see the progress:
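
Illustrative output on the DR node, which is the target of the resync (again, the percentage is a placeholder):

```
 0: cs:SyncTarget ro:Secondary/Primary ds:Inconsistent/UpToDate C r-----
    [=====>..............] sync'ed: 28.6%
```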
