Voina Blog (a tech warrior's blog) #DRBD based disk replication of a production cluster to a remote site cluster on RHEL 6

1 General Considerations

Our example enterprise applications run on a Linux Cluster with a shared cluster storage resource. Having this HA setup ensures that we have a high rate of service availability on the production site.
To ensure that the disruption time in service of our enterprise application is minimal as possible the best solution is to have an identical environment on a remote site. The sites will be kept in a mirror state by using site to site data replication using DRBD. To ensure that the data replication is not decreasing the performance of the application systems even if there is a distance between sites the replication of data is done in an asynchronous way using DRBD replication.

Due to the fact that our applications use exclusively the database as the storage place for all the data we will have to replicate only the database related files, that are stored on both cluster sites under the logical volumes /var/vg_data/lv_data. The same configuration can be done also if we want to replicate also the application files or an entire virtual appliance.

2 DRBD configuration

2.1 Prerequisites

There are several prerequisites that must be met before replication is configured between the sites:

1. Install the drbd and kmod-drbd packages. Make sure the latest version for RedHat Enterprise Linux 6 are installed.
2. Make sure the routes are added such as the servers from the two sites are visible to each other.
3. Make sure the logical volumes /var/vg_data/lv_data are correctly created on both production and remote sites. To be correctly created one must ensure that only 2/3 rds of the vg_data volume group space is allocated to the lv_data logical volume. The rest of the vg_data space will be used by the drdb mechanism of replication to store temporary replication data.
4. Make a backup of the data from the /data directory (the directory where /var/vg_data/lv_data is mounted )
5. Remove the logical volume lv_data on both production and remote
6. Recreate the logical volume lv_dat on both production and remote

# lvcreate -n lv_data --size 100G vg_data

7. Activate the logical volume lv_data

# lvscan
# lvchange -a y /dev/vg_data/lv_data

3 Replication configurations on server nodes

There are several steps that must be performed to correctly configure the replication on both sites.

Make sure that on both sites, on both nodes (on 4 servers in total two on each site), the /etc/drbd.conf looks like the following:
# You can find an example in /usr/share/doc/drbd…/drbd.conf.example

include "drbd.d/global_common.conf";
include "drbd.d/*.res";

Make sure that on both sites, on both nodes (on 4 servers in total two on each site), the /etc/drbd.d/global_common.conf looks like the following:

global {
usage-count no;
# minor-count dialog-refresh disable-ip-verification
}
common {
handlers {
pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
# fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
split-brain "/usr/lib/drbd/notify-split-brain.sh root";
out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 35 -- -c 16k";
after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
}
startup {
wfc-timeout 10;
degr-wfc-timeout 10;
# wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
}
options {

# cpu-mask on-no-data-accessible
}

disk {
disk-barrier no;
disk-flushes no;
disk-drain yes;
md-flushes no;
#resync rate to be increased after network is fixed
resync-rate 40M;
# size max-bio-bvecs on-io-error fencing disk-barrier disk-flushes
# disk-drain md-flushes resync-rate resync-after al-extents
# c-plan-ahead c-delay-target c-fill-target c-max-rate
# c-min-rate disk-timeout
}

net {
protocol C;
max-buffers 8000;
max-epoch-size 8000;
sndbuf-size 0;
cram-hmac-alg sha1;
csums-alg sha1;
verify-alg sha1;
# protocol timeout max-epoch-size max-buffers unplug-watermark
# connect-int ping-int sndbuf-size rcvbuf-size ko-count
# allow-two-primaries cram-hmac-alg shared-secret after-sb-0pri
# after-sb-1pri after-sb-2pri always-asbp rr-conflict
# ping-timeout data-integrity-alg tcp-cork on-congestion
# congestion-fill congestion-extents csums-alg verify-alg
# use-rle
}
}

Make sure that on both sites, on both nodes (on 4 servers in total two on each site), the /etc/drbd.d/repdata.res looks like the following:

resource repdata {
device /dev/drbd0;
disk /dev/vg_data/lv_data;
meta-disk internal;

floating 172.20.101.3:7788;
floating 172.20.101.19:7788;

net {
shared-secret "43ry38nfmx7nx7m23eedSWE";
}
}

Change on both nodes on each site /usr/lib/drbd/snapshot-resync-target-lvm.sh to include lvm tagging for the snapshot volume created during resync. The changed line is :

lvcreate -s --addtag `hostname` -n $SNAP_NAME -L ${SNAP_SIZE}k $LVC_OPTIONS $VG_NAME/$LV_NAME

The above defines the replication resource repdata. Note that on both sites the device /dev/drbd0 will be created over the logical volume /dev/vg_data/lv_data. There are two floating IP resources added to this replication resource, which represent the end points of the replication. The IPs used by the repdata resource are the database cluster IPs from the PR and DR sites.

4 Replication configurations on clusters

There are certain configurations changes that are one to the clusters configuration in order to accommodate the replication of data between sites.

4.1 Replication Global Resources on production cluster

Add a resource of type “Script” with the properties:
Name DRBDSlave Full path to script file: /etc/init.d/drbd
Add a resource of type “Script” with the properties:
Name DRBDMaster Full path to script file: /etc/init.d/master
The above scripts are from the drbd and master are in fact the same script from drbd resources:

#!/bin/bash
#
# chkconfig: - 70 08
# description: Loads and unloads the drbd module
#
# Copyright 2001-2010 LINBIT
#
# Philipp Reisner, Lars Ellenberg
#
### BEGIN INIT INFO
# Provides: drbd
# Required-Start: $local_fs $network $syslog
# Required-Stop: $local_fs $network $syslog
# Should-Start: sshd multipathd
# Should-Stop: sshd multipathd
# Default-Start: 2 3 4 5
# Default-Stop: 0 1 6
# X-Start-Before: heartbeat corosync
# X-Stop-After: heartbeat corosync
# X-Interactive: true
# Short-Description: Control drbd resources.
### END INIT INFO

DEFAULTFILE="/etc/default/drbd"
DRBDADM="/sbin/drbdadm"
DRBDSETUP="/sbin/drbdsetup"
PROC_DRBD="/proc/drbd"
MODPROBE="/sbin/modprobe"
RMMOD="/sbin/rmmod"
UDEV_TIMEOUT=10
ADD_MOD_PARAM=""

if [ -f $DEFAULTFILE ]; then
. $DEFAULTFILE
fi

test -f $DRBDADM || exit 5

# we only use these two functions, define fallback versions of them ...
log_daemon_msg() { echo -n "${1:-}: ${2:-}"; }
log_end_msg() { echo "."; }
# ... and let the lsb override them, if it thinks it knows better.
if [ -f /lib/lsb/init-functions ]; then
. /lib/lsb/init-functions
fi

function assure_module_is_loaded
{
[ -e "$PROC_DRBD" ] && return

$MODPROBE -s drbd $ADD_MOD_PARAM || {
echo "Can not load the drbd module."$'\n'; exit 20
}
# tell klogd to reload module symbol information ...
[ -e /var/run/klogd.pid ] && [ -x /sbin/klogd ] && /sbin/klogd -i
}

drbd_pretty_status()
{
local proc_drbd=$1
# add resource names
if ! type column &> /dev/null ||
! type paste &> /dev/null ||
! type join &> /dev/null ||
! type sed &> /dev/null ||
! type tr &> /dev/null
then
cat "$proc_drbd"
return
fi
sed -e '2q' < "$proc_drbd"
sed_script=$(
i=0;
_sh_status_process() {
let i++ ;
stacked=${_stacked_on:+"^^${_stacked_on_minor:-${_stacked_on//[!a-zA-Z0-9_ -]/_}}"}
printf "s|^ *%u:|%6u\t&%s%s|\n" \
$_minor $i \
"${_res_name//[!a-zA-Z0-9_ -]/_}" "$stacked"
};
eval "$(drbdadm sh-status)" )

p() {
sed -e "1,2d" \
-e "$sed_script" \
-e '/^ *[0-9]\+: cs:Unconfigured/d;' \
-e 's/^\(.* cs:.*[^ ]\) \([rs]...\)$/\1 - \2/g' \
-e 's/^\(.* \)cs:\([^ ]* \)st:\([^ ]* \)ds:\([^ ]*\)/\1\2\3\4/' \
-e 's/^\(.* \)cs:\([^ ]* \)ro:\([^ ]* \)ds:\([^ ]*\)/\1\2\3\4/' \
-e 's/^\(.* \)cs:\([^ ]*\)$/\1\2/' \
-e 's/^ *[0-9]\+:/ x &??not-found??/;' \
-e '/^$/d;/ns:.*nr:.*dw:/d;/resync:/d;/act_log:/d;' \
-e 's/^\(.\[.*\)\(sync.ed:\)/... ... \2/;/^.finish:/d;' \
-e 's/^\(.[0-9 %]*oos:\)/... ... \1/' \
< "$proc_drbd" | tr -s '\t ' ' '
}
m() {
join -1 2 -2 1 -o 1.1,2.2,2.3 \
<( ( drbdadm sh-dev all ; drbdadm -S sh-dev all ) | cat -n | sort -k2,2) \
<(sort < /proc/mounts ) |
sort -n | tr -s '\t ' ' ' | sed -e 's/^ *//'
}
# echo "=== p ==="
# p
# echo "=== m ==="
# m
# echo "========="
# join -a1 <(p|sort) <(m|sort)
# echo "========="
(
echo m:res cs ro ds p mounted fstype
join -a1 <(p|sort) <(m|sort) | cut -d' ' -f2-6,8- | sort -k1,1n -k2,2 ) | column -t } # Try to settle regardless of udev version or presence, # so "/etc/init.d/drbd stop" is able to rmmod, without interfering # temporary module references caused by udev scanning the devices. # But don't wait too long. _udev_settle() { if udevadm version ; then # ok, we have udevadm, use it. udevadm settle --timeout=5 else # if udevsettle is not there, # no matter. udevsettle --timeout=5 fi } case "$1" in start) # Just in case drbdadm want to display any errors in the configuration # file, or we need to ask the user about registering this installation # at http://usage.drbd.org, we call drbdadm here without any IO # redirection. # If "no op" has a non-zero exit code, the config is unusable, # and every other command will fail. log_daemon_msg "Starting DRBD resources" if ! out=$($DRBDADM sh-nop 2>&1) ; then
printf "\n%s\n" "$out" >&2
log_end_msg 1
exit 1
fi
assure_module_is_loaded

$DRBDADM adjust-with-progress all
[[ $? -gt 1 ]] && exit 20

# make sure udev has time to create the device files
# FIXME this probably should, on platforms that have it,
# use udevadm settle --timeout=X --exit-if-exists=$DEVICE
for DEVICE in `$DRBDADM sh-dev all`; do
UDEV_TIMEOUT_LOCAL=$UDEV_TIMEOUT
while [ ! -e $DEVICE ] && [ $UDEV_TIMEOUT_LOCAL -gt 0 ] ; do
sleep 1
UDEV_TIMEOUT_LOCAL=$(( $UDEV_TIMEOUT_LOCAL-1 ))
done
done

[ -d /var/lock/subsys ] && touch /var/lock/subsys/drbd # for RedHat
$DRBDADM wait-con-int # User interruptible version of wait-connect all

$DRBDADM sh-b-pri all # Become primary if configured
log_end_msg 0
;;
stop)
$DRBDADM sh-nop
log_daemon_msg "Stopping all DRBD resources"
for try in 1 2; do
if [ -e $PROC_DRBD ] ; then
[[ $try = 2 ]] && echo "Retrying once..."
# bypass drbdadm and drbd config file and everything,
# to avoid leaving devices around that are not referenced by
# the current config file, or in case the current config file
# does not parse for some reason.
for d in /dev/drbd* ; do
[ -L "$d" ] && continue
[ -b "$d" ] || continue
M=$(umount "$d" 2>&1)
case $M in
*" not mounted") :;;
*) echo "$M" >&2 ;;
esac
done
for res in $(drbdsetup all show | sed -ne 's/^resource \(.*\) {$/\1/p'); do
drbdsetup "$res" down
done
_udev_settle &> /dev/null
$RMMOD drbd && break
fi
done
[ -f /var/lock/subsys/drbd ] && rm /var/lock/subsys/drbd
log_end_msg 0
;;
status)
# NEEDS to be heartbeat friendly...
# so: put some "OK" in the output.
if [ -e $PROC_DRBD ]; then
echo "drbd driver loaded OK; device status:"
drbd_pretty_status $PROC_DRBD 2>/dev/null
exit 0
else
echo >&2 "drbd not loaded"
exit 3
fi
;;
reload)
$DRBDADM sh-nop
log_daemon_msg "Reloading DRBD configuration"
$DRBDADM adjust all
log_end_msg 0
;;
restart|force-reload)
( . $0 stop )
( . $0 start )
;;
*)
echo "Usage: /etc/init.d/drbd {start|stop|status|reload|restart|force-reload}"
exit 1
;;
esac

exit 0

4.2 Replication Service Resources on cluster
Go to “Service Groups” and add a new service with the following properties:
Service Name: DRBD_Slave_Service Automatically Start This Service: off Failover Domain: DATABASE Recovery Policy: Relocate Add in place the following resources:
Replication cluster service virtual IP
To add the resource to the cluster go to add resource menu and add the resource as an “IP Address” resource type with properties:
IP Address: 172.20.101.3/28 Netmask Bits: 28 Monitor Link: on Number of seconds to sleep after removing an IP address: 10
As a child resource add the following resource:

Replication data logical volume
An DBDataLVM_Slave resource of type HA LVM will be added having the following properties:
Volume Group Name: vg_data Logical Volume Name: lv_data
As a child resource add the global resource DRBDSlave.

4.3 Replication related changes to the Database Cluster Service on production cluster

To accommodate site replication the Database_Service cluster service (see Service Groups) must be changed to include the DRBD scripts.

Delete the Database_Service cluster service and recreate it as following:

Go to “Service Groups” and add a new service with the following properties:
Service Name: Database_Service Automatically Start This Service: on Failover Domain: DATABASE Recovery Policy: Relocate
Make sure to Submit changes.

Add in the following order the following resources to the service by using “Add Resource” button. The order in which the resources are added is very important, this is the order in which they are initialized at service start. When the service is stopped the resources are stopped in the reverse order. There are dependencies between resources so first we have to add the resources which are independent and the the ones which depend on availability of other resources.

1. Add “172.20.101.19” resource
2. Add as a child resource of the previous resource “DBLogLVM” resource
3. Add as a child resource of the previous resource “DBLog” resource
4. Add as a child resource of the previous resource “DBDataLVM” resource
5. Add as a child resource of the previous resource “DRBDMaster” resource
6. Add as a child resource of the previous resource “DBData” resource
7. Add as a child resource of the previous resource “DB11g” resource

4.4 Replication Global Resources on remote cluster

Add a resource of type “Script” with the properties:
Name DRBDSlave Full path to script file: /etc/init.d/drbd
Add a resource of type “Script” with the properties:
Name DRBDMaster Full path to script file: /etc/init.d/master
4.5 Replication Service Resources on remote cluster

Go to “Service Groups” and add a new service with the following properties:
Service Name: DRBD_Slave_Service Automatically Start This Service: off Failover Domain: DATABASE Recovery Policy: Relocate Add in place the following resources:

Replication cluster service virtual IP
To add the resource to the cluster go to add resource menu and add the resource as an “IP Address” resource type with properties:
IP Address: 172.20.101.19/28
Netmask Bits: 28
Monitor Link: on
Number of seconds to sleep after removing an IP address: 10
As a child resource add the following resource:
Replication data logical volume An DBDataLVM_Slave resource of type HA LVM will be added having the following properties: Volume Group Name: vg_data Logical Volume Name: lv_data
As a child resource add the global resource DRBDSlave.

4.6 Replication related changes to the Database Cluster Service on remote cluster

To accommodate site replication the Database_Service cluster service (see Service Groups) must be changed to include the DRBD scripts.

Delete the Database_Service cluster service and recreate it as following:

8. Add “172.20.101.3” resource
9. Add as a child resource of the previous resource “DBLogLVM” resource
10. Add as a child resource of the previous resource “DBLog” resource
11. Add as a child resource of the previous resource “DBDataLVM” resource
12. Add as a child resource of the previous resource “DRBDMaster” resource
13. Add as a child resource of the previous resource “DBData” resource
14. Add as a child resource of the previous resource “DB11g” resource

5 Replication General Set-up

5.1 Synchronizing the sites for the first time

First synchronization must be done in a different way than the final configuration.

Add by hand the production database cluster IP as a resource on first cluster node on production and remote database cluster IP as a resource on first cluster node on remote.

Initialize the meta-data area on disk before starting drbd on both production and remote site. Note that, due to the fact the replication resource repdata is a cluster resource, this operation must be done only on one node of the production and remote clusters.
On both sites execute the following command:

# drbdadm create-md repdata

Start drbd on both first cluster nodes (on which the database cluster IPs were bonded) from production and remote.

#/etc/init.d/drbd start

As you can see , both nodes are secondary, which is normal. we need to decide which node will act as a primary node.To do this we have to initiate the first ‘full sync’ between the two nodes.

On node one on cluster execute the following

# drbdadm -- --overwrite-data-of-peer primary repdata

Wait for the first site synchronization to be done by monitoring the situation on the above node with commanda:

# watch -n 1 cat /proc/drbd

Note that it may take some time as the first time the whole 100G of the lv_data logical volume are going to be replicated from PR to DR sites.

After the synchronization between sites is done, we can now format /dev/drbd0 the new drbd device.

# mkfs.ext4 /dev/drbd0

Mount the drbd device under the /data folder and then copy the backup data of old /data folder into it.

5.2 Replication from live site to the standby site

After the initial site to site synchronization was done we are ready to start the normal replication operations.

On the standby site by starting the DRBD_Slave_Service implicitly the DRBD of the standby site is set-up as the slave DRBD that generates data

On the live site by starting the Database_Service implicitly the DRBD of the live site is set-up as the master DRBD that generates data

It is important to note that always the slave site must be started first and then the master site.

6 Switching from one site to another

When doing a site switch from the live site to the standby site the following steps must be performed:
1. Stop the Application_Service on the live site
2. Check that the sites are in sync. If not wait for the standby site to receive all the replication data.
This can be checked by executing service drbd status on the standby site
3. Stop the Database_Service on the live site
4. Stop the DRBD_Slave_Service on the standby site
5. Start the DRBD_Slave_Service on the live site
6. Start the Database_Service on the standby site
7. Start the Application_Service on the standby site
8. Check that the sites are in sync.
This can be checked by executing service drbd status on the standby site and live site.

The site switch is now complete.

7 Replication Monitoring

At any time the replication status can be easily monitored by running the following command on both sites:

#service drbd status

8 Promoting the DR site as the main site without stopping the PR site
In some situations we have to promote the DR site as the main site without performing a nice site switch as described in paragraph 6
The only situations when this must be performed is in the following cases:
1. The live site experienced a total failure.
2. The live site is completely isolated from the participants.

The following steps must be performed:

1. Ensure that the conditions from the above paragraph are met.
2. Start the Database_Service on the standby site
3. Start the Application_Service on the standby site

9 DRBD split brain recovery

As a result of the DR promotion as the main site as instructed in paragraph 8 after the live site is recovered the so called “split brain” condition will be experienced by the replication mechanism.

DRBD detects split brain at the time connectivity between production and remote sites becomes available again and the nodes exchange the initial DRBD protocol handshake. If DRBD detects that both nodes are (or were at some point, while disconnected) in the primary role, it immediately tears down the replication connection. The tell-tale sign of this is a message like the following appearing in the system log:
Split-Brain detected, dropping connection!

After split brain has been detected, one node will always have the resource in a StandAlone connection state. The other might either also be in the StandAlone state (if both nodes detected the split brain simultaneously), or in WFConnection (if the peer tore down the connection before the other node had a chance to detect split brain).
We are selecting PR site as the node whose modifications will be discarded (this node is referred to as the split brain victim).
The split brain victim needs to be in the connection state of StandAlone or the following commands will return an error. You can ensure it is standalone by issuing on the PR site:

# drbdadm disconnect repdata

Also on the PR site execute the following command to force the site as the secondary replication site (the site receiving the replicated data)

# drbdadm secondary repdata
# drbdadm connect --discard-my-data repdata

On the remote site (the split brain survivor), if its connection state is also StandAlone, you would enter:

# drbdadm connect repdata

You may omit this step if the node is already in the WFConnection state; it will then reconnect automatically.
Upon connection, your split brain victim (PR site) immediately changes its connection state to SyncTarget, and has its modifications overwritten by the remaining primary node. After re-synchronization has completed, the split brain is considered resolved and the two nodes form a fully consistent, redundant replicated storage system again.
At any time the replication status can be easily monitored by running the following command on both sites:

#service drbd status

#DRBD based disk replication of a production cluster to a remote site cluster on RHEL 6

Like this:

Related

2 thoughts on “#DRBD based disk replication of a production cluster to a remote site cluster on RHEL 6”

Leave a ReplyCancel reply

Share this:

Like this:

Related

2 thoughts on “#DRBD based disk replication of a production cluster to a remote site cluster on RHEL 6”

Leave a ReplyCancel reply