#DRBD: Synchronization of sites after the connection between sites is down for a long time

By | August 10, 2017

In a DRBD setup such as the one described in #DRBD based disk replication of a production cluster to a remote site cluster on RHEL 6, there is a special case: the communication line between the sites is down for a longer period of time (days) because of an external technical issue.

The line was taken down in a controlled manner after business hours, so replication was not interrupted suddenly during a peak time. Both the primary and the secondary site report their data as “up-to-date”; the only difference is that the primary shows the “Timeout” status while the secondary shows “WFConnection”.
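As an illustration of how these two pieces of information (connection state and disk state) can be read programmatically, here is a minimal Python sketch. It assumes the DRBD 8.x layout of `/proc/drbd`, where each resource line carries `cs:` (connection state), `ro:` (roles) and `ds:` (disk states) fields. The sample line is made up for the example and is not output from the systems described here; the function name `parse_proc_drbd` is hypothetical.

```python
import re

# One resource line of /proc/drbd in the assumed DRBD 8.x layout, e.g.
#  0: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown C r-----
STATUS_LINE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)\s+ro:(?P<ro>\S+)\s+ds:(?P<ds>\S+)"
)


def parse_proc_drbd(text):
    """Return {minor: {"cs": ..., "local_disk": ..., "peer_disk": ...}}."""
    resources = {}
    for line in text.splitlines():
        match = STATUS_LINE.match(line)
        if match is None:
            continue  # header lines and statistics lines are skipped
        # ds: holds "local/peer" disk states
        local_disk, _, peer_disk = match.group("ds").partition("/")
        resources[int(match.group("minor"))] = {
            "cs": match.group("cs"),
            "local_disk": local_disk,
            "peer_disk": peer_disk,
        }
    return resources


# Illustrative sample only, not taken from a real system.
SAMPLE = " 0: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown C r-----\n"
```

On a real node the text would come from reading `/proc/drbd`, for example `parse_proc_drbd(open("/proc/drbd").read())`.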

The “WFConnection” status on the remote site node indicates that the node is no longer receiving data from the other node and that no split brain was detected. In other words, data was received and confirmed up to a point, and then silence followed. A node in the WFConnection state still assumes that the other party is OK and will resume sending data.

The “Timeout” status on the live site node indicates that replication data was delivered up to a point and that the remote node confirmed receiving it. At some point the live node detected that the connection to the other node was down. After several retries it gave up and changed its status to “Timeout”. This is not a “split brain” situation, because all the data that was delivered is confirmed as received by the remote node.

In this case the solution is to restart the replication service on the node in the “Timeout” status once the communication line between the sites is re-established. This forces it to retry delivering data. There is no need to restart the node in the WFConnection status.
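The decision rule above can be sketched in a few lines of Python. This is only an illustration: the function name `recovery_action` is hypothetical, and the command `service drbd restart` is an assumption about how the replication service is managed on RHEL 6, so adjust it to your own setup. The function returns the command instead of running it, so nothing is restarted by accident.

```python
# Assumed restart command for the DRBD init script on RHEL 6.
RESTART_COMMAND = ["service", "drbd", "restart"]


def recovery_action(connection_state, link_is_up):
    """Return the command to run on this node, or None if nothing is needed.

    connection_state: the node's DRBD connection state, e.g. "Timeout"
    link_is_up: True once the line between the sites is re-established
    """
    if not link_is_up:
        # The restart is only useful after the line is back.
        return None
    if connection_state == "Timeout":
        # The live node gave up retrying; a restart makes it try again.
        return list(RESTART_COMMAND)
    # A node in WFConnection (or any other state) is left alone here.
    return None
```

On the live node, once the line is back, the returned command could be executed as root with something like `subprocess.run(cmd, check=True)`.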