#DRBD investigate and solve a sudden Diskless issue

September 19, 2017

Even when you think you know something well enough, you discover corner cases you have never encountered. This is the main reason I like IT system administration tasks: you never get bored.
This post is about one of these corner cases in the DRBD setup described in the post #DRBD based disk replication of a production cluster to a remote site cluster on RHEL 6.

ISSUE: Suddenly the DRBD setup that I had deployed started to act weird.
Checking the status of the DRBD end points I got:

On primary site:

This is weird: the Primary seems to be OK and up to date, but the secondary site is in a Diskless state.

On secondary site:

By looking at the above outputs one can conclude that:

– The primary and secondary sites have no problem seeing each other. Both are in “Connected” status.
– The primary site is up to date.
– The secondary site is in “Diskless” status.
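As a rough illustration of this kind of check, here is a minimal sketch that pulls the disk states out of /proc/drbd. It assumes DRBD 8.x (the generation used on RHEL 6), where each resource line carries cs:, ro: and ds: fields. The helper names drbd_disk_states and has_diskless are made up for this example, and both read from stdin so they can also be tried on a saved copy of the status.

```shell
#!/bin/sh
# Hypothetical helper: print the ds: (disk state) field of every resource
# line read from stdin, in the /proc/drbd format of DRBD 8.x.
# The value has the form <local state>/<peer state>.
drbd_disk_states() {
    awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^ds:/) print substr($i, 4) }'
}

# Hypothetical helper: succeed if either side of any resource is Diskless.
has_diskless() {
    drbd_disk_states | grep -q 'Diskless'
}

# Typical use on a node (needs the drbd kernel module loaded):
#   has_diskless < /proc/drbd && echo "a Diskless side was found"
```

The first value in the ds: field is always the local node, so the same Diskless state shows up on the left when checked on the affected node and on the right when checked on its peer.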

First, we can conclude that there is nothing wrong with the connection between the sites and that we are not in a “split-brain” situation, which is the usual cause when I run into an exceptional DRBD situation. So I focused on the Diskless status.

According to the definition from the DRBD documentation:

Diskless. No local block device has been assigned to the DRBD driver. This may mean that the resource has never attached to its backing device, that it has been manually detached using drbdadm detach, or that it automatically detached after a lower-level I/O error.

Looking at the definition I concluded that the only possible case is the last one:

resource … automatically detached after a lower-level I/O error

Now that we know it is a low-level error, I started to investigate the lower levels.

On the Secondary site, the /dev/drbd0 device is defined on top of an LVM stack: the lv_data logical volume, which lives in the vg_data volume group. So I checked both LVM layers.

This looks OK.

This is not OK. Obviously, if lv_data is not active, then /dev/drbd0, which is defined over the /dev/vg_data/lv_data device, cannot be initialized. This explains the “Diskless” status on the Secondary site.
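One way to spot an inactive logical volume without reading the whole report is to look at the lv_attr column of lvs, whose fifth character is "a" when the volume is active. A minimal sketch follows; the helper name inactive_lvs is hypothetical, and it reads the report from stdin so it can also be tried on saved output.

```shell
#!/bin/sh
# Hypothetical helper: read "lv_name lv_attr" pairs from stdin and print
# the names of logical volumes whose state flag (5th character of lv_attr)
# is anything other than "a" (active).
inactive_lvs() {
    awk 'NF >= 2 && substr($2, 5, 1) != "a" { print $1 }'
}

# Typical use on the node (run as root):
#   lvs --noheadings -o lv_name,lv_attr vg_data | inactive_lvs
```

If the command prints lv_data, the volume is known to LVM but not activated, which matches the situation described here.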

Let's try to force the activation of lv_data.

This did not work. The result was:

At this point we know that lv_data cannot be activated because some lower-level device on which lv_data is defined cannot be found.
The suspect at this point is one of the partitions over which vg_data is defined. I forgot to mention that vg_data extends over two partitions, each defined on a multipath iSCSI target.

Checking the existing devices on the server I got:

This is not a good sign: I can see only one mpath partition. There should be two, as in this old saved output:

At this point of the investigation we know that DRBD is Diskless on the Secondary site because the underlying lv_data logical volume in the vg_data volume group is not active, and that it is not active because one of the partitions over which vg_data extends is not visible.
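The device check described above can be sketched as a small filter over the entries in /dev/mapper. It assumes the default user_friendly_names naming scheme on RHEL 6, where partition maps look like mpathop1. The helper name mpath_partitions is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: from a list of /dev/mapper entry names on stdin,
# keep only multipath partition maps such as mpathop1.
# Assumes the user_friendly_names scheme: mpath<letters>p<number>.
mpath_partitions() {
    grep -E '^mpath[a-z]+p[0-9]+$'
}

# Typical use on the node; vg_data needs two such partitions:
#   ls /dev/mapper | mpath_partitions
#   ls /dev/mapper | mpath_partitions | wc -l
```

Running multipath -ll alongside this shows whether the multipath device itself is present and only its partition map is missing.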

The only explanation at this point is that the kernel does not have the latest partition table. So let's try a partprobe to refresh it.

OK, so the above confirms that the kernel has a problem refreshing the partition table with the missing partition.
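For reference, a partition table refresh on a multipath device can be sketched as below. This is not a record of the exact commands from the incident: the device path is a placeholder, refresh_partitions and DRY_RUN are names invented for the example, and kpartx is shown only because it is the usual tool for (re)creating partition maps on device-mapper devices. There is no guarantee it would have helped in this case.

```shell
#!/bin/sh
# Hypothetical helper: ask the kernel to re-read the partition table of a
# multipath device, then (re)create its device-mapper partition maps.
# With DRY_RUN set to a non-empty value the commands are printed, not run.
refresh_partitions() {
    dev=$1
    run=${DRY_RUN:+echo}
    $run partprobe "$dev"
    $run kpartx -a -v "$dev"
}

# Preview only; replace the placeholder with the real multipath device:
#   DRY_RUN=1 refresh_partitions /dev/mapper/<multipath-device>
```

The dry-run switch is there so the commands can be reviewed before anything touches a production storage stack.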

The simple, obvious solution is a reboot of the server. What caused the issue? There are lots of possible causes, but I never actually discovered which one it was.

After a reboot of the server /dev/mapper/mpathop1 appeared and lv_data was in active state.

With that, the status of the DRBD end points changed, and I could see on the DRBD Secondary node that the replication had started.

Conclusion: DRBD is tricky and has lots of personality, but it is not always at fault. It is very important that all the underlying layers of the storage cake are also OK.
