Voina Blog (a tech warrior's blog) How to fix a failing disk in #Linux using #smartctl and #hdparm

We all hate the moment when a disk starts to fail.

At some point, one disk in one of my RAID10 arrays went into the failed state.

That should be an immediate concern. A 4-disk RAID10 setup can survive one disk failure, and even two if the failed disks belong to different mirror sets, but a failing disk should still be replaced as soon as possible.

The following procedure recovers your disk and RAID just long enough for the replacement disk or disks to arrive. It only buys you time; remember that your RAID10 is still a limping RAID.

My advice is to always have identical disks in a RAID set, so it may be necessary to replace two disks: the failed one and its mirror partner. Remember that a 4-disk RAID10 is in fact a RAID0 over two RAID1 sets.
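
Before testing anything you need to know which disk the RAID kicked out. Here is a minimal sketch, assuming a Linux software RAID (md) whose state is visible in /proc/mdstat; a hardware controller will need its own tool. The function name failed_members is my own invention. It relies on md marking a failed member with "(F)", as in "sde1[4](F)".

```shell
#!/bin/sh
# failed_members: read /proc/mdstat-style text on stdin and print
# "<array> <member>" for every member flagged as failed with "(F)".
# Assumption: Linux md software RAID; the function name is hypothetical.
failed_members() {
    awk '/^md/ {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /\(F\)$/) {
                name = $i
                sub(/\[.*/, "", name)   # strip the "[slot](F)" suffix
                print $1, name
            }
        }
    }'
}
```

On a live system this would be used as: failed_members < /proc/mdstat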

First identify the failing disk, then run a self-test on it:

# smartctl -t short /dev/sde

This is the short SMART self-test, which usually takes a few minutes. It runs in the background on the drive itself, so the command returns immediately.

If you think that there are several errors on the disk, you can run the longer test instead:

# smartctl -t long /dev/sde

After the test is done you can see the report with the following command:

# smartctl -l selftest /dev/sde

Look for the entries that report an error. They may look like this:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       20%     48904         1915

As we can see, there is a reported read error at sector (LBA) 1915.
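
If the log is long, the failing sector can also be pulled out with a little awk. This is only a sketch, assuming the log has the column layout shown above, where LBA_of_first_error is the last field of each entry. The function name first_error_lbas is made up for this example.

```shell
#!/bin/sh
# first_error_lbas: read `smartctl -l selftest` output on stdin and print
# the LBA_of_first_error of every entry that reports a read failure.
# Assumption: entries start with "# <num>" and the LBA is the last column.
first_error_lbas() {
    awk '/^# *[0-9]+ / && /read failure/ { print $NF }'
}
```

Typical use would be: smartctl -l selftest /dev/sde | first_error_lbas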

Now use another utility, hdparm, to confirm the error:

# hdparm --read-sector 1915 /dev/sde

The result will confirm the error:

/dev/sde:
reading sector 1915: SG_IO: bad/missing sense data, sb[]:  70 00 03 00 00 00 00 0a 40 51 f0 01 11 04 00 00 00 7b 00 00 00 00 00 00 00 00 00 00 00 00 00 00
succeeded
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 00

Now we try to force the disk to relocate the bad sector with the following command:

# hdparm --yes-i-know-what-i-am-doing --repair-sector 1915 /dev/sde

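Note the nice flag "--yes-i-know-what-i-am-doing": with it you confirm that you are OK with possible data corruption. The command overwrites the sector with zeros, which gives the drive the chance to remap it to a spare sector. Losing the contents of one sector is acceptable in a RAID setup, because the correct data still exists on the mirror disk.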

After the command finishes, read the sector again to see whether it was relocated:

# hdparm --read-sector 1915 /dev/sde

The result looks promising:

/dev/sde:
reading sector 1915: succeeded
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
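
If you have more than one sector to check, eyeballing the dumps gets tedious. Below is a rough sketch of a check based only on the two outputs shown in this post: the bad read contained "SG_IO: bad/missing sense data", while the good one contained just "succeeded". The function name sector_read_ok is my own, and other hdparm versions may word their messages differently.

```shell
#!/bin/sh
# sector_read_ok: read the output of `hdparm --read-sector N DEV` on stdin.
# Returns 0 if the read looks clean, 1 otherwise.
# Assumption: error and success messages look like the samples in this post.
sector_read_ok() {
    out=$(cat)
    case $out in
        *SG_IO*|*"sense data"*) return 1 ;;   # the drive reported an error
    esac
    case $out in
        *succeeded*) return 0 ;;              # clean read
    esac
    return 1                                  # anything else: treat as failure
}
```

Typical use would be: hdparm --read-sector 1915 /dev/sde 2>&1 | sector_read_ok && echo "sector OK"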

Now re-add the disk to the RAID and let the RAID do its magic: the resync rewrites the sector with the correct data from the mirror.
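
How you re-add the disk depends on your RAID implementation. As a sketch, assuming Linux software RAID managed with mdadm, the helper below only prints the commands so you can review them before running anything. The function name readd_plan and the example names /dev/md0 and /dev/sde1 are hypothetical; use your own array and member.

```shell
#!/bin/sh
# readd_plan: print (do not run) the mdadm commands to put a failed
# member back into an array.
# Assumption: Linux md software RAID; $1 = array device, $2 = member device.
readd_plan() {
    printf 'mdadm %s --remove %s\n' "$1" "$2"   # drop the member marked faulty
    printf 'mdadm %s --re-add %s\n' "$1" "$2"   # put it back into the array
    printf 'cat /proc/mdstat\n'                 # watch the resync progress
}
```

For example, readd_plan /dev/md0 /dev/sde1 prints the three commands in order. If mdadm refuses the --re-add, --add is the usual fallback and triggers a full rebuild of that member.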

Note: This is just a band-aid solution until you replace the disk. If a disk has one bad sector, that is usually a sign of more to follow.
