[ URL ] : https://notes.fr/2024/05/26/RAID-Change-disk-defore-it-dies/
[ CATEGORY ] : /Linux/Administration

[ TAGS ] : RAID, Software, dmesg, lsblk, lsscsi, mdadm, mdstat, recovery, smartctl

[ UPDATED ] : 2024-06-02 12:46

[ AUTHOR ] : 4l1N3

Change a RAID disk before it dies

Replace a disk in a software RAID array before it dies.

Check disk health

Useful commands:

# dmesg | grep 'sd[a-z]'            // Display kernel log lines that mention a disk
# lsscsi -s                         // Display information about SCSI devices, with sizes
# lsblk                             // Display information about disks and partitions
# smartctl -a /dev/sdX -d cciss,N   // Display S.M.A.R.T. information (-d cciss,N: disk number N behind an HP Smart Array controller)
# cat /proc/mdstat                  // Display RAID arrays and their member disks
# mdadm --detail /dev/mdX           // Display the details of a RAID array

TL;DR How to change a RAID disk

Procedure to replace a disk in a RAID array before it fails:

# cat /proc/mdstat                            // Display RAID arrays and their member disks
# fdisk -l                                    // Display disks and partition tables
# lsblk                                       // Show the RAID arrays and the disk names
# smartctl -a /dev/sdX -d cciss,N             // Display S.M.A.R.T. data to find the failing disk
# dmesg | grep 'sd[a-z]'                      // Display the kernel log to find the failing disk

# badblocks -vs /dev/sdX                      // Read-only scan: the activity LED blinks, which helps locate the old disk in the bay

# mdadm --detail /dev/md127                   // Check the array state
# mdadm --manage /dev/md127 --fail /dev/sdX   // Mark the disk as failed
# mdadm --manage /dev/md127 --remove /dev/sdX // Remove the disk from the RAID (software side)

➜ PHYSICALLY REPLACE THE DISK

# dmesg                                       // Check the device name assigned to the new disk
# sfdisk -d /dev/sdY | sfdisk /dev/sdX        // Copy the partition table from a healthy disk (sdY) of the same RAID to the new one (sdX); only needed when the members are partitions
# mdadm --add /dev/md127 /dev/sdX             // Add the new disk to the RAID
# cat /proc/mdstat                            // Check the rebuild

Some examples

The outputs below show signs of an impending disk failure:

Display kernel system logs:

# dmesg | grep 'sd[a-z]'
[23806611.537971] sd 0:0:3:0: [sdc] tag#238 Sense Key : Recovered Error [current]
[23806611.538821] sd 0:0:3:0: [sdc] tag#238 Add. Sense: Recovered data with linking
[23806887.066626] sd 0:0:3:0: [sdc] tag#7   Sense Key : Recovered Error [current]
[23806887.067510] sd 0:0:3:0: [sdc] tag#7   Add. Sense: Recovered data with linking
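
A steady stream of "Recovered Error" messages for the same disk is easier to judge once it is counted. The following is a minimal sketch that tallies these messages per device from kernel log text on standard input. The function name count_recovered_errors is hypothetical, and the sketch assumes the message format shown above.

```shell
#!/bin/sh
# count_recovered_errors (hypothetical helper): read kernel log text on stdin
# and print "<device> <count>" for every disk that reported a
# "Sense Key : Recovered Error" line.
count_recovered_errors() {
    awk '
        /Sense Key : Recovered Error/ {
            # The device name is the bracketed token, e.g. "[sdc]".
            if (match($0, /\[sd[a-z]+\]/)) {
                dev = substr($0, RSTART + 1, RLENGTH - 2)
                count[dev]++
            }
        }
        END {
            for (dev in count) print dev, count[dev]
        }
    ' | sort
}

# Usage on a live system (usually as root):
#   dmesg | count_recovered_errors
```

With the four log lines above as input, only the two "Sense Key" lines are counted, so the sketch reports two recovered errors for sdc.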

Display information about SCSI devices:

# lsscsi -s
[...]
[0:0:1:0]    disk    HP       AB0300CDEFG      HPD3  /dev/sda    300GB
[0:0:2:0]    disk    HP       AB0300CDEFG      HPD3  /dev/sdb    300GB
[0:0:3:0]    disk    HP       AB0600CDHIJ      HPD2  /dev/sdc    600GB
[0:0:4:0]    disk    HP       AB0600CDHIJ      HPD2  /dev/sdd    600GB
[0:0:5:0]    disk    HP       AB0600CDHIJ      HPD2  /dev/sde    600GB
[0:0:6:0]    disk    HP       AB0600CDHIJ      HPD2  /dev/sdf    600GB
[...]
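
The kernel log tags each message with a SCSI address (for example "sd 0:0:3:0:"), and lsscsi prints the same address in its first column. The sketch below looks up the address of a given device node in lsscsi -s output, which helps confirm that a log message and a device name refer to the same disk. The name scsi_address_of is hypothetical, and the sketch assumes the column layout shown above, with the device node in the second-to-last column.

```shell
#!/bin/sh
# scsi_address_of DEVICE (hypothetical helper): read "lsscsi -s" output on
# stdin and print the H:C:T:L address of DEVICE, without the brackets.
scsi_address_of() {
    awk -v dev="$1" '
        # With -s, the last column is the size and the one before it
        # is the device node.
        NF >= 2 && $(NF - 1) == dev {
            print substr($1, 2, length($1) - 2)
        }
    '
}

# Usage on a live system:
#   lsscsi -s | scsi_address_of /dev/sdc
```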

Display information about disks and partitions:

# lsblk
NAME                 MAJ:MIN RM   SIZE RO TYPE   MOUNTPOINT
sda                    8:0    0 279.4G  0 disk
├─sda1                 8:1    0   400M  0 part
│ └─md0                9:0    0   400M  0 raid1  /boot
└─sda2                 8:2    0   279G  0 part
  └─md1                9:1    0 278.9G  0 raid1
    ├─rootvg-rootvol 253:0    0 275.9G  0 lvm    /
    └─rootvg-swapvol 253:1    0     3G  0 lvm    [SWAP]
sdb                    8:16   0 279.4G  0 disk
├─sdb1                 8:17   0   400M  0 part
│ └─md0                9:0    0   400M  0 raid1  /boot
└─sdb2                 8:18   0   279G  0 part
  └─md1                9:1    0 278.9G  0 raid1
    ├─rootvg-rootvol 253:0    0 275.9G  0 lvm    /
    └─rootvg-swapvol 253:1    0     3G  0 lvm    [SWAP]
sdc                    8:32   0 558.9G  0 disk
└─md127                9:127  0   1.1T  0 raid10
  └─md127p1          259:0    0   1.1T  0 md     /users-folders
sdd                    8:48   0 558.9G  0 disk
└─md127                9:127  0   1.1T  0 raid10
  └─md127p1          259:0    0   1.1T  0 md     /users-folders
sde                    8:64   0 558.9G  0 disk
└─md127                9:127  0   1.1T  0 raid10
  └─md127p1          259:0    0   1.1T  0 md     /users-folders
sdf                    8:80   0 558.9G  0 disk
└─md127                9:127  0   1.1T  0 raid10
  └─md127p1          259:0    0   1.1T  0 md     /users-folders

Display RAID arrays and their member disks:

# cat /proc/mdstat
Personalities : [raid1] [raid10]
md127 : active raid10 sdd[1] sde[2] sdc[0] sdf[3]
      1171860480 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 2/9 pages [8KB], 65536KB chunk

md1 : active raid1 sdb2[1] sda2[0]
      292426752 blocks super 1.2 [2/2] [UU]
      bitmap: 2/3 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[1] sda1[0]
      409536 blocks super 1.0 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk
        
unused devices: <none>
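
In /proc/mdstat, each array ends its status line with a member map such as [UUUU]. A "U" is a working member and a "_" is a missing or failed one. The sketch below prints only the arrays whose map contains a "_". The name degraded_arrays is hypothetical, and the parsing assumes the layout shown above.

```shell
#!/bin/sh
# degraded_arrays (hypothetical helper): read /proc/mdstat text on stdin and
# print "<array> <member map>" for every array with a missing member.
degraded_arrays() {
    awk '
        # Remember the array named on a line such as "md127 : active raid10 ...".
        /^md[0-9]+ :/ { array = $1 }
        # The following status line carries the member map, e.g. [UUUU] or [_UUU].
        array != "" && match($0, /\[[U_]+\]/) {
            map = substr($0, RSTART, RLENGTH)
            if (index(map, "_") > 0) print array, map
            array = ""
        }
    '
}

# Usage on a live system:
#   degraded_arrays < /proc/mdstat
```

With the healthy output above, the sketch prints nothing. No output means every array has all its members.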

Display S.M.A.R.T. information for a disk (here sdc):

# smartctl -a /dev/sdc -d cciss,2

smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-348.el8.x86_64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HP
Product:              AB0600CDHIJ
Revision:             HPD2
Compliance:           SPC-4
User Capacity:        600,127,266,816 bytes [600 GB]
Logical block size:   512 bytes
Rotation Rate:        15052 rpm
Form Factor:          2.5 inches
Logical Unit id:      yyyyyyyyyyyyyyyyyy
Serial number:        xxxxxxxxxxxxxxxxxxxx
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Fri May 24 08:11:40 2024 UTC
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: HARDWARE IMPENDING FAILURE TOO MANY BLOCK REASSIGNS [asc=5d, ascq=14]

Current Drive Temperature:     38 C
Drive Trip Temperature:        60 C

Manufactured in week 01 of year 2016
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  42
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  2940
Elements in grown defect list: 8000
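
The key line in this output is "SMART Health Status". The sketch below extracts it so that a script can react to it. It prints the status text and returns a non-zero exit code unless the status is exactly "OK". The name smart_health is hypothetical, and the sketch only handles the SCSI/SAS output format shown above (ATA disks report their health on a differently worded line).

```shell
#!/bin/sh
# smart_health (hypothetical helper): read "smartctl -a" output on stdin,
# print the text that follows "SMART Health Status:" and succeed only
# when that text is exactly "OK".
smart_health() {
    health=$(sed -n 's/^SMART Health Status:[[:space:]]*//p' \
        | sed 's/[[:space:]]*$//' \
        | head -n 1)
    printf '%s\n' "$health"
    [ "$health" = "OK" ]
}

# Usage on a live system (device and cciss index are examples):
#   smartctl -a /dev/sdc -d cciss,2 | smart_health || echo "replace this disk"
```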

Display the details of a RAID array:

# mdadm --detail /dev/md127

/dev/md127:
           Version : 1.2
     Creation Time : Mon Jun  6 13:59:49 2016
        Raid Level : raid10
        Array Size : 1171860480 (1117.57 GiB 1199.99 GB)
     Used Dev Size : 585930240 (558.79 GiB 599.99 GB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Fri May 24 09:06:38 2024
             State : clean
    Active Devices : 4
   Working Devices : 4
    Failed Devices : 0
     Spare Devices : 0

            Layout : near=2
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : 2
              UUID : aaaaaaaa:bbbbbbbb:cccccccc:dddddddd
            Events : 193414

    Number   Major   Minor   RaidDevice State
       0       8       32        0      active sync set-A   /dev/sdc
       1       8       48        1      active sync set-B   /dev/sdd
       2       8       64        2      active sync set-A   /dev/sde
       3       8       80        3      active sync set-B   /dev/sdf
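
Before failing a disk, it is worth confirming that it really belongs to the array and that the other members are in "active sync". The sketch below reduces the device table at the end of the mdadm --detail output to one "device: state" line per member. The name array_members is hypothetical, and the parsing assumes the table layout shown above.

```shell
#!/bin/sh
# array_members (hypothetical helper): read "mdadm --detail" output on stdin
# and print "<device>: <state>" for each row of the device table.
array_members() {
    awk '
        # The table starts after the "Number Major Minor RaidDevice State" header.
        $1 == "Number" && $2 == "Major" { in_table = 1; next }
        # A member row has four numeric columns, the state words, then the device.
        in_table && NF >= 6 {
            state = $5
            for (i = 6; i < NF; i++) state = state " " $i
            print $NF ": " state
        }
    '
}

# Usage on a live system:
#   mdadm --detail /dev/md127 | array_members
```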

How to change a RAID disk?

Procedure to replace a RAID disk before it fails:

# cat /proc/mdstat  
# fdisk -l

Generate activity on the disk so that its LED blinks and you can locate it in the bay:

# badblocks -vs /dev/sdc   // Read-only scan, used here only to make the activity LED blink

Mark the disk as failed, remove it from the RAID, and wait a little:

# mdadm --detail /dev/md127
# mdadm --manage /dev/md127 --fail /dev/sdc
# mdadm --manage /dev/md127 --remove /dev/sdc
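
Failing the wrong disk on a RAID is costly, so it can help to print the commands before running them. The sketch below wraps the fail and remove steps in a function that, by default, only prints what it would do. The function name retire_disk and the DRY_RUN variable are hypothetical conventions of this sketch, not mdadm features.

```shell
#!/bin/sh
# retire_disk ARRAY DEVICE (hypothetical helper): mark DEVICE as failed in
# ARRAY, then remove it. With DRY_RUN=1 (the default in this sketch) the
# mdadm commands are printed instead of executed.
retire_disk() {
    array=$1
    device=$2
    if [ -z "$array" ] || [ -z "$device" ]; then
        echo "usage: retire_disk /dev/mdN /dev/sdX" >&2
        return 2
    fi
    for action in --fail --remove; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "mdadm --manage $array $action $device"
        else
            mdadm --manage "$array" "$action" "$device" || return 1
        fi
    done
}

# Usage:
#   retire_disk /dev/md127 /dev/sdc             # print the two commands
#   DRY_RUN=0 retire_disk /dev/md127 /dev/sdc   # actually run them (as root)
```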

➜ CHANGE THE DISK

Check the device name assigned to the new disk (it is usually the same as before). Any of these commands can help:

# cat /proc/mdstat
# fdisk -l
# dmesg
# journalctl -k

Here with dmesg:

# dmesg
[24283331.379917] sd 0:0:8:0: [sdc] Attached SCSI disk
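
On a busy server the "Attached SCSI disk" line can be buried in the kernel log. The sketch below prints the name of the most recently attached disk from kernel log text on standard input. The name last_attached_disk is hypothetical, and the sketch assumes the message format shown above.

```shell
#!/bin/sh
# last_attached_disk (hypothetical helper): read kernel log text on stdin and
# print the name of the last disk reported as "Attached SCSI disk".
last_attached_disk() {
    sed -n 's/.*\[\(sd[a-z][a-z]*\)\] Attached SCSI disk.*/\1/p' | tail -n 1
}

# Usage on a live system (usually as root):
#   dmesg | last_attached_disk
```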

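Then copy the partition table from a healthy disk of the same RAID to the new one. This step only applies when the array members are partitions (such as sda1 and sdb1 for md0). When the members are whole disks, as with md127 here, there is no partition table to copy: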

# sfdisk -d /dev/sdd | sfdisk /dev/sdc

And add it to the RAID:

# mdadm --add /dev/md127 /dev/sdc

If everything went well, you will see something like this:

# cat /proc/mdstat
[...]
      [=============>.......]  recovery = 66.6%
[...]

And finally, once the rebuild is complete:

Personalities : [raid1] [raid10]
md127 : active raid10 sdc[4] sdd[1] sde[2] sdf[3]
      1171860480 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 3/9 pages [12KB], 65536KB chunk
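
While the rebuild runs, /proc/mdstat shows a "recovery = NN.N%" line under the array. The sketch below extracts that percentage per array so that it can be logged or polled. The name rebuild_progress is hypothetical, and the parsing assumes the mdstat layout shown above.

```shell
#!/bin/sh
# rebuild_progress (hypothetical helper): read /proc/mdstat text on stdin and
# print "<array> <percent>" for every array with a recovery or resync running.
rebuild_progress() {
    awk '
        # Remember the array named on a line such as "md127 : active raid10 ...".
        /^md[0-9]+ :/ { array = $1 }
        match($0, /(recovery|resync) *= *[0-9.]+%/) {
            progress = substr($0, RSTART, RLENGTH)
            sub(/^[a-z]+ *= */, "", progress)   # keep only "66.6%"
            print array, progress
        }
    '
}

# Usage on a live system: poll once a minute until nothing is rebuilding.
#   while [ -n "$(rebuild_progress < /proc/mdstat)" ]; do sleep 60; done
```

An alternative to polling is mdadm --wait /dev/md127, which blocks until the recovery or resync on that array has finished.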

Documentation

MAN & Internet