Change a RAID disk before it dies

Replace a disk in a software RAID array before it fails.

Check disk health

Useful commands:

# dmesg | grep 'sd[a-z]'            // Display kernel log messages about the disks
# lsscsi -s                         // Display information about SCSI devices
# lsblk                             // Display information about disks and partitions
# smartctl -a /dev/sdX -d cciss,N   // Display S.M.A.R.T. information for disk N behind an HP Smart Array controller
# cat /proc/mdstat                  // Display RAID arrays and their member disks
# mdadm --detail /dev/mdX           // Display the details of a RAID array
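
With several arrays on one machine, the status field of /proc/mdstat ([UU], [UUUU], ...) can be scanned for a "_", which marks a missing or failed member. A minimal sketch, assuming the usual /proc/mdstat layout; the function name degraded_arrays is made up for this example and is not part of mdadm:

```shell
#!/bin/sh
# Hypothetical helper: print the name of every md array whose status
# field (e.g. [_UUU]) contains "_". Reads /proc/mdstat-style text on stdin.
degraded_arrays() {
    awk '
        /^md[^ ]* :/ { name = $1; next }               # "md127 : active raid10 ..." line
        /blocks/ && /\[[U_]*_[U_]*\]/ { print name }   # "... [4/3] [_UUU]" line
    '
}

# Usage on a live system:
#   degraded_arrays < /proc/mdstat
```

An empty result means every array still has all its members. An array that is rebuilding is also reported until the recovery has finished.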

TL;DR How to change a RAID disk

Procedure to replace a disk in a RAID array before it fails:

# cat /proc/mdstat                            // Display RAID arrays and their member disks
# fdisk -l                                    // Display disks and partition tables
# lsblk                                       // Show the RAID arrays and the disk names
# smartctl -a /dev/sdX -d cciss,N             // Display S.M.A.R.T. data to find the failing disk
# dmesg | grep 'sd[a-z]'                      // Display the kernel log to find the failing disk

# badblocks -vs /dev/sdX                      // Read-only scan: the activity LED blinks, which shows where the old disk sits in the bay

# mdadm --detail /dev/md127                   // Check the state of the array
# mdadm --manage /dev/md127 --fail /dev/sdX   // Mark the disk as failed
# mdadm --manage /dev/md127 --remove /dev/sdX // Remove the disk from the array (software side)

PHYSICALLY REPLACE THE DISK

# dmesg                                       // Check the device name assigned to the new disk
# sfdisk -d /dev/sdY | sfdisk /dev/sdX        // Copy the partition table from a healthy disk (sdY) of the same array to the new one (sdX)
# mdadm --add /dev/md127 /dev/sdX             // Add the new disk to the array
# cat /proc/mdstat                            // Check that the rebuild has started

Some examples

Here are some examples of output that point to an impending disk failure:

Display kernel system logs:

# dmesg | grep 'sd[a-z]'
[23806611.537971] sd 0:0:3:0: [sdc] tag#238 Sense Key : Recovered Error [current]
[23806611.538821] sd 0:0:3:0: [sdc] tag#238 Add. Sense: Recovered data with linking
[23806887.066626] sd 0:0:3:0: [sdc] tag#7 Sense Key : Recovered Error [current]
[23806887.067510] sd 0:0:3:0: [sdc] tag#7 Add. Sense: Recovered data with linking
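
A disk that logs more and more "Recovered Error" messages is worth watching. The lines above can be counted per disk with standard tools. This is only a sketch, assuming the kernel log format shown above; the function name recovered_errors_by_disk is made up for this example:

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin and print
# "<count> <disk>" for every disk that logged a "Recovered Error" sense key.
recovered_errors_by_disk() {
    grep 'Sense Key : Recovered Error' |
        sed -n 's/.*\[\(sd[a-z]*\)\].*/\1/p' |   # keep only the disk name, e.g. sdc
        sort | uniq -c |
        awk '{ print $1, $2 }'                   # normalise the spacing of uniq -c
}

# Usage on a live system:
#   dmesg | recovered_errors_by_disk
```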

Display information about SCSI devices:

# lsscsi -s
[...]
[0:0:1:0] disk HP AB0300CDEFG HPD3 /dev/sda 300GB
[0:0:2:0] disk HP AB0300CDEFG HPD3 /dev/sdb 300GB
[0:0:3:0] disk HP AB0600CDHIJ HPD2 /dev/sdc 600GB
[0:0:4:0] disk HP AB0600CDHIJ HPD2 /dev/sdd 600GB
[0:0:5:0] disk HP AB0600CDHIJ HPD2 /dev/sde 600GB
[0:0:6:0] disk HP AB0600CDHIJ HPD2 /dev/sdf 600GB
[...]

Display information about disks and partitions:

# lsblk
NAME                 MAJ:MIN RM   SIZE RO TYPE   MOUNTPOINT
sda                    8:0    0 279.4G  0 disk
├─sda1                 8:1    0   400M  0 part
│ └─md0                9:0    0   400M  0 raid1  /boot
└─sda2                 8:2    0   279G  0 part
  └─md1                9:1    0 278.9G  0 raid1
    ├─rootvg-rootvol 253:0    0 275.9G  0 lvm    /
    └─rootvg-swapvol 253:1    0     3G  0 lvm    [SWAP]
sdb                    8:16   0 279.4G  0 disk
├─sdb1                 8:17   0   400M  0 part
│ └─md0                9:0    0   400M  0 raid1  /boot
└─sdb2                 8:18   0   279G  0 part
  └─md1                9:1    0 278.9G  0 raid1
    ├─rootvg-rootvol 253:0    0 275.9G  0 lvm    /
    └─rootvg-swapvol 253:1    0     3G  0 lvm    [SWAP]
sdc                    8:32   0 558.9G  0 disk
└─md127                9:127  0   1.1T  0 raid10
  └─md127p1          259:0    0   1.1T  0 md     /users-folders
sdd                    8:48   0 558.9G  0 disk
└─md127                9:127  0   1.1T  0 raid10
  └─md127p1          259:0    0   1.1T  0 md     /users-folders
sde                    8:64   0 558.9G  0 disk
└─md127                9:127  0   1.1T  0 raid10
  └─md127p1          259:0    0   1.1T  0 md     /users-folders
sdf                    8:80   0 558.9G  0 disk
└─md127                9:127  0   1.1T  0 raid10
  └─md127p1          259:0    0   1.1T  0 md     /users-folders

Display disks and RAID:

# cat /proc/mdstat
Personalities : [raid1] [raid10]
md127 : active raid10 sdd[1] sde[2] sdc[0] sdf[3]
      1171860480 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 2/9 pages [8KB], 65536KB chunk

md1 : active raid1 sdb2[1] sda2[0]
      292426752 blocks super 1.2 [2/2] [UU]
      bitmap: 2/3 pages [8KB], 65536KB chunk

md0 : active raid1 sdb1[1] sda1[0]
      409536 blocks super 1.0 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

unused devices: <none>

Display S.M.A.R.T. information for a disk (here sdc):

# smartctl -a /dev/sdc -d cciss,2

smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-348.el8.x86_64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: HP
Product: AB0600CDHIJ
Revision: HPD2
Compliance: SPC-4
User Capacity: 600,127,266,816 bytes [600 GB]
Logical block size: 512 bytes
Rotation Rate: 15052 rpm
Form Factor: 2.5 inches
Logical Unit id: yyyyyyyyyyyyyyyyyy
Serial number: xxxxxxxxxxxxxxxxxxxx
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Fri May 24 08:11:40 2024 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: HARDWARE IMPENDING FAILURE TOO MANY BLOCK REASSIGNS [asc=5d, ascq=14]

Current Drive Temperature: 38 C
Drive Trip Temperature: 60 C

Manufactured in week 01 of year 2016
Specified cycle count over device lifetime: 10000
Accumulated start-stop cycles: 42
Specified load-unload count over device lifetime: 300000
Accumulated load-unload cycles: 2940
Elements in grown defect list: 8000
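
The line to look at in this output is "SMART Health Status". It can be extracted for use in a script. A minimal sketch, assuming the SCSI/SAS output format shown above (ATA disks report their health on a differently worded line, which this does not handle); both function names are made up for this example:

```shell
#!/bin/sh
# Hypothetical helper: print the text that follows "SMART Health Status:"
# in smartctl output read from stdin (SCSI/SAS format only).
smart_health() {
    sed -n 's/^SMART Health Status:[[:space:]]*//p'
}

# Hypothetical helper: succeed when a status line is present and is not "OK".
# When no status line is found at all, it returns failure (nothing to report).
smart_needs_attention() {
    status=$(smart_health)
    [ -n "$status" ] && [ "$status" != "OK" ]
}

# Usage on a live system:
#   smartctl -a /dev/sdc -d cciss,2 | smart_needs_attention && echo "check sdc"
```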

Display the details of a RAID array:

# mdadm --detail /dev/md127

/dev/md127:
Version : 1.2
Creation Time : Mon Jun 6 13:59:49 2016
Raid Level : raid10
Array Size : 1171860480 (1117.57 GiB 1199.99 GB)
Used Dev Size : 585930240 (558.79 GiB 599.99 GB)
Raid Devices : 4
Total Devices : 4
Persistence : Superblock is persistent

Intent Bitmap : Internal

Update Time : Fri May 24 09:06:38 2024
State : clean
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0

Layout : near=2
Chunk Size : 512K

Consistency Policy : bitmap

Name : 2
UUID : aaaaaaaa:bbbbbbbb:cccccccc:dddddddd
Events : 193414

    Number   Major   Minor   RaidDevice State
       0       8       32        0      active sync set-A   /dev/sdc
       1       8       48        1      active sync set-B   /dev/sdd
       2       8       64        2      active sync set-A   /dev/sde
       3       8       80        3      active sync set-B   /dev/sdf
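
The device table at the end of this output shows which members are still in sync, which helps when choosing a healthy disk later in the procedure. A small sketch, assuming the table format shown above; the function name active_members is made up for this example:

```shell
#!/bin/sh
# Hypothetical helper: read "mdadm --detail" output on stdin and print
# the device path of every member whose state is "active sync".
active_members() {
    awk '/active sync/ && $NF ~ /^\/dev\// { print $NF }'
}

# Usage on a live system (healthy members other than the failing sdc):
#   mdadm --detail /dev/md127 | active_members | grep -v '^/dev/sdc$'
```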

How to change a RAID disk

Procedure to replace a RAID disk before it fails:

# cat /proc/mdstat  
# fdisk -l

Make the disk's activity LED blink to find where it sits in the bay:

# badblocks -vs /dev/sdc // Read-only scan that keeps the disk busy

Mark the disk as failed, remove it from the RAID array, and wait a little:

# mdadm --detail /dev/md127
# mdadm --manage /dev/md127 --fail /dev/sdc
# mdadm --manage /dev/md127 --remove /dev/sdc
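
Before pulling the disk out, it is worth confirming that the array no longer lists it. A minimal sketch, assuming the /proc/mdstat format shown earlier; the function name is_array_member is made up for this example:

```shell
#!/bin/sh
# Hypothetical helper: succeed if member $2 (e.g. sdc, or sdc1 when the
# members are partitions) is still listed for array $1 (e.g. md127).
# Reads /proc/mdstat-style text on stdin.
is_array_member() {
    awk -v array="$1" -v disk="$2" '
        $1 == array && $2 == ":" {
            for (i = 3; i <= NF; i++) {
                name = $i
                sub(/\[.*/, "", name)        # drop the "[N]" suffix and any "(F)" flag
                if (name == disk) found = 1
            }
        }
        END { exit (found ? 0 : 1) }
    '
}

# Usage on a live system:
#   is_array_member md127 sdc < /proc/mdstat || echo "sdc is out of md127"
```

A disk that has been failed but not yet removed still appears in /proc/mdstat with an (F) flag, so the check only passes after the --remove step.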

CHANGE THE DISK

Check the device name assigned to the new disk (it is usually the same as the old one). Any of these commands can help:

# cat /proc/mdstat  
# fdisk -l
# dmesg
# journalctl

Here with dmesg:

# dmesg
[24283331.379917] sd 0:0:8:0: [sdc] Attached SCSI disk
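
The name of the most recently attached disk can also be pulled out of the kernel log instead of being read by eye. A small sketch, assuming the "Attached SCSI disk" message format shown above; the function name last_attached_disk is made up for this example:

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin and print the name
# of the last disk reported as "Attached SCSI disk".
last_attached_disk() {
    sed -n 's/.*\[\(sd[a-z]*\)\] Attached SCSI disk.*/\1/p' | tail -n 1
}

# Usage on a live system:
#   dmesg | last_attached_disk
```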

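Then copy the partition table from a healthy disk of the same RAID array to the new one (this only matters when the array members are partitions; an array built on whole disks, like md127 here, has no partition table to copy):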

# sfdisk -d /dev/sdd | sfdisk /dev/sdc

And add it to the RAID:

# mdadm --add /dev/md127 /dev/sdc

If everything went well, you will see something like this:

# cat /proc/mdstat
[...]
[=============>.......]  recovery = 66.6%
[...]

And finally, once the rebuild is complete:

Personalities : [raid1] [raid10]
md127 : active raid10 sdc[4] sdd[1] sde[2] sdf[3]
      1171860480 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 3/9 pages [12KB], 65536KB chunk
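
The rebuild can take a while, so the "recovery" line of /proc/mdstat can be polled until it disappears. A minimal sketch, assuming the /proc/mdstat format shown above; the function names recovery_percent and wait_for_rebuild and their arguments are made up for this example:

```shell
#!/bin/sh
# Hypothetical helper: print the percentage found on each "recovery = NN.N%"
# line of /proc/mdstat-style text read from stdin (nothing if none is running).
recovery_percent() {
    sed -n 's/.*recovery *= *\([0-9.]*\)%.*/\1/p'
}

# Hypothetical helper: poll a file until it shows no recovery line any more.
# $1 = file to poll (default /proc/mdstat), $2 = seconds between polls (default 60)
wait_for_rebuild() {
    file=${1:-/proc/mdstat}
    interval=${2:-60}
    while pct=$(recovery_percent < "$file") && [ -n "$pct" ]; do
        echo "rebuild at ${pct}%"
        sleep "$interval"
    done
}

# Usage on a live system:
#   wait_for_rebuild /proc/mdstat 30 && cat /proc/mdstat
```

mdadm also has a --wait option (mdadm --wait /dev/md127) that blocks until the resync or recovery has finished; the loop above only adds progress output.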

Documentation

man pages and the Internet
