mdadm DegradedArray: software problem or hardware fault?

2024-6-9

I received the following email for every RAID array (md0, md1, md2) on a dedicated server at my hosting provider.

> This is an automatically generated mail message from mdadm running on
> example.com
>
> A DegradedArray event had been detected on md device /dev/md/2.
>
> Faithfully yours, etc.
>
> P.S. The /proc/mdstat file currently contains the following:
>
> Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
> md2 : active raid1 nvme0n1p3[0]
>       903479616 blocks super 1.2 [2/1] [U_]
>       bitmap: 7/7 pages [28KB], 65536KB chunk
>
> md0 : active raid1 nvme0n1p1[0]
>       33520640 blocks super 1.2 [2/1] [U_]
>
> md1 : active raid1 nvme0n1p2[0]
>       523264 blocks super 1.2 [2/1] [U_]
>
> unused devices: <none>

I can't tell whether this is a RAID sync problem or whether a drive is actually faulty. I would appreciate help from a Linux expert.
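
For reference, in /proc/mdstat the pair [2/1] means two members are expected and one is active, and [U_] means the first slot is up while the second is missing. The following is a minimal sketch, not part of the original report, that lists arrays with a missing member. The function name degraded_arrays is hypothetical, and it only understands the mdstat layout shown in the email above.

```shell
# Hypothetical helper: print every md array whose status field contains "_".
# $1: path to an mdstat-format file (defaults to the live /proc/mdstat).
degraded_arrays() {
    awk '
        # Remember which array the following status line belongs to.
        /^md[^ ]* : / { name = $1 }
        # The member map looks like [UU] or [U_]; "_" marks a missing member.
        name != "" && match($0, /\[[U_]+\]/) {
            status = substr($0, RSTART, RLENGTH)
            if (index(status, "_")) print name, status
            name = ""
        }
    ' "${1:-/proc/mdstat}"
}
```

Calling `degraded_arrays` without arguments reads the live /proc/mdstat. This only shows that a member is missing; it does not say why the member dropped out.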

Two Samsung NVMe devices are running in software RAID with mdadm.

$ lsblk
nvme1n1     259:0    0 894.3G  0 disk
├─nvme1n1p1 259:2    0    32G  0 part
├─nvme1n1p2 259:3    0   512M  0 part
└─nvme1n1p3 259:4    0 861.8G  0 part
nvme0n1     259:1    0 894.3G  0 disk
├─nvme0n1p1 259:5    0    32G  0 part
│ └─md0       9:0    0    32G  0 raid1 [SWAP]
├─nvme0n1p2 259:6    0   512M  0 part
│ └─md1       9:1    0   511M  0 raid1 /boot
└─nvme0n1p3 259:7    0 861.8G  0 part
  └─md2       9:2    0 861.6G  0 raid1 /

As the listing shows, nvme1n1 and its partitions are not part of any RAID array. The operating system evidently still recognizes nvme1n1.

$ dmesg 
[ 7664.380493] pcieport 0000:00:1b.4: AER: Corrected error received: 0000:00:1b.4
[ 7664.380514] pcieport 0000:00:1b.4: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 7664.380795] pcieport 0000:00:1b.4: AER:   device [8086:a32c] error status/mask=00000001/00002000
[ 7664.381066] pcieport 0000:00:1b.4: AER:    [ 0] RxErr
[ 7664.780438] pcieport 0000:00:1b.4: AER: Corrected error received: 0000:00:1b.4
[ 7664.780459] pcieport 0000:00:1b.4: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 7664.780739] pcieport 0000:00:1b.4: AER:   device [8086:a32c] error status/mask=00000001/00002000
[ 7664.781011] pcieport 0000:00:1b.4: AER:    [ 0] RxErr
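
To see how often these corrected errors occur and which port reports them, the log can be summarized. This is a rough sketch: the function name aer_corrected_counts is hypothetical, and the pattern assumes the exact message wording shown above, which other kernel versions may phrase differently.

```shell
# Hypothetical helper: count "Corrected error received" AER lines per PCI address.
# Reads kernel log text on stdin and prints "<count> <pci-address>" lines.
aer_corrected_counts() {
    awk '
        # The reporting device address is the last field of the matching line.
        /AER: Corrected error received:/ { count[$NF]++ }
        END { for (dev in count) print count[dev], dev }
    ' | sort -rn
}
```

It can be fed with `dmesg | aer_corrected_counts`. The excerpt above would yield a count of 2 for 0000:00:1b.4. Whether the missing NVMe drive sits behind that port is not visible from this log alone.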

lspci shows both NVMe devices:

$ lspci
01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
03:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983

The mdadm details for md0, as an example:

$ mdadm -D /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Sat Aug  7 19:34:45 2021
        Raid Level : raid1
        Array Size : 33520640 (31.97 GiB 34.33 GB)
     Used Dev Size : 33520640 (31.97 GiB 34.33 GB)
      Raid Devices : 2
     Total Devices : 1
       Persistence : Superblock is persistent

       Update Time : Fri Mar  4 17:42:37 2022
             State : clean, degraded
    Active Devices : 1
   Working Devices : 1
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : resync

              Name : rescue:0
              UUID : 2e61cb41:dee3a004:b12de575:72c13ed0
            Events : 46

    Number   Major   Minor   RaidDevice State
       0     259        2        0      active sync   /dev/nvme0n1p1
       -       0        0        1      removed

The device /dev/nvme1n1p1 does not appear here. What does that mean for me?

My mdadm.conf file:

# mdadm.conf
#
# !NB! Run update-initramfs -u after updating this file.
# !NB! This will ensure that initramfs has an uptodate copy.
#
# Please refer to mdadm.conf(5) for information about this file.
#

# by default (built-in), scan all partitions (/proc/partitions) and all
# containers for MD superblocks. alternatively, specify devices to scan, using
# wildcards if desired.
#DEVICE partitions containers

# automatically tag new arrays as belonging to the local system
HOMEHOST <system>

# instruct the monitoring daemon where to send mail alerts
MAILADDR root

# definitions of existing MD arrays
ARRAY /dev/md/0  metadata=1.2 UUID=2e61cb41:dee3a004:b12de575:72c13ed0 name=rescue:0
ARRAY /dev/md/1  metadata=1.2 UUID=455ba7de:599eb665:202c1fe8:33c709f4 name=rescue:1
ARRAY /dev/md/2  metadata=1.2 UUID=c1f88478:e4ed5e8d:56f296cc:38e97b8c name=rescue:2
ARRAY /dev/md/0  metadata=1.2 UUID=e8c8f0cb:91007124:62e03226:94a707dc name=rescue:0
ARRAY /dev/md/1  metadata=1.2 UUID=a335efb7:cc52634c:3221294c:e7feb748 name=rescue:1
ARRAY /dev/md/2  metadata=1.2 UUID=f2a13b49:17f5e812:8e7c5adf:3114a929 name=rescue:2

# This configuration was auto-generated on Sat, 07 Aug 2021 19:35:14 +0200 by mkconf
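
The file above defines each of /dev/md/0, /dev/md/1 and /dev/md/2 twice with different UUIDs. The UUID that mdadm -D reports for md0 (2e61cb41:...) matches the first set, so the second set may be left over from an earlier installation; that is an assumption, not something the output proves. A small sketch to spot such duplicates follows. The function name duplicate_arrays is hypothetical, and the default path assumes a Debian-style layout.

```shell
# Hypothetical helper: list array device names that have more than one ARRAY line.
# $1: path to mdadm.conf (defaults to /etc/mdadm/mdadm.conf, the Debian location).
duplicate_arrays() {
    awk '
        # Count ARRAY definitions per device name (second field).
        $1 == "ARRAY" { n[$2]++ }
        END { for (dev in n) if (n[dev] > 1) print dev, n[dev] }
    ' "${1:-/etc/mdadm/mdadm.conf}" | sort
}
```

The ARRAY lines for the arrays that are actually running can be printed with `mdadm --detail --scan` and compared against the file by hand.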

I hope you can help me.

Answer 1

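This is a hardware-level error. Since this is a hosted server, ask the provider to replace the faulty hardware. Don't struggle to fix it yourself; just have it replaced. That is what you are paying for.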

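You will need to schedule downtime with the hosting provider.
Make sure you are 100% certain which disk is faulty. (I have had a good disk replaced before by a provider who should have known better. Luckily I was running RAID6 and could handle the second "failure".)
If possible, make a backup in case something goes wrong. You should have backups anyway, so this simply gives you an extra copy.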
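
One way to make sure the right disk is named in the replacement request is to give the provider the serial number rather than the device name. The sketch below reads the serial and model that the kernel exposes under /sys/class/nvme. The function name nvme_serials is hypothetical, and it assumes that controller nvme1 backs namespace nvme1n1, which holds on simple setups but is not guaranteed everywhere.

```shell
# Hypothetical helper: print "<controller> <serial> <model>" for each NVMe controller.
# $1: sysfs root (defaults to /sys; overridable so it can be tried on a copy).
nvme_serials() {
    nvme_sysroot=${1:-/sys}
    for ctrl in "$nvme_sysroot"/class/nvme/nvme*; do
        # Skip the unexpanded glob and controllers without readable attributes.
        [ -r "$ctrl/serial" ] && [ -r "$ctrl/model" ] || continue
        # sysfs pads both values with spaces; strip the padding.
        printf '%s %s %s\n' "${ctrl##*/}" \
            "$(tr -d ' ' < "$ctrl/serial")" \
            "$(sed 's/ *$//' < "$ctrl/model")"
    done
}
```

Run `nvme_serials` and quote the serial of the controller that belongs to the missing array member. The same serial is normally printed on the drive label, so the technician can match it physically.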
