I have 5 JBODs connected to the controller through an LSI-SAS3008. I am running Arch Linux 4.14.41-1-lts with multipath-tools v0.7.6 (March 10, 2018).
My problem is that when a disk gets I/O errors and starts flapping, multipath keeps checking the disk and tries to remap the failed path.
Jul 23 04:59:51 FKM1 multipathd[5315]: 35000c50093d4e7c7: sdbe - tur checker timed out
Jul 23 04:59:51 FKM1 multipathd[5315]: checker failed path 67:128 in map 35000c50093d4e7c7
Jul 23 04:59:51 FKM1 multipathd[5315]: 35000c50093d4e7c7: remaining active paths: 0
Jul 23 04:59:51 FKM1 multipathd[5315]: sdbe: mark as failed
Jul 23 04:59:56 FKM1 multipathd[5315]: checker failed path 67:128 in map 35000c50093d4e7c7
Jul 23 05:04:37 FKM1 multipathd[5315]: 67:128: reinstated
Jul 23 05:04:37 FKM1 multipathd[5315]: 35000c50093d4e7c7: remaining active paths: 1
Jul 23 05:05:27 FKM1 multipathd[5315]: 35000c50093d4e7c7: sdbe - tur checker timed out
Jul 23 05:05:27 FKM1 multipathd[5315]: checker failed path 67:128 in map 35000c50093d4e7c7
Jul 23 05:05:27 FKM1 multipathd[5315]: 35000c50093d4e7c7: remaining active paths: 0
Jul 23 05:05:27 FKM1 multipathd[5315]: sdbe: mark as failed
Because of the bad disk, multipath tries to remap the path again every time the disk reappears.
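To put a number on the flapping, a rough sketch like the one below counts how often each path shows up in "checker failed path" messages. It only assumes the log format shown above; the function name `count_path_failures` is made up for illustration.

```shell
#!/bin/sh
# Count "checker failed path" events per path (major:minor) in multipathd
# log text read from stdin. Each output line is "<count> <major:minor>",
# most frequently failing path first.
count_path_failures() {
    awk '/checker failed path/ {
             # The path identifier is the token right after the word "path".
             for (i = 1; i < NF; i++)
                 if ($i == "path") { n[$(i + 1)]++; break }
         }
         END { for (p in n) print n[p], p }' | sort -rn
}

# Usage (assumes multipathd logs to the systemd journal):
#   journalctl -u multipathd | count_path_failures
```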
[Fri Aug 3 00:18:37 2018] alua: device handler registered
[Fri Aug 3 00:18:37 2018] emc: device handler registered
[Fri Aug 3 00:18:37 2018] rdac: device handler registered
[Fri Aug 3 00:18:37 2018] device-mapper: uevent: version 1.0.3
[Fri Aug 3 00:18:37 2018] device-mapper: ioctl: 4.37.0-ioctl (2017-09-20) initialised: [email protected]
[Fri Aug 3 00:18:43 2018] device-mapper: multipath service-time: version 0.3.0 loaded
[Fri Aug 3 00:18:43 2018] device-mapper: table: 254:0: multipath: error getting device
[Fri Aug 3 00:18:43 2018] device-mapper: ioctl: error adding target to table
[Fri Aug 3 00:18:43 2018] device-mapper: table: 254:0: multipath: error getting device
[Fri Aug 3 00:18:43 2018] device-mapper: ioctl: error adding target to table
[Fri Aug 3 00:21:19 2018] sd 12:0:16:0: attempting task abort! scmd(ffffa03a6c4de948)
[Fri Aug 3 00:21:19 2018] sd 12:0:16:0: [sdbh] tag#1 CDB: opcode=0x88 88 00 00 00 00 02 ba a0 f0 00 00 00 02 00 00 00
[Fri Aug 3 00:21:19 2018] scsi target12:0:16: handle(0x001c), sas_address(0x5000c50093d5135d), phy(8)
[Fri Aug 3 00:21:19 2018] scsi target12:0:16: enclosure_logical_id(0x500304800929f87f), slot(8)
[Fri Aug 3 00:21:19 2018] scsi target12:0:16: enclosure level(0x0001),connector name(1 )
[Fri Aug 3 00:21:19 2018] sd 12:0:16:0: task abort: SUCCESS scmd(ffffa03a6c4de948)
[Fri Aug 3 00:21:19 2018] sd 12:0:16:0: attempting task abort! scmd(ffffa07b2eb87d48)
[Fri Aug 3 00:21:19 2018] sd 12:0:16:0: [sdbh] tag#0 CDB: opcode=0x88 88 00 00 00 00 02 ba a0 f0 00 00 00 02 00 00 00
[Fri Aug 3 00:21:19 2018] scsi target12:0:16: handle(0x001c), sas_address(0x5000c50093d5135d), phy(8)
[Fri Aug 3 00:21:19 2018] scsi target12:0:16: enclosure_logical_id(0x500304800929f87f), slot(8)
[Fri Aug 3 00:21:19 2018] scsi target12:0:16: enclosure level(0x0001),connector name(1 )
[Fri Aug 3 00:21:19 2018] sd 12:0:16:0: task abort: SUCCESS scmd(ffffa07b2eb87d48)
[Fri Aug 3 00:21:21 2018] device-mapper: multipath: Failing path 67:176.
[Fri Aug 3 00:21:21 2018] sd 12:0:16:0: attempting task abort! scmd(ffffa03a89b38148)
[Fri Aug 3 00:21:21 2018] sd 12:0:16:0: [sdbh] tag#11 CDB: opcode=0x0 00 00 00 00 00 00
[Fri Aug 3 00:21:21 2018] scsi target12:0:16: handle(0x001c), sas_address(0x5000c50093d5135d), phy(8)
[Fri Aug 3 00:21:21 2018] scsi target12:0:16: enclosure_logical_id(0x500304800929f87f), slot(8)
[Fri Aug 3 00:21:21 2018] scsi target12:0:16: enclosure level(0x0001),connector name(1 )
[Fri Aug 3 00:21:21 2018] sd 12:0:16:0: task abort: SUCCESS scmd(ffffa03a89b38148)
[Fri Aug 3 00:21:26 2018] print_req_error: I/O error, dev dm-208, sector 11721044480
[Fri Aug 3 00:21:26 2018] print_req_error: I/O error, dev dm-208, sector 0
[Fri Aug 3 00:21:26 2018] print_req_error: I/O error, dev dm-208, sector 512
[Fri Aug 3 00:21:26 2018] print_req_error: I/O error, dev dm-208, sector 11721043968
[Fri Aug 3 00:21:26 2018] print_req_error: I/O error, dev dm-208, sector 11721044480
[Fri Aug 3 00:21:26 2018] print_req_error: I/O error, dev dm-208, sector 0
[Fri Aug 3 00:21:26 2018] print_req_error: I/O error, dev dm-208, sector 512
[Fri Aug 3 00:21:26 2018] print_req_error: I/O error, dev dm-208, sector 11721043968
[Fri Aug 3 00:21:26 2018] print_req_error: I/O error, dev dm-208, sector 11721044480
[Fri Aug 3 00:21:57 2018] sd 12:0:16:0: attempting task abort! scmd(ffffa03a89b3f148)
After a while the mpt3sas driver gives up and prepares to reset the LSI card, and then the loop continues.
[Fri Aug 3 00:18:12 2018] mpt3sas_cm3: iomem(0x00000000fbe40000), mapped(0xffffbe0e8dca0000), size(65536)
[Fri Aug 3 00:18:12 2018] mpt3sas_cm3: ioport(0x000000000000e000), size(256)
[Fri Aug 3 00:18:12 2018] usb 2-1-port6: over-current condition
[Fri Aug 3 00:18:12 2018] mpt3sas_cm3: sending message unit reset !!
[Fri Aug 3 00:18:12 2018] mpt3sas_cm3: message unit reset: SUCCESS
[Fri Aug 3 00:18:12 2018] mpt3sas_cm3: Allocated physical memory: size(20778 kB)
[Fri Aug 3 00:18:12 2018] mpt3sas_cm3: Current Controller Queue Depth(9564),Max Controller Queue Depth(9664)
[Fri Aug 3 00:18:12 2018] mpt3sas_cm3: Scatter Gather Elements per IO(128)
[Fri Aug 3 00:18:12 2018] usb 3-14.1: new low-speed USB device number 3 using xhci_hcd
[Fri Aug 3 00:18:12 2018] mpt3sas_cm3: LSISAS3008: FWVersion(15.00.02.00), ChipRevision(0x02), BiosVersion(08.35.00.00)
[Fri Aug 3 00:18:12 2018] mpt3sas_cm3: Protocol=(
[Fri Aug 3 00:18:12 2018] Initiator
[Fri Aug 3 00:18:12 2018] ,Target
[Fri Aug 3 00:18:12 2018] ),
[Fri Aug 3 00:18:12 2018] Capabilities=(
[Fri Aug 3 00:18:12 2018] TLR
[Fri Aug 3 00:18:12 2018] ,EEDP
[Fri Aug 3 00:18:12 2018] ,Snapshot Buffer
[Fri Aug 3 00:18:12 2018] ,Diag Trace Buffer
[Fri Aug 3 00:18:12 2018] ,Task Set Full
[Fri Aug 3 00:18:12 2018] ,NCQ
[Fri Aug 3 00:18:12 2018] )
[Fri Aug 3 00:18:12 2018] scsi host13: Fusion MPT SAS Host
[Fri Aug 3 00:18:12 2018] mpt3sas_cm3: sending port enable !!
[Fri Aug 3 00:18:12 2018] mpt3sas_cm4: 64 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (528262416 kB)
[Fri Aug 3 00:18:12 2018] mpt3sas_cm3: host_add: handle(0x0001), sas_addr(0x500605b00c482a80), phys(8)
[Fri Aug 3 00:18:12 2018] mpt3sas_cm3: expander_add: handle(0x0009), parent(0x0001), sas_addr(0x5003048017aed57f), phys(38)
[Fri Aug 3 00:18:12 2018] scsi 13:0:0:0: Direct-Access SEAGATE ST800FM0173 0007 PQ: 0 ANSI: 6
When mpt3sas sends a "diag reset", I lose the whole JBOD (90 disks) at the same time! So a single disk error can hang my ZFS pool.
Now I am looking for a solution. I think that if I could tell multipath "do not remap a path once the disk has failed 3 times", it would solve my problem: the pool would stop using the faulty disk, so that disk would no longer produce I/O errors.
In short, I am looking for a way to make a failed disk unusable.
I have seen a few settings in /etc/multipath.conf, but I am not sure whether they solve my problem. Can you tell me the best solution?
defaults {
user_friendly_names no
path_grouping_policy failover
polling_interval 10
path_selector "round-robin 0"
path_grouping_policy failover
path_checker readsector0
failback manual
no_path_retry 3
prio rdac
}
blacklist_exceptions {
property "(ID_WWN|SCSI_IDENT_.*|ID_SERIAL)"
}
This is the full dmesg log: https://paste.ubuntu.com/p/XZZ2CScmHP/
Answer 1
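It is not multipath that aborts those SCSI commands, it is the Linux kernel. When commands time out and cannot be completed in time, SCSI error handling kicks in and tries harder and harder to recover the disk, escalating through resets (all the way up to an HBA reset). You need to somehow persuade Linux to declare the disk dead sooner.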
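Before changing anything, it helps to see which timers are currently in effect. The sketch below prints the per-device command timeout and error-handler timeout that the kernel exposes in sysfs. The helper name `show_scsi_timeouts` and the optional sysfs-root argument are only illustrative, and `eh_timeout` may be missing on older kernels.

```shell
#!/bin/sh
# Print the SCSI command timeout and error-handler timeout (in seconds)
# for one disk. $1: kernel device name such as sdbe.
# $2: sysfs root, defaults to /sys (overridable for dry runs).
show_scsi_timeouts() {
    dev=$1
    sysfs=${2:-/sys}
    for attr in timeout eh_timeout; do
        f="$sysfs/block/$dev/device/$attr"
        # Skip attributes this kernel does not expose.
        if [ -r "$f" ]; then
            printf '%s %s=%s\n' "$dev" "$attr" "$(cat "$f")"
        fi
    done
}

# Usage:
#   show_scsi_timeouts sdbe
```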
You could also write a udev rule that reduces the timeout (https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/5/html/online_storage_reconfiguration_guide/task_controlling-scsi-command-timer-onlined-devices) so that the disk is simply declared offline, but this may take a lot of experimentation (the danger is that it may act on all paths).
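As a starting point for that experimentation, here is a minimal sketch with two helpers: one lowers the timeout of a single disk immediately (not persistent across reboots or re-adds), the other writes a udev rule in the style of the Red Hat guide. The function names, the file name `99-scsi-timeout.rules`, and the 10-second value in the usage notes are assumptions, not tested recommendations. As written, the rule applies to every SCSI disk, which is exactly the risk mentioned above; narrowing it with extra match keys is left to you.

```shell
#!/bin/sh
# Lower the command timeout of one disk right away.
# $1: device name (e.g. sdbe), $2: seconds, $3: sysfs root (defaults to /sys).
set_scsi_timeout() {
    printf '%s\n' "$2" > "${3:-/sys}/block/$1/device/timeout"
}

# Write a udev rule that sets the timeout on every SCSI disk
# (peripheral type 0) when it is added. udev expands %p to the device path.
# $1: seconds, $2: rules directory (defaults to /etc/udev/rules.d).
write_timeout_rule() {
    cat > "${2:-/etc/udev/rules.d}/99-scsi-timeout.rules" <<EOF
ACTION=="add", SUBSYSTEM=="scsi", ATTR{type}=="0", RUN+="/bin/sh -c 'echo $1 > /sys%p/timeout'"
EOF
}

# Usage (as root):
#   set_scsi_timeout sdbe 10
#   write_timeout_rule 10 && udevadm control --reload
```

Trying the value on one known-bad disk with `set_scsi_timeout` first is less risky than rolling out the rule to all 90 disks at once.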