How to identify a DIMM card problem from kernel messages

I have a RHEL 7 server, and I can see the following details in the dmesg log:

[13901018.980859] {9}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[13901018.980868] {9}[Hardware Error]: It has been corrected by h/w and requires no further action
[13901018.980870] {9}[Hardware Error]: event severity: corrected
[13901018.980872] {9}[Hardware Error]:  Error 0, type: corrected
[13901018.980873] {9}[Hardware Error]:  fru_text: A8
[13901018.980875] {9}[Hardware Error]:   section_type: memory error
[13901018.980876] {9}[Hardware Error]:   error_status: 0x0000000000000400
[13901018.980878] {9}[Hardware Error]:   physical_address: 0x0000000ffd6bb600
[13901018.980880] {9}[Hardware Error]:   node: 0 card: 3 module: 1 rank: 1 bank: 2 row: 30682 column: 728 
[13901018.980882] {9}[Hardware Error]:   error_type: 2, single-bit ECC
[13901018.980899] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[13901018.980901] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 1: 940000000000009f
[13901018.980903] EDAC sbridge MC0: TSC 89ad682bcacc05 
[13901018.980905] EDAC sbridge MC0: ADDR ffd6bb600 
[13901018.980906] EDAC sbridge MC0: MISC 0 
[13901018.980907] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1575370818 SOCKET 0 APIC 0
[13901019.271775] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#1_Chan#1_DIMM#1 (channel:5 slot:1 page:0xffd6bb offset:0x600 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:1 channel_mask:2 rank:5)
[13901059.217841] mce: [Hardware Error]: Machine check events logged
[13903720.090431] {10}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[13903720.090435] {10}[Hardware Error]: It has been corrected by h/w and requires no further action
[13903720.090436] {10}[Hardware Error]: event severity: corrected
[13903720.090438] {10}[Hardware Error]:  Error 0, type: corrected
[13903720.090439] {10}[Hardware Error]:  fru_text: A8
[13903720.090440] {10}[Hardware Error]:   section_type: memory error
[13903720.090441] {10}[Hardware Error]:   error_status: 0x0000000000000400
[13903720.090442] {10}[Hardware Error]:   physical_address: 0x0000000ffe47b640
[13903720.090445] {10}[Hardware Error]:   node: 0 card: 3 module: 1 rank: 1 bank: 2 row: 30705 column: 728 
[13903720.090446] {10}[Hardware Error]:   error_type: 2, single-bit ECC
[13903720.090456] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[13903720.090458] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 1: 940000000000009f
[13903720.090459] EDAC sbridge MC0: TSC 89b2cfb1432fce 
[13903720.090460] EDAC sbridge MC0: ADDR ffe47b640 
[13903720.090461] EDAC sbridge MC0: MISC 0 

From searching the Internet, this looks like a problem with a DIMM card, but I am still not sure.

Any thoughts on the kernel messages above?
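
One way to check whether the corrected errors keep landing on the same module is to tally the location fields in the log. The following is only a minimal sketch: the function name `tally_memory_errors` and the two regular expressions are my own assumptions, written to match the `fru_text:` and `EDAC MC0: ... on <label>` lines quoted above, and the sample list reuses three of those lines verbatim. How a value such as `A8` maps to a physical slot depends on the platform, so that has to be checked against the server's documentation.

```python
import re
from collections import Counter

# Patterns are assumptions based on the log lines quoted in this post.
FRU_RE = re.compile(r"fru_text:\s*(\S+)")
EDAC_RE = re.compile(r"EDAC MC\d+: \d+ (CE|UE) .*? on (\S+)")


def tally_memory_errors(lines):
    """Count APEI fru_text values and EDAC (severity, DIMM label) pairs."""
    fru_counts = Counter()
    edac_counts = Counter()
    for line in lines:
        match = FRU_RE.search(line)
        if match:
            fru_counts[match.group(1)] += 1
            continue
        match = EDAC_RE.search(line)
        if match:
            # CE = corrected error, UE = uncorrected error
            edac_counts[(match.group(1), match.group(2))] += 1
    return fru_counts, edac_counts


# Three lines copied from the dmesg output above.
SAMPLE = [
    "[13901018.980873] {9}[Hardware Error]:  fru_text: A8",
    "[13901019.271775] EDAC MC0: 0 CE memory read error on "
    "CPU_SrcID#0_Ha#1_Chan#1_DIMM#1 (channel:5 slot:1 page:0xffd6bb "
    "offset:0x600 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f "
    "socket:0 ha:1 channel_mask:2 rank:5)",
    "[13903720.090439] {10}[Hardware Error]:  fru_text: A8",
]

fru_counts, edac_counts = tally_memory_errors(SAMPLE)
print(fru_counts)
print(edac_counts)
```

If every corrected error reports the same `fru_text` and the same EDAC label, as in the two events above, that would at least be consistent with a single DIMM being the source.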

Other details from dmesg (these look network driver related, but may also be related to the DIMM card):

[81712386.762144] i40e 0000:82:00.0 p4p1: tx_timeout: VSI_seid: 395, Q 47, NTC: 0x19a, HWB: 0x19a, NTU: 0x182, TAIL: 0x19a, INT: 0x1
[81712386.762145] i40e 0000:82:00.0 p4p1: tx_timeout recovery level 1, hung_queue 47
[89254950.070885] traps: polkitd[111181] general protection ip:7f4d643b8cf2 sp:7fff401879c0 error:0 in libmozjs-17.0.so[7f4d6427a000+3b3000]
[90620196.068233] INFO: task kworker/15:2:76449 blocked for more than 120 seconds.
[90620196.068237] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[90620196.068239] kworker/15:2    D ffff88027c533dd8     0 76449      2 0x00000080
[90620196.068247]  ffff88027c533bf0 0000000000000046 ffff8826eff68000 ffff88027c533fd8
[90620196.068249]  ffff88027c533fd8 ffff88027c533fd8 ffff8826eff68000 ffff88027c533d58
[90620196.068251]  ffff88027c533d60 7fffffffffffffff ffff8826eff68000 ffff88027c533dd8
