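I need to call a script every time an EDAC error occurs on the system.
To do this I created the UDEV rule below. I want /var/tmp/test.sh to run whenever ce_count changes.
I ran udevadm control --reload-rules && udevadm trigger, watched with udevadm monitor, and injected errors with mce-inject. The errors were generated, but the script was not executed.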
# cat /etc/udev/rules.d/98-edac.rules
ACTION=="change", ATTR{ce_count}, KERNEL=="mc0", RUN+="/var/tmp/test.sh"
# udevadm info -ap /sys/devices/system/edac/mc/mc0
Udevadm info starts with the device specified by the devpath and then
walks up the chain of parent devices. It prints for every device
found, all possible attributes in the udev rules key format.
A rule to match, can be composed by the attributes of the device
and the attributes from one single parent device.
looking at device '/devices/system/edac/mc/mc0':
KERNEL=="mc0"
SUBSYSTEM=="mc0"
DRIVER==""
ATTR{ce_count}=="21"
ATTR{ce_noinfo_count}=="0"
ATTR{max_location}=="channel 7 slot 2 "
ATTR{mc_name}=="Broadwell Socket#0"
ATTR{seconds_since_reset}=="5223"
ATTR{size_mb}=="65536"
ATTR{ue_count}=="0"
ATTR{ue_noinfo_count}=="0"
looking at parent device '/devices/system/edac/mc':
KERNELS=="mc"
SUBSYSTEMS=="edac"
DRIVERS==""
looking at parent device '/devices/system/edac':
KERNELS=="edac"
SUBSYSTEMS==""
DRIVERS==""
To induce an EDAC/MCE failure, I used mce-inject as follows:
./mce-inject ./basic-inject.txt
# cat basic-inject.txt
CPU 0 BANK 8
STATUS corrected
ADDR 0x12345125
MCGCAP 0x7000c16
APICID 0
MCGSTATUS 0
SOCKETID 0
MISC 0x50683286
STATUS 0x8c00004000010090
After injecting the error, the kernel syslog/dmesg shows these log entries:
[ +4.436747] Starting machine check poll CPU 0
[ +0.000013] mce: [Hardware Error]: Machine check events logged
[ +0.000008] Machine check poll done on CPU 0
[ +0.000030] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ +0.000002] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 8: 8c00004000010090
[ +0.000001] EDAC sbridge MC0: TSC 0
[ +0.000002] EDAC sbridge MC0: ADDR 12345100
[ +0.000000] EDAC sbridge MC0: MISC 50683286
[ +0.000002] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1593625089 SOCKET 0 APIC 0
[ +0.000005] EDAC DEBUG: get_memory_error_data: SAD interleave package: 0 = CPU socket 0, HA 0, shiftup: 1
[ +0.000005] EDAC DEBUG: get_memory_error_data: TAD#0: address 0x0000000012345100 < 0x000000007fffffff, socket interleave 0, channel interleave 2 (offset 0x00000000), index 0, base ch: 2, ch mask: 0x04
[ +0.000007] EDAC DEBUG: get_memory_error_data: RIR#0, limit: 31.999 GB (0x00000007ffffffff), way: 4
[ +0.000002] EDAC DEBUG: get_memory_error_data: RIR#0: channel address 0x091a2880 < 0x7ffffffff, RIR interleave 2, index 1
[ +0.000002] EDAC DEBUG: sbridge_mce_output_error: area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:4 rank:4
[ +0.000007] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#1 (channel:2 slot:1 page:0x12345 offset:0x100 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:4 rank:4)
[Jul 1 17:41] perf: interrupt took too long (3923 > 3920), lowering kernel.perf_event_max_sample_rate to 50000
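One possible fallback, if it turns out that the kernel does not emit a change uevent when ce_count is updated (an assumption; nothing in the output above confirms it either way), is to poll the sysfs counter and call the script when the value differs. The sketch below is illustrative only: check_counter and watch_counter are hypothetical helper names, and the 5 second default interval is arbitrary. The sysfs path and /var/tmp/test.sh are the ones used in this post.

```shell
#!/bin/sh
# Hypothetical polling helpers; the names and the interval are assumptions.

# check_counter FILE LAST HANDLER
# Reads the counter in FILE. If it differs from LAST, runs HANDLER with the
# new value as its first argument. Always prints the current value on stdout
# so the caller can remember it for the next comparison.
check_counter() {
    file=$1
    last=$2
    handler=$3
    current=$(cat "$file") || return 1
    if [ "$current" != "$last" ]; then
        # Send handler output to stderr so it does not mix with our stdout.
        "$handler" "$current" >&2
    fi
    printf '%s\n' "$current"
}

# watch_counter FILE HANDLER [INTERVAL_SECONDS]
# Loops forever, checking FILE every INTERVAL_SECONDS (default 5).
watch_counter() {
    file=$1
    handler=$2
    interval=${3:-5}
    last=$(cat "$file") || return 1
    while :; do
        sleep "$interval"
        last=$(check_counter "$file" "$last" "$handler") || return 1
    done
}
```

To try it, one could run watch_counter /sys/devices/system/edac/mc/mc0/ce_count /var/tmp/test.sh 5 from a long-running service; the handler receives the new count as its first argument. Polling can miss the exact moment of an error, but it does not depend on udev seeing an event for the attribute.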