Debian: MPI code - Intel compiler - [Hardware Error]: Unified Memory Controller Error: DRAM ECC error

When I run an executable compiled with the Intel mpiicc wrapper, the following errors appear after about 30 minutes of execution:

 kernel:[29585.573874] [Hardware Error]: Corrected error, no action required.

Message from syslogd@pablo at Nov  8 09:53:25 ...
 kernel:[29585.573881] [Hardware Error]: CPU:2 (17:31:0) MC18_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0xdc2041000000011b

Message from syslogd@pablo at Nov  8 09:53:25 ...
 kernel:[29585.573887] [Hardware Error]: Error Addr: 0x0000000a6c12d280

Message from syslogd@pablo at Nov  8 09:53:25 ...
 kernel:[29585.573888] [Hardware Error]: IPID: 0x0000009600550f00, Syndrome: 0xc54c00040a800611

Message from syslogd@pablo at Nov  8 09:53:25 ...
 kernel:[29585.573891] [Hardware Error]: Unified Memory Controller Extended Error Code: 0

Message from syslogd@pablo at Nov  8 09:53:25 ...
 kernel:[29585.573893] [Hardware Error]: Unified Memory Controller Error: DRAM ECC error.

Message from syslogd@pablo at Nov  8 09:53:25 ...
 kernel:[29585.573895] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
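As a reading aid, the MC18_STATUS value in the log can be decoded by hand. The sketch below is a minimal illustration, not the kernel's own decoder: the bit positions in MCA_STATUS_BITS are an assumption taken from the Linux kernel's x86 MCA definitions (including the AMD-specific SyndV, CECC and UECC bits), and the function name decode_mca_status is made up for this example.

```python
# Bit positions are an assumption based on the Linux kernel's x86 MCA
# definitions (AMD-specific bits included); treat them as illustrative.
MCA_STATUS_BITS = {
    "Val": 63,    # register contains a valid error
    "Over": 62,   # an earlier error was overwritten before being read
    "UC": 61,     # uncorrected error (clear means corrected)
    "En": 60,     # error reporting was enabled
    "MiscV": 59,  # MISC register is valid
    "AddrV": 58,  # ADDR register is valid
    "PCC": 57,    # processor context corrupt
    "SyndV": 53,  # syndrome register is valid
    "CECC": 46,   # corrected by ECC
    "UECC": 45,   # ECC error that could not be corrected
}


def decode_mca_status(status: int) -> dict:
    """Map each flag name to True/False for a 64-bit MCA status value."""
    return {name: bool((status >> bit) & 1) for name, bit in MCA_STATUS_BITS.items()}


if __name__ == "__main__":
    # Status value copied from the kernel log above.
    flags = decode_mca_status(0xDC2041000000011B)
    print("set flags:", [name for name, is_set in flags.items() if is_set])
    print("error kind:", "uncorrected" if flags["UC"] else "corrected")
```

For the value in the log, this reports Val, Over, En, MiscV, AddrV, SyndV and CECC as set and UC as clear, which agrees with the kernel's own summary (Over, CE, MiscV, AddrV, SyndV, CECC): a single-bit error that ECC corrected, with at least one earlier event overwritten.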

I am developing on an AMD EPYC 7702P 64-Core Processor with 1 TB of RAM, running Debian:

Linux pablo 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2+deb10u1 (2020-06-07) x86_64 GNU/Linux

To check the memory configuration, I ran dmidecode -t memory, which prints the following:

# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 3.2.0 present.

Handle 0x0023, DMI type 16, 23 bytes
Physical Memory Array
    Location: System Board Or Motherboard
    Use: System Memory
    Error Correction Type: Multi-bit ECC
    Maximum Capacity: 2 TB
    Error Information Handle: 0x0022
    Number Of Devices: 8

Handle 0x002B, DMI type 17, 84 bytes
Memory Device
    Array Handle: 0x0023
    Error Information Handle: 0x002A
    Total Width: 72 bits
    Data Width: 64 bits
    Size: 128 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM 0
    Bank Locator: P0 CHANNEL A
    Type: DDR4
    Type Detail: Synchronous Registered (Buffered) LRDIMM
    Speed: 2933 MT/s
    Manufacturer: Samsung
    Serial Number: 03C6F701
    Asset Tag: Not Specified
    Part Number: M386AAG40MMB-CVF
    Rank: 4
    Configured Memory Speed: 2933 MT/s
    Minimum Voltage: 1.2 V
    Maximum Voltage: 1.2 V
    Configured Voltage: 1.2 V
    Memory Technology: DRAM
    Memory Operating Mode Capability: Volatile memory
    Firmware Version: Unknown
    Module Manufacturer ID: Bank 1, Hex 0xCE
    Module Product ID: Unknown
    Memory Subsystem Controller Manufacturer ID: Unknown
    Memory Subsystem Controller Product ID: Unknown
    Non-Volatile Size: None
    Volatile Size: 128 kB
    Cache Size: None
    Logical Size: None

Handle 0x002E, DMI type 17, 84 bytes
Memory Device
    Array Handle: 0x0023
    Error Information Handle: 0x002D
    Total Width: 72 bits
    Data Width: 64 bits
    Size: 128 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM 0
    Bank Locator: P0 CHANNEL B
    Type: DDR4
    Type Detail: Synchronous Registered (Buffered) LRDIMM
    Speed: 2933 MT/s
    Manufacturer: Samsung
    Serial Number: 03C6F3ED
    Asset Tag: Not Specified
    Part Number: M386AAG40MMB-CVF
    Rank: 4
    Configured Memory Speed: 2933 MT/s
    Minimum Voltage: 1.2 V
    Maximum Voltage: 1.2 V
    Configured Voltage: 1.2 V
    Memory Technology: DRAM
    Memory Operating Mode Capability: Volatile memory
    Firmware Version: Unknown
    Module Manufacturer ID: Bank 1, Hex 0xCE
    Module Product ID: Unknown
    Memory Subsystem Controller Manufacturer ID: Unknown
    Memory Subsystem Controller Product ID: Unknown
    Non-Volatile Size: None
    Volatile Size: 128 kB
    Cache Size: None
    Logical Size: None

Handle 0x0031, DMI type 17, 84 bytes
Memory Device
    Array Handle: 0x0023
    Error Information Handle: 0x0030
    Total Width: 72 bits
    Data Width: 64 bits
    Size: 128 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM 0
    Bank Locator: P0 CHANNEL C
    Type: DDR4
    Type Detail: Synchronous Registered (Buffered) LRDIMM
    Speed: 2933 MT/s
    Manufacturer: Samsung
    Serial Number: 03C6F4BA
    Asset Tag: Not Specified
    Part Number: M386AAG40MMB-CVF
    Rank: 4
    Configured Memory Speed: 2933 MT/s
    Minimum Voltage: 1.2 V
    Maximum Voltage: 1.2 V
    Configured Voltage: 1.2 V
    Memory Technology: DRAM
    Memory Operating Mode Capability: Volatile memory
    Firmware Version: Unknown
    Module Manufacturer ID: Bank 1, Hex 0xCE
    Module Product ID: Unknown
    Memory Subsystem Controller Manufacturer ID: Unknown
    Memory Subsystem Controller Product ID: Unknown
    Non-Volatile Size: None
    Volatile Size: 128 kB
    Cache Size: None
    Logical Size: None

Handle 0x0034, DMI type 17, 84 bytes
Memory Device
    Array Handle: 0x0023
    Error Information Handle: 0x0033
    Total Width: 72 bits
    Data Width: 64 bits
    Size: 128 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM 0
    Bank Locator: P0 CHANNEL D
    Type: DDR4
    Type Detail: Synchronous Registered (Buffered) LRDIMM
    Speed: 2933 MT/s
    Manufacturer: Samsung
    Serial Number: 03C6F396
    Asset Tag: Not Specified
    Part Number: M386AAG40MMB-CVF
    Rank: 4
    Configured Memory Speed: 2933 MT/s
    Minimum Voltage: 1.2 V
    Maximum Voltage: 1.2 V
    Configured Voltage: 1.2 V
    Memory Technology: DRAM
    Memory Operating Mode Capability: Volatile memory
    Firmware Version: Unknown
    Module Manufacturer ID: Bank 1, Hex 0xCE
    Module Product ID: Unknown
    Memory Subsystem Controller Manufacturer ID: Unknown
    Memory Subsystem Controller Product ID: Unknown
    Non-Volatile Size: None
    Volatile Size: 128 kB
    Cache Size: None
    Logical Size: None

Handle 0x0037, DMI type 17, 84 bytes
Memory Device
    Array Handle: 0x0023
    Error Information Handle: 0x0036
    Total Width: 72 bits
    Data Width: 64 bits
    Size: 128 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM 0
    Bank Locator: P0 CHANNEL E
    Type: DDR4
    Type Detail: Synchronous Registered (Buffered) LRDIMM
    Speed: 2933 MT/s
    Manufacturer: Samsung
    Serial Number: 03C6F67D
    Asset Tag: Not Specified
    Part Number: M386AAG40MMB-CVF
    Rank: 4
    Configured Memory Speed: 2933 MT/s
    Minimum Voltage: 1.2 V
    Maximum Voltage: 1.2 V
    Configured Voltage: 1.2 V
    Memory Technology: DRAM
    Memory Operating Mode Capability: Volatile memory
    Firmware Version: Unknown
    Module Manufacturer ID: Bank 1, Hex 0xCE
    Module Product ID: Unknown
    Memory Subsystem Controller Manufacturer ID: Unknown
    Memory Subsystem Controller Product ID: Unknown
    Non-Volatile Size: None
    Volatile Size: 128 kB
    Cache Size: None
    Logical Size: None

Handle 0x003A, DMI type 17, 84 bytes
Memory Device
    Array Handle: 0x0023
    Error Information Handle: 0x0039
    Total Width: 72 bits
    Data Width: 64 bits
    Size: 128 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM 0
    Bank Locator: P0 CHANNEL F
    Type: DDR4
    Type Detail: Synchronous Registered (Buffered) LRDIMM
    Speed: 2933 MT/s
    Manufacturer: Samsung
    Serial Number: 03C6F394
    Asset Tag: Not Specified
    Part Number: M386AAG40MMB-CVF
    Rank: 4
    Configured Memory Speed: 2933 MT/s
    Minimum Voltage: 1.2 V
    Maximum Voltage: 1.2 V
    Configured Voltage: 1.2 V
    Memory Technology: DRAM
    Memory Operating Mode Capability: Volatile memory
    Firmware Version: Unknown
    Module Manufacturer ID: Bank 1, Hex 0xCE
    Module Product ID: Unknown
    Memory Subsystem Controller Manufacturer ID: Unknown
    Memory Subsystem Controller Product ID: Unknown
    Non-Volatile Size: None
    Volatile Size: 128 kB
    Cache Size: None
    Logical Size: None

Handle 0x003D, DMI type 17, 84 bytes
Memory Device
    Array Handle: 0x0023
    Error Information Handle: 0x003C
    Total Width: 72 bits
    Data Width: 64 bits
    Size: 128 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM 0
    Bank Locator: P0 CHANNEL G
    Type: DDR4
    Type Detail: Synchronous Registered (Buffered) LRDIMM
    Speed: 2933 MT/s
    Manufacturer: Samsung
    Serial Number: 03C6F48A
    Asset Tag: Not Specified
    Part Number: M386AAG40MMB-CVF
    Rank: 4
    Configured Memory Speed: 2933 MT/s
    Minimum Voltage: 1.2 V
    Maximum Voltage: 1.2 V
    Configured Voltage: 1.2 V
    Memory Technology: DRAM
    Memory Operating Mode Capability: Volatile memory
    Firmware Version: Unknown
    Module Manufacturer ID: Bank 1, Hex 0xCE
    Module Product ID: Unknown
    Memory Subsystem Controller Manufacturer ID: Unknown
    Memory Subsystem Controller Product ID: Unknown
    Non-Volatile Size: None
    Volatile Size: 128 kB
    Cache Size: None
    Logical Size: None

Handle 0x0040, DMI type 17, 84 bytes
Memory Device
    Array Handle: 0x0023
    Error Information Handle: 0x003F
    Total Width: 72 bits
    Data Width: 64 bits
    Size: 128 GB
    Form Factor: DIMM
    Set: None
    Locator: DIMM 0
    Bank Locator: P0 CHANNEL H
    Type: DDR4
    Type Detail: Synchronous Registered (Buffered) LRDIMM
    Speed: 2933 MT/s
    Manufacturer: Samsung
    Serial Number: 03C6F3FB
    Asset Tag: Not Specified
    Part Number: M386AAG40MMB-CVF
    Rank: 4
    Configured Memory Speed: 2933 MT/s
    Minimum Voltage: 1.2 V
    Maximum Voltage: 1.2 V
    Configured Voltage: 1.2 V
    Memory Technology: DRAM
    Memory Operating Mode Capability: Volatile memory
    Firmware Version: Unknown
    Module Manufacturer ID: Bank 1, Hex 0xCE
    Module Product ID: Unknown
    Memory Subsystem Controller Manufacturer ID: Unknown
    Memory Subsystem Controller Product ID: Unknown
    Non-Volatile Size: None
    Volatile Size: 128 kB
    Cache Size: None
    Logical Size: None

I don't know where this DRAM ECC error comes from. Could my motherboard, CPU model, or CPU version be incompatible with the Intel compiler SDK?

These errors occur roughly every 5 minutes during the run.
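To put a number on "roughly every 5 minutes", and to see whether the same physical address keeps coming back, the kernel messages can be summarized with a short script. This is only a sketch: it assumes the log lines look like the ones quoted above, and the names ADDR_RE and summarize_ecc_events are invented for this example.

```python
import re

# Matches lines such as:
#  kernel:[29585.573887] [Hardware Error]: Error Addr: 0x0000000a6c12d280
ADDR_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+\[Hardware Error\]: Error Addr: (0x[0-9a-fA-F]+)"
)


def summarize_ecc_events(log_text: str):
    """Return (events, gaps).

    events: list of (seconds_since_boot, physical_address) tuples.
    gaps:   seconds between consecutive events.
    """
    events = [(float(ts), int(addr, 16)) for ts, addr in ADDR_RE.findall(log_text)]
    gaps = [later[0] - earlier[0] for earlier, later in zip(events, events[1:])]
    return events, gaps


if __name__ == "__main__":
    # One line copied from the log above; feed in full dmesg output for more.
    sample = " kernel:[29585.573887] [Hardware Error]: Error Addr: 0x0000000a6c12d280\n"
    events, gaps = summarize_ecc_events(sample)
    for ts, addr in events:
        print(f"{ts:12.3f}s  addr={addr:#x}")
    if gaps:
        print(f"mean gap: {sum(gaps) / len(gaps):.1f}s")
    print("distinct addresses:", len({addr for _, addr in events}))
```

If the reported address is the same or nearly the same across events, that points to one failing location on one module; errors spread over many addresses would suggest a broader cause.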

I am using Intel compiler version compilers_and_libraries_2020.1.217.

I get the same error messages when I compile with the Open MPI package from the official Debian 10 repository instead.

There may be a BIOS option that needs to be changed, but I am not sure.

If anyone has an idea how to solve this, I would be glad to hear it.

Answer 1

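It looks like there is a problem with your memory. This is a hardware issue. I suggest running memtest for a long time, or swapping the memory sticks and then trying your application again. Your application probably allocates enough memory that it ends up touching the faulty region.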
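Before pulling modules, it can help to see which memory controller is accumulating corrected errors. The sketch below reads the counters that the Linux EDAC subsystem exposes in sysfs. It assumes an EDAC driver for this platform is loaded so that /sys/devices/system/edac/mc/mc*/ce_count and ue_count exist; the function name read_edac_counts is made up for this example.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int, "ue": int}} for each mc* directory.

    ce = corrected error count, ue = uncorrected error count.
    """
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # counter missing or unreadable; skip this controller
        counts[mc.name] = {"ce": ce, "ue": ue}
    return counts


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("no EDAC memory controllers found (is an EDAC driver loaded?)")
    for name, c in counts.items():
        print(f"{name}: corrected={c['ce']} uncorrected={c['ue']}")
```

Running it before and after the MPI job and comparing the numbers shows which controller's corrected count grows, which narrows down the module to test or swap first.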
