I installed an OCZ-ARC100 in a Dell PowerEdge T105. When the system (CentOS 7) boots, the kernel reports BMDMA errors for that drive:
jun 25 15:40:21 myhost kernel: ata4.00: ATA-8: OCZ-ARC100, 1.01, max UDMA/133
jun 25 15:40:21 myhost kernel: ata4.00: 234441648 sectors, multi 1: LBA48 NCQ (depth 0/32)
jun 25 15:40:21 myhost kernel: ata4.00: configured for UDMA/133
jun 25 15:40:21 myhost kernel: scsi 3:0:0:0: Direct-Access ATA OCZ-ARC100 1.01 PQ: 0 ANSI: 5
jun 25 15:40:21 myhost kernel: sd 3:0:0:0: [sda] 234441648 512-byte logical blocks: (120 GB/111 GiB)
jun 25 15:40:21 myhost kernel: sd 3:0:0:0: [sda] Write Protect is off
jun 25 15:40:21 myhost kernel: sd 3:0:0:0: [sda] Mode Sense: 00 3a 00 00
jun 25 15:40:21 myhost kernel: sd 3:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
jun 25 15:40:21 myhost kernel: sda: sda1 sda2 sda3
jun 25 15:40:21 myhost kernel: sd 3:0:0:0: [sda] Attached SCSI disk
jun 25 15:40:21 myhost kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
jun 25 15:40:21 myhost kernel: ata4.00: BMDMA stat 0x5
jun 25 15:40:21 myhost kernel: ata4.00: failed command: READ DMA
jun 25 15:40:21 myhost kernel: ata4.00: cmd c8/00:08:00:4b:f9/00:00:00:00:00/ed tag 0 dma 4096 in
res 51/04:08:00:4b:f9/00:00:00:00:00/ed Emask 0x1 (device error)
jun 25 15:40:21 myhost kernel: ata4.00: status: { DRDY ERR }
jun 25 15:40:21 myhost kernel: ata4.00: error: { ABRT }
jun 25 15:40:21 myhost kernel: ata4.00: configured for UDMA/133
jun 25 15:40:21 myhost kernel: ata4: EH complete
...
jun 25 15:40:22 myhost kernel: ata4.00: configured for UDMA/133
jun 25 15:40:22 myhost kernel: ata4: EH complete
jun 25 15:40:22 myhost kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
jun 25 15:40:22 myhost kernel: ata4.00: BMDMA stat 0x5
jun 25 15:40:22 myhost kernel: ata4.00: failed command: READ DMA
jun 25 15:40:22 myhost kernel: ata4.00: cmd c8/00:08:d0:47:f9/00:00:00:00:00/ed tag 0 dma 4096 in
res 51/04:08:d0:47:f9/00:00:00:00:00/ed Emask 0x1 (device error)
jun 25 15:40:22 myhost kernel: ata4.00: status: { DRDY ERR }
jun 25 15:40:22 myhost kernel: ata4.00: error: { ABRT }
jun 25 15:40:22 myhost kernel: ata4.00: configured for UDMA/133
jun 25 15:40:22 myhost kernel: ata4: EH complete
jun 25 15:40:22 myhost kernel: ata4.00: limiting speed to UDMA/100:PIO4
jun 25 15:40:22 myhost kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
jun 25 15:40:22 myhost kernel: ata4.00: BMDMA stat 0x5
jun 25 15:40:22 myhost kernel: ata4.00: failed command: READ DMA
jun 25 15:40:22 myhost kernel: ata4.00: cmd c8/00:08:f8:47:f9/00:00:00:00:00/ed tag 0 dma 4096 in
res 51/04:08:f8:47:f9/00:00:00:00:00/ed Emask 0x1 (device error)
jun 25 15:40:22 myhost kernel: ata4.00: status: { DRDY ERR }
jun 25 15:40:22 myhost kernel: ata4.00: error: { ABRT }
jun 25 15:40:22 myhost kernel: ata4: hard resetting link
jun 25 15:40:22 myhost kernel: ata4: nv: skipping hardreset on occupied port
jun 25 15:40:22 myhost kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
jun 25 15:40:22 myhost kernel: ata4.00: configured for UDMA/100
jun 25 15:40:22 myhost kernel: ata4: EH complete
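To make the `cmd`/`res` lines above easier to read, here is a minimal Python sketch that decodes them. It assumes the field order that, as far as I understand it, libata uses in its error report (command or status, feature or error, sector count, three LBA bytes, five "hob" bytes, device). It also assumes a 28-bit command such as READ DMA (0xC8), where the low nibble of the device byte holds LBA bits 24 to 27. The name `parse_taskfile` and the bit tables are my own and list only the bits I believe libata prints by name.

```python
import re

# Named bits of the ATA status and error registers (subset, chosen by me).
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT", 0x01: "AMNF"}

# "cmd" or "res" followed by twelve hex bytes separated by "/" or ":".
TASKFILE = re.compile(r"\b(cmd|res)\s+((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})")


def decode_bits(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for bit, name in table.items() if value & bit]


def parse_taskfile(line):
    """Decode one libata 'cmd' or 'res' line; return None if it is neither."""
    m = TASKFILE.search(line)
    if m is None:
        return None
    kind = m.group(1)
    vals = [int(x, 16) for x in re.split(r"[/:]", m.group(2))]
    first, second, nsect, lbal, lbam, lbah = vals[:6]
    device = vals[11]
    # LBA28 layout: valid for 28-bit commands only (assumption stated above).
    lba28 = ((device & 0x0F) << 24) | (lbah << 16) | (lbam << 8) | lbal
    out = {"kind": kind, "nsect": nsect, "lba28": lba28}
    if kind == "cmd":
        out["command"] = first
        out["feature"] = second
    else:
        # In a 'res' line the first two bytes are the status and error registers.
        out["status"] = decode_bits(first, STATUS_BITS)
        out["error"] = decode_bits(second, ERROR_BITS)
    return out


if __name__ == "__main__":
    print(parse_taskfile("cmd c8/00:08:00:4b:f9/00:00:00:00:00/ed tag 0 dma 4096 in"))
    print(parse_taskfile("res 51/04:08:00:4b:f9/00:00:00:00:00/ed Emask 0x1 (device error)"))
```

Under these assumptions, the first failed READ DMA decodes to LBA 234441472 with a sector count of 8, which can be compared against the 234441648 sectors the kernel reports for the drive. The `res` line decodes to DRDY and ERR with ABRT, matching what the kernel prints.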
I then connected the OCZ to a SATA-to-USB2 adapter on another machine and ran smartctl:
smartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.4.6-gentoo-nvidia] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: OCZ-ARC100
Serial Number: A22L0061518000567
LU WWN Device Id: 5 e83a97 100061d69
Firmware Version: 1.01
User Capacity: 120.034.123.776 bytes [120 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ATA8-ACS (minor revision not indicated)
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Sun Jun 25 15:28:55 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x1d) SMART execute Offline immediate.
No Auto Offline data collection support.
Abort Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x00) Error logging NOT supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 0) minutes.
Extended self-test routine
recommended polling time: ( 0) minutes.
SMART Attributes Data Structure revision number: 18
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0000 000 000 000 Old_age Offline - 0
9 Power_On_Hours 0x0000 100 100 000 Old_age Offline - 252
12 Power_Cycle_Count 0x0000 100 100 000 Old_age Offline - 84
171 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 39711824
174 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 10
195 Hardware_ECC_Recovered 0x0000 100 100 000 Old_age Offline - 0
196 Reallocated_Event_Count 0x0000 100 100 000 Old_age Offline - 0
197 Current_Pending_Sector 0x0000 100 100 000 Old_age Offline - 0
208 Unknown_SSD_Attribute 0x0000 100 100 000 Old_age Offline - 5
210 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
224 Unknown_SSD_Attribute 0x0000 100 100 000 Old_age Offline - 1
233 Media_Wearout_Indicator 0x0000 100 100 000 Old_age Offline - 100
241 Total_LBAs_Written 0x0000 100 100 000 Old_age Offline - 92
242 Total_LBAs_Read 0x0000 100 100 000 Old_age Offline - 221
249 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 3316691
SMART Error Log Version: 1
No Errors Logged
Warning! SMART Self-Test Log Structure error: invalid SMART checksum.
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
Selective Self-tests/Logging not supported
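There is apparently no trace of errors here. I did not pay much attention to the BMDMA errors themselves; at first I simply assumed the drive was dying, but now I wonder whether that diagnosis is correct. I was also misled by the fact that after replacing the drive with a new one (a Western Digital Blue 500GB), the system works without errors. The difference, though, is that the OCZ is really blazing fast in comparison.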
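For anyone who wants to check the attribute table mechanically rather than by eye, here is a rough Python sketch. It assumes the ten-column layout shown in the output above. The function names and the choice of attributes to watch (5, 196 and 197) are my own, not something smartctl defines.

```python
import re

# One attribute row: ID, name, flag, value, worst, thresh, type, updated,
# when_failed, raw value (only the leading integer of the raw value is kept).
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]{4})\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)


def parse_attributes(text):
    """Map attribute ID -> dict with name, normalized value and raw value."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            attrs[int(m.group(1))] = {
                "name": m.group(2),
                "value": int(m.group(4)),
                "raw": int(m.group(10)),
            }
    return attrs


def suspicious(attrs, watched=(5, 196, 197)):
    """Return the watched attributes whose raw value is not zero."""
    return {i: attrs[i] for i in watched if i in attrs and attrs[i]["raw"] != 0}


if __name__ == "__main__":
    sample = """\
5 Reallocated_Sector_Ct 0x0000 000 000 000 Old_age Offline - 0
9 Power_On_Hours 0x0000 100 100 000 Old_age Offline - 252
196 Reallocated_Event_Count 0x0000 100 100 000 Old_age Offline - 0
197 Current_Pending_Sector 0x0000 100 100 000 Old_age Offline - 0
"""
    print(suspicious(parse_attributes(sample)))
```

For the rows pasted above, none of the watched attributes has a non-zero raw value, which is consistent with my reading that SMART shows no media errors.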
How should I interpret the errors above (apparently DMA errors), and how can I fix this? For example, should I flash the OCZ firmware, or use specific kernel parameters?
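By the way, the BIOS forces the SATA disks to use the ATA bus option; I cannot switch it to AHCI, for example. I suspect this is because of the CD/DVD drive attached to the SATA bus, or because of the Fusion MPT hardware RAID adapter. Either way, I (literally) have no option here, but at least with the WD drive this does not seem to be a problem.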
Edit: I ran a drive self-test on the server itself, with the following results:
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-514.21.1.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 18
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0000 000 000 000 Old_age Offline - 0
9 Power_On_Hours 0x0000 100 100 000 Old_age Offline - 253
12 Power_Cycle_Count 0x0000 100 100 000 Old_age Offline - 85
171 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 39711824
174 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 10
195 Hardware_ECC_Recovered 0x0000 100 100 000 Old_age Offline - 0
196 Reallocated_Event_Count 0x0000 100 100 000 Old_age Offline - 0
197 Current_Pending_Sector 0x0000 100 100 000 Old_age Offline - 0
208 Unknown_SSD_Attribute 0x0000 100 100 000 Old_age Offline - 5
210 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
224 Unknown_SSD_Attribute 0x0000 100 100 000 Old_age Offline - 1
233 Media_Wearout_Indicator 0x0000 100 100 000 Old_age Offline - 100
241 Total_LBAs_Written 0x0000 100 100 000 Old_age Offline - 92
242 Total_LBAs_Read 0x0000 100 100 000 Old_age Offline - 222
249 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 3316768
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 253 -
Moreover, following the tip I was given, since smartctl tests the drive internally, I think I can safely assume that the drive itself is not faulty. I'll investigate further...
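If the self-test gets repeated a few times, the log rows can be checked with a small Python sketch like the one below. It assumes rows shaped like the one in my output above, with a two-word test description such as "Short offline". The names `parse_selftest_log` and `all_passed` are hypothetical helpers of mine.

```python
import re

# One self-test log row: "# N  <test>  <status>  NN%  <hours>  <LBA or ->".
SELFTEST_ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+"
    r"(?P<test>(?:Short|Extended|Conveyance|Selective)\s+(?:offline|captive))\s+"
    r"(?P<status>.+?)\s+"
    r"(?P<remaining>\d+)%\s+"
    r"(?P<hours>\d+)\s+"
    r"(?P<lba>\S+)\s*$"
)


def parse_selftest_log(text):
    """Return one dict per self-test row found in smartctl output."""
    rows = []
    for line in text.splitlines():
        m = SELFTEST_ROW.match(line.strip())
        if m:
            rows.append({
                "num": int(m.group("num")),
                "test": m.group("test"),
                "status": m.group("status"),
                "remaining": int(m.group("remaining")),
                "hours": int(m.group("hours")),
                "lba": m.group("lba"),
            })
    return rows


def all_passed(rows):
    """True only if at least one test was logged and every one completed cleanly."""
    return bool(rows) and all(r["status"] == "Completed without error" for r in rows)


if __name__ == "__main__":
    log = "# 1 Short offline Completed without error 00% 253 -"
    print(all_passed(parse_selftest_log(log)))
```

This only restates what the log already says. It does not exercise the controller or cabling, so a clean result here says nothing about the path between the drive and the host.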