radeon error: GPU lockup: ring 0 stalled for more than x msec


I have a computer with a fresh install of Debian Buster. The GPU is a Radeon FirePro W2100. After a few hours of use, the machine suddenly froze and the display switched to "white noise", leaving the machine unusable.

The logs show many errors like the following:

kernel: radeon 0000:65:00.0: ring 0 stalled for more than 10240msec
kernel: radeon 0000:65:00.0: GPU lockup (current fence id 0x0000000000039bff last fence id 0x0000000000039c42 on ring 0)
kernel: radeon 0000:65:00.0: failed to get a new IB (-35)
kernel: [drm:ffffffff816219d0] *ERROR* Couldn't update BO_VA (-35)
kernel: radeon 0000:65:00.0: failed to get a new IB (-35)

And then:

kernel: radeon 0000:65:00.0: ring 0 stalled for more than 10032msec
kernel: radeon 0000:65:00.0: GPU lockup (current fence id 0x0000000000039bff last fence id 0x0000000000039c42 on ring 0)

What do these errors mean, and how can I fix them?

Is this a hardware problem or a software problem?

Answer 1

I get radeon 0000:04:00.0: ring 0 stalled for more than 10240msec, which locks up my [AMD/ATI] RV620 GL [FirePro 2450], when I run the Opera web browser under Ubuntu 20.04.5 LTS for a few minutes. Firefox and other programs cause no problems; only Opera does.

[128524.943553] radeon 0000:04:00.0: ring 0 stalled for more than 10240msec
[128524.943565] radeon 0000:04:00.0: GPU lockup (current fence id 0x000000000029caf6 last fence id 0x000000000029cafc on ring 0)
[128524.955392] radeon 0000:04:00.0: Saved 185 dwords of commands on ring 0.
[128524.955409] radeon 0000:04:00.0: GPU softreset: 0x00000009
[128524.955413] radeon 0000:04:00.0:   R_008010_GRBM_STATUS      = 0xA2303030
[128524.955417] radeon 0000:04:00.0:   R_008014_GRBM_STATUS2     = 0x00000003
[128524.955420] radeon 0000:04:00.0:   R_000E50_SRBM_STATUS      = 0x200010C0
[128524.955423] radeon 0000:04:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[128524.955426] radeon 0000:04:00.0:   R_008678_CP_STALLED_STAT2 = 0x00008002
[128524.955429] radeon 0000:04:00.0:   R_00867C_CP_BUSY_STAT     = 0x00008086
[128524.955432] radeon 0000:04:00.0:   R_008680_CP_STAT          = 0x80018645
[128524.955435] radeon 0000:04:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[128525.013038] radeon 0000:04:00.0: R_008020_GRBM_SOFT_RESET=0x00007FEF
[128525.013097] radeon 0000:04:00.0: SRBM_SOFT_RESET=0x00000100
[128525.015187] radeon 0000:04:00.0:   R_008010_GRBM_STATUS      = 0xA0003030
[128525.015191] radeon 0000:04:00.0:   R_008014_GRBM_STATUS2     = 0x00000003
[128525.015195] radeon 0000:04:00.0:   R_000E50_SRBM_STATUS      = 0x200080C0
[128525.015198] radeon 0000:04:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[128525.015201] radeon 0000:04:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
[128525.015204] radeon 0000:04:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
[128525.015207] radeon 0000:04:00.0:   R_008680_CP_STAT          = 0x80100000
[128525.015210] radeon 0000:04:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[128525.015220] radeon 0000:04:00.0: GPU reset succeeded, trying to resume
[128525.031584] [drm] PCIE gen 2 link speeds already enabled
[128525.034184] [drm] PCIE GART of 512M enabled (table at 0x0000000000142000).
[128525.034222] radeon 0000:04:00.0: WB enabled
[128525.034224] radeon 0000:04:00.0: fence driver on ring 0 use gpu addr 0x0000000010000c00
[128525.034579] radeon 0000:04:00.0: fence driver on ring 5 use gpu addr 0x00000000000521d0
[128525.034797] debugfs: File 'radeon_ring_gfx' in directory '0' already present!
[128525.066237] [drm] ring test on 0 succeeded in 1 usecs
[128525.066242] debugfs: File 'radeon_ring_uvd' in directory '0' already present!
[128525.240884] [drm] ring test on 5 succeeded in 1 usecs
[128525.240893] [drm] UVD initialized successfully.
[128535.695467] radeon 0000:04:00.0: ring 0 stalled for more than 10456msec
[128535.695479] radeon 0000:04:00.0: GPU lockup (current fence id 0x000000000029caf8 last fence id 0x000000000029cafc on ring 0)
[128535.697433] [drm:r600_ib_test [radeon]] *ERROR* radeon: fence wait failed (-35).
[128535.697551] [drm:radeon_ib_ring_tests [radeon]] *ERROR* radeon: failed testing IB on GFX ring (-35).
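
To make the lockup lines above easier to read, here is a minimal Python sketch that extracts the two fence ids and reports how many submitted fences the GPU had not yet signalled. It assumes that "current fence id" is the last fence the GPU completed and "last fence id" is the last one the driver emitted, which is my understanding of the radeon driver rather than something stated in this thread. The function name parse_lockup is made up for illustration.

```python
import re

# Matches the radeon "GPU lockup" message format seen in the logs above.
LOCKUP_RE = re.compile(
    r"radeon (?P<dev>[0-9a-f:.]+): GPU lockup \(current fence id "
    r"0x(?P<cur>[0-9a-f]+) last fence id 0x(?P<last>[0-9a-f]+) "
    r"on ring (?P<ring>\d+)\)"
)


def parse_lockup(line):
    """Return (device, ring, current, last, pending), or None if no match.

    'pending' is last - current, i.e. fences emitted but not yet signalled
    (assumption: see the note above about what the two ids mean).
    """
    m = LOCKUP_RE.search(line)
    if m is None:
        return None
    cur = int(m.group("cur"), 16)
    last = int(m.group("last"), 16)
    return m.group("dev"), int(m.group("ring")), cur, last, last - cur


if __name__ == "__main__":
    sample = (
        "[128524.943565] radeon 0000:04:00.0: GPU lockup "
        "(current fence id 0x000000000029caf6 "
        "last fence id 0x000000000029cafc on ring 0)"
    )
    print(parse_lockup(sample))
```

A small pending count that never shrinks across repeated messages suggests the ring is stuck rather than merely slow, but on its own it does not say whether the cause is hardware or software.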

Answer 2

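This may actually be a hardware failure. I get this on my PC when gaming with an AMD ATI Radeon HD 8670 GPU on Arch Linux with kernel 6.3.1-zen1-1-zen. It's an HP EliteDesk, for reference. I tried dropping the kernel to the latest LTS and to the LTS before that (5.10, iirc), but it still crashes after a few minutes of gaming.

I happen to have a Dell home server running the same OS and kernel (Arch with zen), and it has an AMD ATI Radeon HD 8570 GPU. It is essentially the same card, but with slightly less DDR5 onboard, iirc.

Well, I swapped the graphics cards (now the 8570 is in the HP motherboard and the 8670 is in the Dell), and I have no problems gaming on the 8570.

So, with all the same hardware, software, firmware, and drivers, the 8570 works and the 8670 does not. All I did was swap the cards; I didn't have to reinstall drivers or anything else. I should also note that games used to work great on the 8670, so I guess these cards just go bad at some point.

So I know hardware failures are rare, but if this isn't one, I don't know what it is. Sorry to be the bearer of bad news, if that's what this is. I don't use the home server for gaming, so I'm fine with making this switch.

This is one of the dmesg logs from the 8670 crashing on my HP: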

...
[32776.529276] radeon 0000:0b:00.0: ring 0 stalled for more than 28224msec
[32776.529282] radeon 0000:0b:00.0: GPU lockup (current fence id 0x0000000000108667 last fence id 0x00000000001086ba on ring 0)
[32776.673264] radeon 0000:0b:00.0: ring 3 stalled for more than 28228msec
[32776.673268] radeon 0000:0b:00.0: GPU lockup (current fence id 0x00000000000380db last fence id 0x0000000000038154 on ring 3)
[32777.033251] radeon 0000:0b:00.0: ring 0 stalled for more than 28728msec
[32777.033259] radeon 0000:0b:00.0: GPU lockup (current fence id 0x0000000000108667 last fence id 0x00000000001086bb on ring 0)
[32777.177236] radeon 0000:0b:00.0: ring 3 stalled for more than 28732msec
[32777.177240] radeon 0000:0b:00.0: GPU lockup (current fence id 0x00000000000380db last fence id 0x0000000000038156 on ring 3)
[32777.537217] radeon 0000:0b:00.0: ring 0 stalled for more than 29232msec
[32777.537221] radeon 0000:0b:00.0: GPU lockup (current fence id 0x0000000000108667 last fence id 0x00000000001086bc on ring 0)
[32777.681206] radeon 0000:0b:00.0: ring 3 stalled for more than 29236msec
[32777.681209] radeon 0000:0b:00.0: GPU lockup (current fence id 0x00000000000380db last fence id 0x0000000000038159 on ring 3)
[32778.041191] radeon 0000:0b:00.0: ring 0 stalled for more than 29736msec
[32778.041194] radeon 0000:0b:00.0: GPU lockup (current fence id 0x0000000000108667 last fence id 0x00000000001086bd on ring 0)
[32778.185183] radeon 0000:0b:00.0: ring 3 stalled for more than 29740msec
[32778.185186] radeon 0000:0b:00.0: GPU lockup (current fence id 0x00000000000380db last fence id 0x000000000003815a on ring 3)
[32779.776047] BUG: unable to handle page fault for address: ffffbdd0c13e9ffc
[32779.776052] #PF: supervisor read access in kernel mode
[32779.776054] #PF: error_code(0x0000) - not-present page
[32779.776055] PGD 100000067 P4D 100000067 PUD 0 
[32779.776058] Oops: 0000 [#1] PREEMPT SMP NOPTI
[32779.776061] CPU: 8 PID: 157222 Comm: openmw Tainted: G S                 6.1.12-zen1-1-zen #1 f86a89fe584efe7bcf920c69db3728bed4671799
[32779.776064] Hardware name: HP HP EliteDesk 705 G5 SFF/8618, BIOS R09 Ver. 02.02.02 11/15/2019
[32779.776065] RIP: 0010:radeon_ring_backup+0xc2/0x160 [radeon]
[32779.776196] Code: 49 c1 e6 02 4c 89 f7 e8 9c cc ab f5 49 89 45 00 48 89 c2 48 85 c0 74 5f 48 8b 4b 10 41 8d 47 01 45 89 ff 23 43 5c 4a 8d 34 b9 <8b> 36 89 32 41 83 fc 01 74 29 ba 04 00 00 00 eb 04 48 8b 4b 10 8d
[32779.776197] RSP: 0018:ffffbdcccfc5bbd8 EFLAGS: 00010246
[32779.776199] RAX: 0000000000000000 RBX: ffff9460e434d620 RCX: ffffbdccc13ea000
[32779.776201] RDX: ffff9465dbd00000 RSI: ffffbdd0c13e9ffc RDI: 00000000000392d7
[32779.776202] RBP: ffff9460e434d600 R08: 00000000000392d0 R09: 0000000000000006
[32779.776203] R10: fffff6a4d96f4000 R11: 000000000000577f R12: 000000000003dd71
[32779.776204] R13: ffffbdcccfc5bc50 R14: 00000000000f75c4 R15: 00000000ffffffff
[32779.776205] FS:  00007fbd98eb96c0(0000) GS:ffff94677ec00000(0000) knlGS:0000000000000000
[32779.776207] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[32779.776208] CR2: ffffbdd0c13e9ffc CR3: 0000000490706000 CR4: 0000000000350ee0
[32779.776210] Call Trace:
[32779.776212]  <TASK>
[32779.776213]  radeon_gpu_reset+0xf7/0x2f0 [radeon de372908aa1ea62ea129bf192d817412c67e128b]
[32779.776243]  radeon_gem_wait_idle_ioctl+0xb8/0x100 [radeon de372908aa1ea62ea129bf192d817412c67e128b]
[32779.776273]  ? radeon_gem_busy_ioctl+0xb0/0xb0 [radeon de372908aa1ea62ea129bf192d817412c67e128b]
[32779.776302]  drm_ioctl_kernel+0xcd/0x170
[32779.776306]  drm_ioctl+0x1eb/0x450
[32779.776308]  ? radeon_gem_busy_ioctl+0xb0/0xb0 [radeon de372908aa1ea62ea129bf192d817412c67e128b]
[32779.776337]  radeon_drm_ioctl+0x4d/0x80 [radeon de372908aa1ea62ea129bf192d817412c67e128b]
[32779.776364]  __x64_sys_ioctl+0x94/0xd0
[32779.776369]  do_syscall_64+0x5f/0x90
[32779.776373]  ? do_syscall_64+0x6b/0x90
[32779.776375]  ? syscall_exit_to_user_mode+0x2c/0x1d0
[32779.776378]  ? syscall_exit_to_user_mode+0x2c/0x1d0
[32779.776380]  ? do_syscall_64+0x6b/0x90
[32779.776382]  ? syscall_exit_to_user_mode+0x2c/0x1d0
[32779.776384]  ? do_syscall_64+0x6b/0x90
[32779.776385]  ? do_syscall_64+0x6b/0x90
[32779.776387]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[32779.776390] RIP: 0033:0x7fbdb591553f
[32779.776418] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[32779.776420] RSP: 002b:00007fbd98eb80f0 EFLAGS: 00200246 ORIG_RAX: 0000000000000010
[32779.776422] RAX: ffffffffffffffda RBX: 00007fbd7d74eb80 RCX: 00007fbdb591553f
[32779.776423] RDX: 00007fbd98eb8190 RSI: 0000000040086464 RDI: 0000000000000010
[32779.776425] RBP: 00007fbd98eb8190 R08: 0000000000000000 R09: ffffffffffffffff
[32779.776426] R10: 0000000000000000 R11: 0000000000200246 R12: 0000000040086464
[32779.776427] R13: 0000000000000010 R14: 000055d27885abd0 R15: 000055d278a375d8
[32779.776429]  </TASK>
[32779.776430] Modules linked in: rfcomm xt_nat veth nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink br_netfilter bridge stp llc rpcsec_gss_krb5 rpcrdma rdma_cm iw_cm nfsv4 ib_cm dns_resolver ib_core nfs fscache wireguard netfs curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel overlay cmac algif_hash algif_skcipher af_alg bnep isofs cdrom amdgpu gpu_sched drm_buddy squashfs vfat fat iwlmvm mac80211 snd_hda_codec_conexant snd_hda_codec_generic libarc4 ledtrig_audio snd_hda_codec_hdmi intel_rapl_msr radeon snd_hda_intel intel_rapl_common btusb edac_mce_amd btrtl snd_intel_dspcfg btbcm snd_intel_sdw_acpi drm_ttm_helper kvm_amd snd_hda_codec btintel iwlwifi snd_hda_core hp_wmi btmtk ttm snd_hwdep sparse_keymap kvm platform_profile wmi_bmof sp5100_tco bluetooth snd_pcm irqbypass r8169 ucsi_acpi drm_display_helper video cfg80211 psmouse rapl typec_ucsi pcspkr snd_timer realtek k10temp i2c_piix4 ecdh_generic cec
[32779.776479]  ipmi_devintf typec snd mdio_devres soundcore ipmi_msghandler ip6t_REJECT rfkill libphy roles nf_reject_ipv6 joydev wmi mousedev gpio_amdpt xt_hl gpio_generic acpi_cpufreq ip6_tables ip6t_rt mac_hid ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog xt_multiport nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables libcrc32c nfnetlink nfsd auth_rpcgss nfs_acl lockd grace sg crypto_user sunrpc loop fuse ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 dm_crypt cbc encrypted_keys trusted asn1_encoder tee usbhid uas usb_storage dm_mod crct10dif_pclmul crc32_pclmul crc32c_intel serio_raw polyval_clmulni atkbd polyval_generic gf128mul libps2 ghash_clmulni_intel vivaldi_fmap sha512_ssse3 nvme aesni_intel crypto_simd nvme_core ccp cryptd xhci_pci i8042 xhci_pci_renesas nvme_common serio
[32779.776522] CR2: ffffbdd0c13e9ffc
[32779.776523] ---[ end trace 0000000000000000 ]---
[32779.776524] RIP: 0010:radeon_ring_backup+0xc2/0x160 [radeon]
[32779.776554] Code: 49 c1 e6 02 4c 89 f7 e8 9c cc ab f5 49 89 45 00 48 89 c2 48 85 c0 74 5f 48 8b 4b 10 41 8d 47 01 45 89 ff 23 43 5c 4a 8d 34 b9 <8b> 36 89 32 41 83 fc 01 74 29 ba 04 00 00 00 eb 04 48 8b 4b 10 8d
[32779.776555] RSP: 0018:ffffbdcccfc5bbd8 EFLAGS: 00010246
[32779.776557] RAX: 0000000000000000 RBX: ffff9460e434d620 RCX: ffffbdccc13ea000
[32779.776558] RDX: ffff9465dbd00000 RSI: ffffbdd0c13e9ffc RDI: 00000000000392d7
[32779.776559] RBP: ffff9460e434d600 R08: 00000000000392d0 R09: 0000000000000006
[32779.776560] R10: fffff6a4d96f4000 R11: 000000000000577f R12: 000000000003dd71
[32779.776561] R13: ffffbdcccfc5bc50 R14: 00000000000f75c4 R15: 00000000ffffffff
[32779.776562] FS:  00007fbd98eb96c0(0000) GS:ffff94677ec00000(0000) knlGS:0000000000000000
[32779.776563] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[32779.776565] CR2: ffffbdd0c13e9ffc CR3: 0000000490706000 CR4: 0000000000350ee0
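
One detail worth pulling out of the log above: on each ring the "current fence id" stays the same in every message (0x108667 on ring 0, 0x380db on ring 3) while the "last fence id" keeps growing. A rough Python sketch for checking this on your own dmesg output follows. The helper name stuck_rings is hypothetical, and the interpretation (an unchanged current fence id means the GPU is making no progress on that ring) is an assumption about the driver, not something the kernel states.

```python
import re
from collections import defaultdict

# Captures: current fence id, last fence id, ring number.
LOCKUP_RE = re.compile(
    r"GPU lockup \(current fence id 0x([0-9a-f]+) "
    r"last fence id 0x([0-9a-f]+) on ring (\d+)\)"
)


def stuck_rings(lines):
    """Map ring -> True if its current fence id never changed.

    A ring needs at least two lockup messages to be judged; with only
    one message it is reported as False (not enough evidence).
    """
    seen = defaultdict(list)
    for line in lines:
        m = LOCKUP_RE.search(line)
        if m:
            seen[int(m.group(3))].append(int(m.group(1), 16))
    return {
        ring: len(ids) > 1 and len(set(ids)) == 1
        for ring, ids in seen.items()
    }


if __name__ == "__main__":
    # Four lines copied from the dmesg log above.
    sample = [
        "radeon 0000:0b:00.0: GPU lockup (current fence id 0x0000000000108667 last fence id 0x00000000001086ba on ring 0)",
        "radeon 0000:0b:00.0: GPU lockup (current fence id 0x00000000000380db last fence id 0x0000000000038154 on ring 3)",
        "radeon 0000:0b:00.0: GPU lockup (current fence id 0x0000000000108667 last fence id 0x00000000001086bb on ring 0)",
        "radeon 0000:0b:00.0: GPU lockup (current fence id 0x00000000000380db last fence id 0x0000000000038156 on ring 3)",
    ]
    print(stuck_rings(sample))
```

This only summarises what the messages already say; it cannot tell a failing card apart from a driver bug, which is why swapping cards was the more convincing test here.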
