GPU seems to crash Arch on wake-up

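Symptoms:
After I close the lid, the laptop goes to sleep successfully (I can tell because the fans stop). On wake-up, fairly often (about 1 time in 10), everything freezes. This happens when running X with ratpoison, and also when switching between two virtual terminals (running without X, using ctrl+alt+F2 to switch).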

How should this issue be investigated? Other data:

Operating system:

[miro@katana ~]$ uname -a
Linux katana 5.19.2-arch1-1 #1 SMP PREEMPT_DYNAMIC Wed, 17 Aug 2022 13:48:51 +0000 x86_64 GNU/Linux

Relevant part of dmesg:

[    6.340112] nvidia-nvlink: Nvlink Core is being initialized, major device number 510

[    6.340135] traps: Missing ENDBR: _nv011437rm+0x0/0x10 [nvidia]
[    6.340371] ------------[ cut here ]------------
[    6.340371] kernel BUG at arch/x86/kernel/traps.c:253!
[    6.340375] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[    6.340379] CPU: 13 PID: 328 Comm: systemd-modules Tainted: P           OE     5.19.2-arch1-1 #1 1368c994e25e19983709ee8b14ef7d9de0c6a97a
[    6.340383] Hardware name: Micro-Star International Co., Ltd. Katana GF76 11UC/MS-17L2, BIOS E17L2IMS.30F 12/02/2021
[    6.340384] RIP: 0010:exc_control_protection+0xc2/0xd0
[    6.340390] Code: 8b 93 80 00 00 00 be fa 00 00 00 48 c7 c7 56 4f c8 aa e8 b1 1e 4c ff e9 72 ff ff ff 48 c7 c7 3d 4f c8 aa e8 c4 15 fb ff 0f 0b <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 66 0f 1f 00 55 53 48 89
[    6.340393] RSP: 0018:ffffc2370055bb88 EFLAGS: 00010002
[    6.340395] RAX: 0000000000000033 RBX: ffffc2370055bba8 RCX: 0000000000000027
[    6.340397] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffa0612fb61660
[    6.340398] RBP: 0000000000000003 R08: 0000000000000000 R09: ffffc2370055ba20
[    6.340400] R10: 0000000000000003 R11: ffffffffab4cb428 R12: 0000000000000000
[    6.340401] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[    6.340403] FS:  00007f9a7ace64c0(0000) GS:ffffa0612fb40000(0000) knlGS:0000000000000000
[    6.340405] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    6.340406] CR2: 00007f9a78f81000 CR3: 0000000100b36002 CR4: 0000000000f70ee0
[    6.340408] PKRU: 55555554
[    6.340409] Call Trace:
[    6.340410]  <TASK>
[    6.340412]  asm_exc_control_protection+0x26/0x30
[    6.340415] RIP: 0010:_nv011437rm+0x0/0x10 [nvidia]
[    6.340646] Code: 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec 08 e8 c7 12 1e 00 48 83 c4 08 48 89 c7 e9 bb ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 90 <48> 89 f7 e9 18 08 00 00 0f 1f 84 00 00 00 00 00 48 89 f7 e9 18 08
[    6.340649] RSP: 0018:ffffc2370055bc58 EFLAGS: 00010202
[    6.340650] RAX: ffffffffc186e8c0 RBX: ffffffffc3a853b0 RCX: 0000000000000000
[    6.340652] RDX: 00000000000d592e RSI: 0000000000000010 RDI: ffffffffc3a853b0
[    6.340654] RBP: ffffa05de7fc5fe0 R08: 0000000000000020 R09: ffffffffc3a853f0
[    6.340655] R10: ffffffffc3a3c0f0 R11: 0000000000000000 R12: 0000000000000010
[    6.340657] R13: ffffa05de7fc3000 R14: 00007f9a7b1f9343 R15: ffffc2370055bdd8
[    6.340659]  ? _nv034928rm+0x20/0x20 [nvidia 180c1458287a5b0a18d0491f7bc4adc4fd70ea8b]
[    6.340888]  _nv011435rm+0x24/0xe0 [nvidia 180c1458287a5b0a18d0491f7bc4adc4fd70ea8b]
[    6.341114]  _nv034929rm+0xe/0xa0 [nvidia 180c1458287a5b0a18d0491f7bc4adc4fd70ea8b]
[    6.341341]  _nv034932rm+0x1d/0x30 [nvidia 180c1458287a5b0a18d0491f7bc4adc4fd70ea8b]
[    6.341565]  _nv034934rm+0x2f/0x40 [nvidia 180c1458287a5b0a18d0491f7bc4adc4fd70ea8b]
[    6.341790]  _nv015577rm+0x15/0x70 [nvidia 180c1458287a5b0a18d0491f7bc4adc4fd70ea8b]
[    6.341913]  _nv000643rm+0x9/0x20 [nvidia 180c1458287a5b0a18d0491f7bc4adc4fd70ea8b]
[    6.342034]  ? cdev_add+0x50/0x70
[    6.342037]  rm_init_rm+0x17/0x60 [nvidia 180c1458287a5b0a18d0491f7bc4adc4fd70ea8b]
[    6.342227]  nvidia_init_module+0x242/0x616 [nvidia 180c1458287a5b0a18d0491f7bc4adc4fd70ea8b]
[    6.342368]  ? nvidia_init_module+0x616/0x616 [nvidia 180c1458287a5b0a18d0491f7bc4adc4fd70ea8b]
[    6.342503]  nvidia_frontend_init_module+0x50/0x94 [nvidia 180c1458287a5b0a18d0491f7bc4adc4fd70ea8b]
[    6.342641]  ? nvidia_init_module+0x616/0x616 [nvidia 180c1458287a5b0a18d0491f7bc4adc4fd70ea8b]
[    6.342775]  do_one_initcall+0x5a/0x220
[    6.342780]  do_init_module+0x4a/0x1e0
[    6.342783]  __do_sys_init_module+0x138/0x1b0
[    6.342785]  do_syscall_64+0x5c/0x90
[    6.342789]  ? handle_mm_fault+0xb2/0x280
[    6.342791]  ? do_user_addr_fault+0x1db/0x690
[    6.342795]  ? exc_page_fault+0x74/0x170
[    6.342796]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[    6.342800] RIP: 0033:0x7f9a7b0e6ace
[    6.342803] Code: 48 8b 0d d5 f2 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a2 f2 0c 00 f7 d8 64 89 01 48
[    6.342805] RSP: 002b:00007ffcb8c96d08 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[    6.342808] RAX: ffffffffffffffda RBX: 0000558a439e89d0 RCX: 00007f9a7b0e6ace
[    6.342809] RDX: 00007f9a7b1f9343 RSI: 0000000003ce1720 RDI: 00007f9a752a0010
[    6.342811] RBP: 00007f9a7b1f9343 R08: 0000558a439e88d0 R09: 0000000000000000
[    6.342812] R10: 0000000000000005 R11: 0000000000000246 R12: 0000000000020000
[    6.342814] R13: 0000558a439e8aa0 R14: 0000558a439e89d0 R15: 0000558a439e8c00
[    6.342816]  </TASK>
[    6.342817] Modules linked in: intel_rapl_msr(+) wmi_bmof(+) sparse_keymap(+) pcc_cpufreq(-) fjes(-) acpi_cpufreq(-) gpio_keys pmt_class snd_pcm_dmaengine mac80211(+) kvm(+) snd_hda_intel irqbypass libarc4 snd_intel_dspcfg crct10dif_pclmul nvidia(POE+) btusb crc32_pclmul snd_intel_sdw_acpi ghash_clmulni_intel iwlwifi btrtl aesni_intel snd_hda_codec uvcvideo btbcm crypto_simd snd_hda_core iwlmei btintel videobuf2_vmalloc cryptd intel_cstate btmtk snd_hwdep processor_thermal_device_pci_legacy videobuf2_memops i915(+) intel_uncore videobuf2_v4l2 cfg80211 processor_thermal_device snd_pcm psmouse spi_intel_pci r8169 bluetooth pcspkr videobuf2_common drm_buddy processor_thermal_rfim spi_intel snd_timer ttm realtek processor_thermal_mbox snd videodev i2c_i801 mei_me mdio_devres drm_display_helper vfat tpm_crb processor_thermal_rapl ecdh_generic intel_lpss_pci soundcore i2c_smbus fat libphy cec intel_rapl_common rfkill mc tpm_tis intel_lpss mei int340x_thermal_zone idma64 intel_gtt i2c_hid_acpi
[    6.342846]  tpm_tis_core intel_vsec wmi intel_soc_dts_iosf i2c_hid tpm soc_button_array rng_core mac_hid int3400_thermal acpi_pad acpi_tad acpi_thermal_rel video crypto_user fuse bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 serio_raw atkbd libps2 vivaldi_fmap nvme xhci_pci crc32c_intel nvme_core i8042 xhci_pci_renesas serio
[    6.342893] R10: 0000000000000003 R11: ffffffffab4cb428 R12: 0000000000000000
[    6.342895] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[    6.342896] FS:  00007f9a7ace64c0(0000) GS:ffffa0612fb40000(0000) knlGS:0000000000000000
[    6.342898] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    6.342900] CR2: 00007f9a78f81000 CR3: 0000000100b36002 CR4: 0000000000f70ee0
[    6.342901] PKRU: 55555554

Drivers:

[miro@katana ~]$ pacman -Ss nvidia | grep installed
extra/egl-wayland 2:1.1.10-1 [installed]
extra/ffnvcodec-headers 11.1.5.1-2 [installed]
extra/libvdpau 1.5-1 [installed]
extra/nvidia 515.65.01-8 [installed]
extra/nvidia-utils 515.65.01-2 [installed]
community/nvtop 2.0.2-1 [installed]

Hardware: MSI Katana GF76 11UC

10 minutes after closing the lid, with the fans running again:

[root@katana miro]# shutdown now
Failed to power off system via logind: There's already a shutdown or sleep operation in progress

This sounds like udev magic to me, but I know nothing about udev.


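Some recent error messages (which I unfortunately deleted) mentioned nvidia-sleep.sh hanging, which kept the system power state from transitioning. Investigation pointed to the VT as the culprit. Here is the file: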

#!/bin/bash

if [ ! -f /proc/driver/nvidia/suspend ]; then
    exit 0
fi

RUN_DIR="/var/run/nvidia-sleep"
XORG_VT_FILE="${RUN_DIR}"/Xorg.vt_number

PATH="/bin:/usr/bin"

case "$1" in
    suspend|hibernate)
        mkdir -p "${RUN_DIR}"
        fgconsole > "${XORG_VT_FILE}"
        chvt 63
        if [[ $? -ne 0 ]]; then
            exit $?
        fi
        echo "$1" > /proc/driver/nvidia/suspend
        exit $?
        ;;
    resume)
        echo "$1" > /proc/driver/nvidia/suspend 
        #
        # Check if Xorg was determined to be running at the time
        # of suspend, and whether its VT was recorded.  If so,
        # attempt to switch back to this VT.
        #
        if [[ -f "${XORG_VT_FILE}" ]]; then
            XORG_PID=$(cat "${XORG_VT_FILE}")
            rm "${XORG_VT_FILE}"
            chvt "${XORG_PID}"
        fi
        exit 0
        ;;
    *)
        exit 1
esac
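
A side observation on the suspend branch above, offered as a sketch rather than a diagnosis: in bash, $? inside the then-branch holds the exit status of the [[ ... ]] test itself, which is 0 whenever that branch is taken. So the exit $? there reports success even when chvt failed. The function names below (fails, buggy, fixed) are made up for this demonstration and are not part of nvidia-sleep.sh.

```shell
#!/bin/bash
# Stand-in for a failing command such as chvt (hypothetical helper).
fails() { return 3; }

# Mirrors the pattern in the script: the original status is lost.
buggy() {
    fails
    if [[ $? -ne 0 ]]; then
        return $?    # $? is now the status of [[ ... ]], which is 0
    fi
}

# Saves the status before testing it, so the failure propagates.
fixed() {
    fails
    local rc=$?
    if [[ $rc -ne 0 ]]; then
        return "$rc"
    fi
}
```

Whether this has anything to do with the freeze is unclear; it only means that a failed chvt 63 would go unreported to the caller.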

Answer 1

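Multiple instances of nvidia-sleep.sh resume leave the system unusable for power management (hence the sudo shutdown now error). I did not see a way to deal with that properly, so I implemented the following barbaric "temporary" workaround: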

resume)
++    exit 0

I have only tested it over about 30 sleep cycles (without rebooting in between). It worked perfectly.
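
A less drastic variant, which I have not tested, would keep the resume branch but put a time limit on the call that might hang. This assumes the hang is in chvt, which is a guess; it could just as well be in the write to /proc/driver/nvidia/suspend. The sketch uses timeout from GNU coreutils; run_bounded and saved_vt are names invented for this example.

```shell
#!/bin/bash
# Hypothetical helper: run a command, but give up after a time limit.
# Usage: run_bounded SECONDS COMMAND [ARGS...]
run_bounded() {
    local limit="$1"
    shift
    # timeout(1) sends SIGTERM when the limit expires and then
    # returns exit status 124; otherwise it returns the command's status.
    timeout "${limit}" "$@"
}

# How the resume branch might use it (saved_vt stands in for the number
# the script reads back from Xorg.vt_number):
#   run_bounded 5 chvt "${saved_vt}" || echo "chvt failed or timed out" >&2
```

If the process is stuck in an uninterruptible kernel wait, the signal sent by timeout will not end it, so this is only a partial safeguard.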

What is the point, anyway? Why switch back to Xorg's VT? What happens if there are Xorg instances on several VTs? What if I am working on a console VT and want to wake up where I left off? It makes no sense to me.
