I am running multiple GPUs (6x GTX 1060) and two of them are not responding. Querying the drivers with lspci -nnk (output below) shows "Kernel driver in use: nvidia" for four of the GPUs, while the other two have no "Kernel driver in use" line at all. I am running Ubuntu 16.04 LTS on kernel 4.4.0-104-generic, with CUDA 8.0 for TensorFlow and the nvidia-387 drivers (open source) installed. Any idea why the kernel driver is not showing up?
00:00.0 Host bridge [0600]: Intel Corporation Sky Lake Host Bridge/DRAM Registers [8086:190f] (rev 07)
Subsystem: ASUSTeK Computer Inc. Skylake Host Bridge/DRAM Registers [1043:8694]
00:01.0 PCI bridge [0604]: Intel Corporation Sky Lake PCIe Controller (x16) [8086:1901] (rev 07)
Kernel driver in use: pcieport
Kernel modules: shpchp
00:14.0 USB controller [0c03]: Intel Corporation Device [8086:a2af]
Subsystem: ASUSTeK Computer Inc. Device [1043:8694]
Kernel driver in use: xhci_hcd
00:16.0 Communication controller [0780]: Intel Corporation Device [8086:a2ba]
Subsystem: ASUSTeK Computer Inc. Device [1043:8694]
Kernel driver in use: mei_me
Kernel modules: mei_me
00:17.0 SATA controller [0106]: Intel Corporation Device [8086:a282]
Subsystem: ASUSTeK Computer Inc. Device [1043:8694]
Kernel driver in use: ahci
Kernel modules: ahci
00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:a294] (rev f0)
Kernel driver in use: pcieport
Kernel modules: shpchp
00:1c.6 PCI bridge [0604]: Intel Corporation Device [8086:a296] (rev f0)
Kernel driver in use: pcieport
Kernel modules: shpchp
00:1c.7 PCI bridge [0604]: Intel Corporation Device [8086:a297] (rev f0)
Kernel driver in use: pcieport
Kernel modules: shpchp
00:1d.0 PCI bridge [0604]: Intel Corporation Device [8086:a298] (rev f0)
Kernel driver in use: pcieport
Kernel modules: shpchp
00:1d.2 PCI bridge [0604]: Intel Corporation Device [8086:a29a] (rev f0)
Kernel driver in use: pcieport
Kernel modules: shpchp
00:1d.3 PCI bridge [0604]: Intel Corporation Device [8086:a29b] (rev f0)
Kernel driver in use: pcieport
Kernel modules: shpchp
00:1f.0 ISA bridge [0601]: Intel Corporation Device [8086:a2c8]
Subsystem: ASUSTeK Computer Inc. Device [1043:8694]
00:1f.2 Memory controller [0580]: Intel Corporation Device [8086:a2a1]
Subsystem: ASUSTeK Computer Inc. Device [1043:8694]
00:1f.3 Audio device [0403]: Intel Corporation Device [8086:a2f0]
Subsystem: ASUSTeK Computer Inc. Device [1043:8723]
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
00:1f.4 SMBus [0c05]: Intel Corporation Device [8086:a2a3]
Subsystem: ASUSTeK Computer Inc. Device [1043:8694]
Kernel modules: i2c_i801
00:1f.6 Ethernet controller [0200]: Intel Corporation Ethernet Connection (2) I219-V [8086:15b8]
Subsystem: ASUSTeK Computer Inc. Ethernet Connection (2) I219-V [1043:8672]
Kernel driver in use: e1000e
Kernel modules: e1000e
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1c03] (rev a1)
Subsystem: eVga.com. Corp. Device [3842:6163]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_387_drm, nvidia_387
01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:10f1] (rev a1)
Subsystem: eVga.com. Corp. Device [3842:6163]
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
02:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1c03] (rev a1)
Subsystem: eVga.com. Corp. Device [3842:6161]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_387_drm, nvidia_387
02:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:10f1] (rev a1)
Subsystem: eVga.com. Corp. Device [3842:6161]
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
03:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1c03] (rev a1)
Subsystem: eVga.com. Corp. Device [3842:6163]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_387_drm, nvidia_387
03:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:10f1] (rev a1)
Subsystem: eVga.com. Corp. Device [3842:6163]
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
04:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1c03] (rev a1)
Subsystem: eVga.com. Corp. Device [3842:6161]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_387_drm, nvidia_387
04:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:10f1] (rev a1)
Subsystem: eVga.com. Corp. Device [3842:6161]
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
06:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1c03] (rev a1)
Subsystem: eVga.com. Corp. Device [3842:6163]
Kernel modules: nvidiafb, nouveau, nvidia_387_drm, nvidia_387
06:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:10f1] (rev a1)
Subsystem: eVga.com. Corp. Device [3842:6163]
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
07:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1c03] (rev a1)
Subsystem: eVga.com. Corp. Device [3842:6163]
Kernel modules: nvidiafb, nouveau, nvidia_387_drm, nvidia_387
07:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:10f1] (rev a1)
Subsystem: eVga.com. Corp. Device [3842:6163]
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
Answer 1
Update: the solution was to go into the BIOS, open up all the PCIe lanes, and enable "Above 4G Decoding".
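To check whether a BIOS change like this took effect, it can help to list the GPUs that still have no driver bound, rather than reading the lspci output by eye. The following is only a minimal sketch: the function name gpus_without_driver is made up for illustration, and it assumes the default `lspci -nnk` layout shown in the question (a device line starting with the PCI address, followed by optional "Kernel driver in use:" lines).

```python
import re

# A device header line starts with a PCI address such as "01:00.0".
HEADER = re.compile(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f])\s+(.*)$")


def gpus_without_driver(lspci_text):
    """Return PCI addresses of VGA controllers that have no
    'Kernel driver in use' line in `lspci -nnk` style text.
    (Hypothetical helper, not part of any tool.)"""
    unbound = []
    current = None       # address of the VGA device being scanned, if any
    has_driver = False   # whether a driver line was seen for that device
    for raw in lspci_text.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            # A new device starts: close out the previous VGA device.
            if current is not None and not has_driver:
                unbound.append(current)
            is_vga = "VGA compatible controller" in match.group(2)
            current = match.group(1) if is_vga else None
            has_driver = False
        elif line.startswith("Kernel driver in use:"):
            has_driver = True
    # Close out the last device in the listing.
    if current is not None and not has_driver:
        unbound.append(current)
    return unbound
```

The text to pass in is whatever `lspci -nnk` prints. With the listing from the question, the two cards that only show a "Kernel modules:" line are the ones this would report.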
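Answer 2
You need to remove the "nouveau" module/driver.
Assuming you are using GNOME, open "Software & Updates" -> "Additional Drivers" and switch to one of the "NVIDIA binary driver" options.
If that still does not fix the problem, your only option is to use the blacklist method described here: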
http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
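As a rough illustration of the blacklist method, the sketch below holds the two lines the CUDA installation guide gives for Ubuntu (placed in /etc/modprobe.d/blacklist-nouveau.conf) and adds a small check for whether nouveau is currently loaded. The names BLACKLIST_CONF and nouveau_loaded are made up for this example, and the check assumes the usual /proc/modules format where the module name is the first field on each line.

```python
# Lines the CUDA installation guide lists for disabling nouveau on Ubuntu.
# They go in /etc/modprobe.d/blacklist-nouveau.conf (writing it needs root).
BLACKLIST_CONF = "blacklist nouveau\noptions nouveau modeset=0\n"


def nouveau_loaded(proc_modules_text):
    """Return True if a module named 'nouveau' appears in
    /proc/modules style text. (Hypothetical helper for illustration.)"""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first whitespace-separated field is the module name.
        if fields and fields[0] == "nouveau":
            return True
    return False
```

The contents of /proc/modules can be read as a normal text file and passed to nouveau_loaded. After creating the blacklist file, the guide has you regenerate the initramfs with `sudo update-initramfs -u` and reboot.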