I'm trying to install Slurm on an Ubuntu PC, so I followed the instructions given here.
I did the following -
sudo apt update -y
sudo apt install slurmd slurmctld -y
sudo mkdir /etc/slurm-llnl
FYI, step 3 is something I figured out myself.
sudo chmod 777 /etc/slurm-llnl
sudo cat << EOF > /etc/slurm-llnl/slurm.conf
ClusterName=localcluster
SlurmctldHost=localhost
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#
# COMPUTE NODES
NodeName=localhost CPUs=12 RealMemory=8000 State=UNKNOWN
PartitionName=LocalQ Nodes=ALL Default=YES MaxTime=INFINITE State=UP
EOF
sudo systemctl start slurmctld
sudo systemctl start slurmd
Now when I run this -
sudo scontrol update nodename=localhost state=idle
I get this error:
scontrol: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
scontrol: error: fetch_config: DNS SRV lookup failed
scontrol: error: _establish_config_source: failed to fetch config
scontrol: fatal: Could not establish a configuration source
Edit 1 -
I followed Paul's instructions. Now I get the following output:
(base) thoma@thoma-Lenovo-Legion-5-15IMH05H:/$ systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2024-03-05 05:57:17 CST; 2h 42min ago
Docs: man:slurmctld(8)
Main PID: 6509 (slurmctld)
Tasks: 10
Memory: 4.3M
CPU: 2.378s
CGroup: /system.slice/slurmctld.service
├─6509 /usr/sbin/slurmctld -D -s
└─6517 "slurmctld: slurmscriptd" "" ""
Mar 05 05:58:27 thoma-Lenovo-Legion-5-15IMH05H slurmctld[6509]: slurmctld: Invalid node state transition requested for node localhost from=INVAL to=IDLE
Mar 05 05:58:27 thoma-Lenovo-Legion-5-15IMH05H slurmctld[6509]: slurmctld: _slurm_rpc_update_node for localhost: Invalid node state specified
Mar 05 06:00:07 thoma-Lenovo-Legion-5-15IMH05H slurmctld[6509]: slurmctld: Invalid node state transition requested for node localhost from=INVAL to=IDLE
Mar 05 06:00:07 thoma-Lenovo-Legion-5-15IMH05H slurmctld[6509]: slurmctld: _slurm_rpc_update_node for localhost: Invalid node state specified
Mar 05 06:01:30 thoma-Lenovo-Legion-5-15IMH05H slurmctld[6509]: slurmctld: Invalid node state transition requested for node localhost from=INVAL to=RESUME
Mar 05 06:01:30 thoma-Lenovo-Legion-5-15IMH05H slurmctld[6509]: slurmctld: _slurm_rpc_update_node for localhost: Invalid node state specified
Mar 05 06:02:13 thoma-Lenovo-Legion-5-15IMH05H slurmctld[6509]: slurmctld: Invalid node state transition requested for node localhost from=INVAL to=RESUME
Mar 05 06:02:13 thoma-Lenovo-Legion-5-15IMH05H slurmctld[6509]: slurmctld: _slurm_rpc_update_node for localhost: Invalid node state specified
Mar 05 06:02:20 thoma-Lenovo-Legion-5-15IMH05H slurmctld[6509]: slurmctld: Invalid node state transition requested for node localhost from=INVAL to=IDLE
Mar 05 06:02:20 thoma-Lenovo-Legion-5-15IMH05H slurmctld[6509]: slurmctld: _slurm_rpc_update_node for localhost: Invalid node state specified
(base) thoma@thoma-Lenovo-Legion-5-15IMH05H:/$ systemctl status slurmd
● slurmd.service - Slurm node daemon
Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2024-03-05 05:57:17 CST; 2h 42min ago
Docs: man:slurmd(8)
Main PID: 6514 (slurmd)
Tasks: 1
Memory: 316.0K
CPU: 22ms
CGroup: /system.slice/slurmd.service
└─6514 /usr/sbin/slurmd -D -s
Mar 05 05:57:17 thoma-Lenovo-Legion-5-15IMH05H systemd[1]: Started Slurm node daemon.
Mar 05 05:57:17 thoma-Lenovo-Legion-5-15IMH05H slurmd[6514]: slurmd: error: Node configuration differs from hardware: CPUs=12:12(hw) Boards=1:1(hw) SocketsPerBoard=12:1(hw) CoresPerSocket=1:6(hw) ThreadsPerCore>
Mar 05 05:57:17 thoma-Lenovo-Legion-5-15IMH05H slurmd[6514]: slurmd: slurmd version 21.08.5 started
Mar 05 05:57:17 thoma-Lenovo-Legion-5-15IMH05H slurmd[6514]: slurmd: slurmd started on Tue, 05 Mar 2024 05:57:17 -0600
Mar 05 05:57:17 thoma-Lenovo-Legion-5-15IMH05H slurmd[6514]: slurmd: CPUs=12 Boards=1 Sockets=12 Cores=1 Threads=1 Memory=7838 TmpDisk=1252975 Uptime=372 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(>
lines 1-16/16 (END)
Answer 1
Did you also start the munge service?
Try starting it with systemctl, like this:
sudo systemctl start munge
sudo systemctl status munge
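By default Slurm's daemons authenticate through munge, so it can help to confirm not only that the service is running but also that a credential can be encoded and decoded locally. A minimal sketch, assuming the munge package is installed and the service is named munge as in the commands above; check_munge is a hypothetical helper name, not something shipped with munge or Slurm:

```shell
# check_munge: hypothetical helper, not part of munge or Slurm.
# Returns 0 only if the munge service is active and a test credential
# can be encoded and decoded on this machine.
check_munge() {
    if ! command -v munge >/dev/null 2>&1 || ! command -v unmunge >/dev/null 2>&1; then
        echo "munge tools not found in PATH" >&2
        return 1
    fi
    if ! systemctl is-active --quiet munge; then
        echo "munge service is not active" >&2
        return 1
    fi
    # Encode an empty payload and decode it again; unmunge exits non-zero on failure.
    if munge -n | unmunge >/dev/null; then
        echo "munge OK"
    else
        echo "munge credential round trip failed" >&2
        return 1
    fi
}
```

Calling check_munge after starting the service should print "munge OK"; any other message points at the step that failed.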
You may also want to have a look at this guide: I wrote an article on how to install Slurm in a single-node environment on Ubuntu 22.04.
Cheers.
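Answer 2
Looking at the systemctl output you provided, I notice the following:
1- Regarding slurmd, the hardware configuration you defined in slurm.conf is incorrect. What are the hardware specifications of the node this configuration will run on?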
(Mar 05 05:57:17 thoma-Lenovo-Legion-5-15IMH05H slurmd[6514]: slurmd: error: Node configuration differs from hardware: CPUs=12:12(hw) Boards=1:1(hw) SocketsPerBoard=12:1(hw) CoresPerSocket=1:6(hw) ThreadsPerCore>)
According to this output, your values for SocketsPerBoard and CoresPerSocket should be 1 and 6, respectively.
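One way to get the right values is to ask slurmd itself: slurmd -C prints the hardware it detects in slurm.conf syntax, which also gives the ThreadsPerCore value that is cut off in the log line above. A small sketch that keeps only the hardware-related keys from that output; topology_fields is a hypothetical helper name, and the exact set of keys printed may differ between Slurm versions:

```shell
# topology_fields: hypothetical helper that reads text in slurm.conf
# "Key=Value Key=Value ..." form on stdin and prints the hardware keys,
# one per line.
topology_fields() {
    tr -s ' ' '\n' | grep -E '^(CPUs|Boards|SocketsPerBoard|CoresPerSocket|ThreadsPerCore|RealMemory)='
}

# Intended use on the compute node (requires slurmd to be installed):
#   slurmd -C | topology_fields
```

The printed values can then be copied into the NodeName line so that the configuration matches what slurmd detects.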
2- Regarding slurmctld, the initial node state should be UNKNOWN, like this:
NodeName=localhost CPUs=12 RealMemory=30517 State=UNKNOWN
PartitionName=localhost Nodes=ALL Default=YES MaxTime=INFINITE State=UP
Note: "8000" is your RealMemory value. Try using "8192" instead; Slurm uses MiB values :)
After making these changes, try restarting both slurmd and slurmctld, and let me know if this helps.
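After the restart, the node state can be read back from scontrol show node to confirm that the change took effect. A sketch, assuming the node is named localhost as in the question; node_state is a hypothetical helper that only extracts the State field from that output:

```shell
# node_state: hypothetical helper; reads `scontrol show node` output on stdin
# and prints the value of the first State= field it finds.
node_state() {
    sed -n 's/.*State=\([^ ]*\).*/\1/p' | head -n 1
}

# Intended use once both daemons have been restarted:
#   sudo systemctl restart slurmctld slurmd
#   scontrol show node localhost | node_state
```

If this still prints INVAL, the Reason field in the same scontrol output and the slurmd log are the places to look. The slurmd log in the question reports Memory=7838, so it may also be worth checking that RealMemory in slurm.conf does not exceed what the node actually reports.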
Cheers!