I have reinstalled the Slurm resource manager on an HPC cluster, but there seems to be a problem starting the slurmd service. Here is the system status:
"systemctl status slurmd" shows:
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Sun 2024-11-10 11:31:24 +0130; 1 weeks 1 days ago
-------------------------------------------
The contents of slurm.conf are:
#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=linux
ControlMachine=master.cluster....
#ControlAddr=
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/spool/slurm/ctld
SlurmdSpoolDir=/var/spool/slurm/d
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
#ProctrackType=proctrack/pgid
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
TaskPluginParam=cpusets
#PluginDir=
#FirstJobId=
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# GPU definition (Added by S 2022/11)
GresTypes=gpu
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SelectType=select/linear
SelectType=select/cons_res   # select partial node
#SelectTypeParameters=CR_CPU_Memory
SelectTypeParameters=CR_Core_Memory
#FastSchedule=1
FastSchedule=0
PriorityType=priority/multifactor
PriorityDecayHalfLife=0
PriorityUsageResetPeriod=NONE
PriorityWeightFairshare=100000
#PriorityWeightAge=1000
PriorityWeightPartition=10000
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
DefMemPerCPU=2000
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
#JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
#
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=master
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODES
# OpenHPC default configuration
# Edited by Surin 1399/11
#AccountingStorageEnforce=limits
AccountingStorageEnforce=QOS,Limits,Associations
#TaskPlugin=task/affinity
#PropagateResourceLimitsExcept=MEMLOCK
#AccountingStorageType=accounting_storage/filetxt
Epilog=/etc/slurm/slurm.epilog.clean
NodeName=cn0[1-5] NodeHostName=cn0[1-5] RealMemory=128307 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 Feature=HyperThread State=UNKNOWN
NodeName=gp01 NodeHostName=gp01 RealMemory=128307 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 Feature=HyperThread Gres=gpu:4 State=UNKNOWN
#PartitionName=all Nodes=cn0[1-5],gp01 MaxTime=INFINITE State=UP Oversubscribe=EXCLUSIVE
PartitionName=all Nodes=cn0[1-5],gp01 MaxTime=10-00:00:00 State=UP MaxNodes=1
PartitionName=normal Nodes=cn0[1-5] Default=YES MaxTime=10-00:00:00 State=UP MaxNodes=1
#PartitionName=normal Nodes=cn0[1-5] Default=YES MaxTime=INFINITE State=UP Oversubscribe=EXCLUSIVE
PartitionName=gpu Nodes=gp01 MaxTime=10-00:00:00 State=UP
SlurmctldParameters=enable_configless
ReturnToService=1
HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=300
-------------------------------------------
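For reference, this is how I have been cross-checking the node definitions above against what the nodes actually report (just a sketch using standard Slurm commands, run on a compute node such as cn01):

# Print this node's hardware as a slurm.conf-style line, to compare with the NodeName=cn0[1-5] entry:
slurmd -C

# Expand the hostlist expression to confirm which hostnames it covers:
scontrol show hostnames "cn0[1-5]"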
scontrol show node cn01 shows:
NodeName=cn01 CoresPerSocket=16
   CPUAlloc=0 CPUTot=64 CPULoad=N/A
   AvailableFeatures=HyperThread
   ActiveFeatures=HyperThread
   Gres=(null)
   NodeAddr=cn01 NodeHostName=cn01
   RealMemory=128557 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
   State=DOWN* ThreadsPerCore=2 TmpDisk=64278 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=all,normal
   BootTime=None SlurmdStartTime=None
   CfgTRES=cpu=64,mem=128557M,billing=64
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=undraining
-------------------------------------------
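Since cn01 is stuck in State=DOWN* with Reason=undraining, my understanding is that once its slurmd comes back I should be able to clear that state with the standard scontrol update command (sketch):

scontrol update NodeName=cn01 State=RESUME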
"scontrol ping" also works as:
Slurmctld(primary) at master.cluster... is UP
-------------------------------------------
"systemctl start slurmd" shows:
Job for slurmd.service failed because the control process exited with error code. See "systemctl status slurmd.service" and "journalctl -xe" for details.
-------------------------------------------
"systemctl status slurmd.service" shows:
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2024-11-18 17:02:50 +0130; 41s ago
  Process: 219025 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=1/FAILURE)

Nov 18 17:02:50 master.cluster.... systemd[1]: Starting Slurm node daemon...
Nov 18 17:02:50 master.cluster.... slurmd[219025]: fatal: Unable to determine this slurmd's NodeName
Nov 18 17:02:50 master.cluster.... systemd[1]: slurmd.service: control process exited, code=exited status=1
Nov 18 17:02:50 master.cluster.... systemd[1]: Failed to start Slurm node daemon.
Nov 18 17:02:50 master.cluster.... systemd[1]: Unit slurmd.service entered failed state.
Nov 18 17:02:50 master.cluster.... systemd[1]: slurmd.service failed.
-------------------------------------------
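One thing I notice: the failing slurmd is the one on the head node (master.cluster....), and the head node does not appear under any NodeName line in slurm.conf, so presumably slurmd cannot match its hostname to a node definition. This is how I have been comparing the two, plus a foreground test run; the -N, -D and -v options are standard slurmd flags as far as I can tell:

hostname -s                                # short hostname of the machine slurmd is started on
grep -E '^NodeName' /etc/slurm/slurm.conf  # node names slurmd can match against
# Foreground test with an explicit node name and verbose logging (testing only):
slurmd -N cn01 -D -vvv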
"journalctl -xe" output is:
Nov 18 17:04:54 master.cluster.... dhcpd[2250]: DHCPDISCOVER from 70:35:09:f8:13:40 via eno2: network eno2: no free leases
Nov 18 17:05:04 master.cluster.... dhcpd[2250]: DHCPDISCOVER from 70:35:09:f8:13:40 via eno2: network eno2: no free leases
Nov 18 17:05:08 master.cluster.... dhcpd[2250]: DHCPDISCOVER from 70:35:09:f8:13:40 via eno2: network eno2: no free leases
Nov 18 17:05:12 master.cluster.... dhcpd[2250]: DHCPDISCOVER from 70:35:09:f8:13:40 via eno2: network eno2: no free leases
Nov 18 17:05:22 master.cluster.... dhcpd[2250]: DHCPDISCOVER from 70:35:09:f8:13:40 via eno2: network eno2: no free leases
Nov 18 17:05:26 master.cluster.... dhcpd[2250]: DHCPDISCOVER from 70:35:09:f8:13:40 via eno2: network eno2: no free leases
Nov 18 17:06:34 master.cluster.... munged[2514]: Purged 2 credentials from replay hash
-------------------------------------------
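The "journalctl -xe" output above is mostly dhcpd noise, so I have also been looking at the slurmd unit and its log file directly (SlurmdLogFile=/var/log/slurmd.log in the config above):

journalctl -u slurmd --since "2024-11-18 17:00" --no-pager
tail -n 50 /var/log/slurmd.log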
The slurmd.service unit file contains the following:
[Unit]
Description=Slurm node daemon
After=munge.service network.target remote-fs.target
ConditionPathExists=/etc/slurm/slurm.conf

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmd.pid
KillMode=process
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
Delegate=yes

[Install]
WantedBy=multi-user.target
-------------------------------------------
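The unit file sources /etc/sysconfig/slurmd and passes $SLURMD_OPTIONS to slurmd, and slurm.conf sets SlurmctldParameters=enable_configless, so I assume node-specific options would normally go in that file. A purely hypothetical example of what it could contain (not my current contents):

# /etc/sysconfig/slurmd  (hypothetical example)
SLURMD_OPTIONS="-N cn01 --conf-server master"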
For "which slurmd" command I get this address: /usr/sbin/slurm
-------------------------------------------
ls -l /usr/sbin/slurmd shows:
lrwxrwxrwx 1 root root 69 Nov 10 10:54 /usr/sbin/slurmd -> /install/centos7.9/compute_gpu/rootimg/usr/sbin/slurmd
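Since /usr/sbin/slurmd is a symlink into the compute node image under /install, I have also been checking that the target actually resolves and runs on this host (sketch):

readlink -f /usr/sbin/slurmd
ls -l /install/centos7.9/compute_gpu/rootimg/usr/sbin/slurmd
/usr/sbin/slurmd -V        # print the Slurm version to confirm the binary executes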