Dear slurm-user list,
I got this error:
Unable to start service slurmctld: Job for slurmctld.service failed because the control process exited with error code.
See "systemctl status slurmctld.service" and "journalctl -xeu slurmctld.service" for details.
but the output of "systemctl status slurmctld.service" shows nothing suspicious; the service is even reported as active (running):
slurmctld.service - Slurm controller daemon
     Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/slurmctld.service.d
             └─override.conf
     Active: active (running) since Wed 2024-02-07 15:50:56 UTC; 19min ago
   Main PID: 51552 (slurmctld)
      Tasks: 21 (limit: 9363)
     Memory: 10.4M
        CPU: 1min 16.088s
     CGroup: /system.slice/slurmctld.service
             ├─51552 /usr/sbin/slurmctld --systemd
             └─51553 "slurmctld: slurmscriptd" "" "" "" "" "" ""
Feb 07 15:58:21 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: sched: _slurm_rpc_allocate_resources JobId=3 NodeList=(null) usec=959
Feb 07 15:58:23 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: _job_complete: JobId=3 WTERMSIG 2
Feb 07 15:58:23 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: _job_complete: JobId=3 cancelled by interactive user
Feb 07 15:58:23 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: _job_complete: JobId=3 done
Feb 07 15:58:23 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: _slurm_rpc_complete_job_allocation: JobId=3 error Job/step already completing or completed
Feb 07 15:58:42 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: sched: _slurm_rpc_allocate_resources JobId=4 NodeList=cluster-master-2vt2bqh7ahec04c,cluster-worker-2vt2bqh7ahec04c-2 usec=512
Feb 07 16:06:04 cluster-master-2vt2bqh7ahec04c slurmctld[51553]: slurmctld: error: _run_script: JobId=0 resumeprog exit status 1:0
Feb 07 16:09:33 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: _job_complete: JobId=4 WTERMSIG 2
Feb 07 16:09:33 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: _job_complete: JobId=4 done
Feb 07 16:09:33 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: _slurm_rpc_complete_job_allocation: JobId=4 error Job/step already completing or completed
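In case it helps anyone skim the journal excerpt above, here is a minimal sketch for pulling out only the lines that slurmctld itself tags with "error:". It assumes the journal text has been saved to a string; the function name extract_errors and the regular expression are my own illustration, not part of Slurm or systemd. Lines that merely contain the word "error" elsewhere in the message (such as the _slurm_rpc_complete_job_allocation ones) are deliberately not matched.

```python
import re

# Matches journal lines of the form
#   "<timestamp> <host> slurmctld[<pid>]: slurmctld: error: <message>"
# This pattern is an assumption based on the excerpt above, not a documented format.
ERROR_LINE = re.compile(
    r"slurmctld\[(?P<pid>\d+)\]: slurmctld: error: (?P<msg>.*)$"
)


def extract_errors(journal_text):
    """Return (pid, message) tuples for every line tagged 'error:' (hypothetical helper)."""
    hits = []
    for line in journal_text.splitlines():
        match = ERROR_LINE.search(line)
        if match:
            hits.append((int(match.group("pid")), match.group("msg")))
    return hits


# Two lines copied from the journal excerpt above.
sample = (
    "Feb 07 16:06:04 cluster-master-2vt2bqh7ahec04c slurmctld[51553]: "
    "slurmctld: error: _run_script: JobId=0 resumeprog exit status 1:0\n"
    "Feb 07 16:09:33 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: "
    "slurmctld: _job_complete: JobId=4 done"
)

for pid, msg in extract_errors(sample):
    print(pid, msg)
```

On the excerpt above, the only line this picks up is the resumeprog one, logged by PID 51553 (the slurmscriptd child) rather than by the main slurmctld process.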
I am unsure how to debug this further. It might stem from an earlier problem I tried to fix (a few deprecated keys in the configuration).
I will try restarting the entire cluster with those changes applied to rule out any follow-up errors, but maybe it's something obvious that a fellow list user can spot.
Best regards, Xaver