Dear slurm-user list,

I got this error:

Unable to start service slurmctld: Job for slurmctld.service failed because the control process exited with error code.
See "systemctl status slurmctld.service" and "journalctl -xeu slurmctld.service" for details.

but in slurmctld.service I see nothing suspicious:

slurmctld.service - Slurm controller daemon
     Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/slurmctld.service.d
             └─override.conf
     Active: active (running) since Wed 2024-02-07 15:50:56 UTC; 19min ago
   Main PID: 51552 (slurmctld)
      Tasks: 21 (limit: 9363)
     Memory: 10.4M
        CPU: 1min 16.088s
     CGroup: /system.slice/slurmctld.service
             ├─51552 /usr/sbin/slurmctld --systemd
             └─51553 "slurmctld: slurmscriptd" "" "" "" "" "" ""

Feb 07 15:58:21 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: sched: _slurm_rpc_allocate_resources JobId=3 NodeList=(null) usec=959
Feb 07 15:58:23 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: _job_complete: JobId=3 WTERMSIG 2
Feb 07 15:58:23 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: _job_complete: JobId=3 cancelled by interactive user
Feb 07 15:58:23 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: _job_complete: JobId=3 done
Feb 07 15:58:23 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: _slurm_rpc_complete_job_allocation: JobId=3 error Job/step already completing or completed
Feb 07 15:58:42 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: sched: _slurm_rpc_allocate_resources JobId=4 NodeList=cluster-master-2vt2bqh7ahec04c,cluster-worker-2vt2bqh7ahec04c-2 usec=512
Feb 07 16:06:04 cluster-master-2vt2bqh7ahec04c slurmctld[51553]: slurmctld: error: _run_script: JobId=0 resumeprog exit status 1:0
Feb 07 16:09:33 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: _job_complete: JobId=4 WTERMSIG 2
Feb 07 16:09:33 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: _job_complete: JobId=4 done
Feb 07 16:09:33 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: _slurm_rpc_complete_job_allocation: JobId=4 error Job/step already completing or completed

I am unsure how to debug this further. It might be coming from a previous problem I tried to fix (basically a few deprecated keys in the configuration). I will try to restart the entire cluster with the added changes to rule out any follow-up errors, but maybe it's something obvious a fellow list user can see.

Best regards,
Xaver
Your systemctl output shows that slurmctld is running OK, but that doesn't match your first entry, so it's hard to tell what's going on.

If slurmctld won't start under systemd and it's not clear why, the first step would be to enable something like `SlurmctldDebug=debug` and check the full logs in journalctl, or just run slurmctld in the foreground with:

/usr/sbin/slurmctld -D -vvv

Make sure the systemd service is properly stopped and there aren't any rogue slurmctld processes anywhere.

Many thanks,

Luke

--
Luke Sudbery
Principal Engineer (HPC and Storage)
Architecture, Infrastructure and Systems
Advanced Research Computing, IT Services
Room 132, Computer Centre G5, Elms Road

Please note I don't work on Monday.

-----Original Message-----
From: Xaver Stiensmeier via slurm-users <slurm-users@lists.schedmd.com>
Sent: Wednesday, February 7, 2024 4:17 PM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Errors upgrading to 23.11.0
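[Editor's note: the checks Luke describes can be run roughly as follows. This is a sketch using the unit name and binary path from the thread; exact paths may differ per distribution.]

```shell
# Stop the unit and make sure no leftover slurmctld processes remain
systemctl stop slurmctld
pgrep -a slurmctld        # should print nothing; kill any stragglers

# Run the controller in the foreground with maximum verbosity
# (-D = do not daemonize, -vvv = very verbose logging to the terminal)
/usr/sbin/slurmctld -D -vvv

# Alternatively, raise the log level in slurm.conf:
#   SlurmctldDebug=debug
# then restart and follow the journal:
journalctl -xeu slurmctld.service -f
```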
Thank you for your response. I have found out why there was no error in the log: I've been looking at the wrong log. The error didn't occur on the master, but on our vpn-gateway (it is a hybrid cloud setup) - but you can think of it as just another worker in the same network. The error I get there is:

`
Feb 08 11:38:25 cluster-vpngtw-3ts770ji3a8ubr1-0 slurmctld[32014]: slurmctld: fatal: auth/jwt: cannot stat '/etc/slurm/jwt-secret.key': No such file or directory
Feb 08 11:38:25 cluster-vpngtw-3ts770ji3a8ubr1-0 systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
Feb 08 11:38:25 cluster-vpngtw-3ts770ji3a8ubr1-0 systemd[1]: slurmctld.service: Failed with result 'exit-code'.
Feb 08 11:38:25 cluster-vpngtw-3ts770ji3a8ubr1-0 systemd[1]: Failed to start Slurm controller daemon.
`

In the past we have created the `jwt-secret.key` on the master at `/etc/slurm` and that was enough. I must admit that I am not completely familiar with it, but I will now look into it more closely and also double-check whether such a key is stored there in the old Slurm version.

Best regards,
Xaver

On 08.02.24 11:07, Luke Sudbery via slurm-users wrote:
Your systemctl output shows that slurmctld is running OK, but that doesn't match your first entry, so it's hard to tell what's going on.
If slurmctld won't start under systemd and it's not clear why, the first step would be to enable something like `SlurmctldDebug=debug` and check the full logs in journalctl, or just run slurmctld in the foreground with:
/usr/sbin/slurmctld -D -vvv
Make sure the systemd service is properly stopped and there aren't any rogue slurmctld processes anywhere.
Many thanks,
Luke
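[Editor's note: with `AuthAltTypes=auth/jwt`, every host that runs slurmctld or slurmdbd needs read access to the same HS256 key file named by `AuthAltParameters=jwt_key=...`. A sketch of creating and distributing it, using the key path from Xaver's log; the gateway hostname and ssh access are assumptions, and the commands need root.]

```shell
# Generate a random 32-byte HS256 key, as the Slurm JWT docs describe
dd if=/dev/random of=/etc/slurm/jwt-secret.key bs=32 count=1
chown slurm:slurm /etc/slurm/jwt-secret.key
chmod 0600 /etc/slurm/jwt-secret.key

# Copy it to every node that starts a Slurm daemon with auth/jwt enabled,
# e.g. the vpn-gateway from the log (hostname is a placeholder):
scp -p /etc/slurm/jwt-secret.key cluster-vpngtw:/etc/slurm/

# slurm.conf (and slurmdbd.conf) must point at the same file on all hosts:
#   AuthAltTypes=auth/jwt
#   AuthAltParameters=jwt_key=/etc/slurm/jwt-secret.key
```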