Your systemctl output shows slurmctld running fine, which doesn't match the error in your first message, so it's hard to tell what's going on.
But if slurmctld won't start under systemd and it isn't clear why, the first step would be to enable something like `SlurmctldDebug=debug` and check the full logs in journalctl, or just run slurmctld in the foreground with:
/usr/sbin/slurmctld -D -vvv
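For example, something like this in slurm.conf (assuming your config lives in the default /etc/slurm/slurm.conf; adjust to your layout):

SlurmctldDebug=debug

and then follow the unit log while restarting:

journalctl -xeu slurmctld.service -f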
Before the foreground run, make sure the systemd service is fully stopped and there aren't any rogue slurmctld processes left anywhere.
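Something along these lines should do it:

systemctl stop slurmctld
pgrep -af slurmctld    # should print nothing before you start the foreground run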
Many thanks,
Luke
--
Luke Sudbery
Principal Engineer (HPC and Storage).
Architecture, Infrastructure and Systems
Advanced Research Computing, IT Services
Room 132, Computer Centre G5, Elms Road
Please note I don’t work on Monday.
-----Original Message-----
From: Xaver Stiensmeier via slurm-users <slurm-users@lists.schedmd.com>
Sent: Wednesday, February 7, 2024 4:17 PM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Errors upgrading to 23.11.0
Dear slurm-user list,
I got this error:
Unable to start service slurmctld: Job for slurmctld.service failed
because the control process exited with error code.
See "systemctl status slurmctld.service" and "journalctl -xeu
slurmctld.service" for details.
but in slurmctld.service I see nothing suspicious:
slurmctld.service - Slurm controller daemon
Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled;
vendor preset: enabled)
Drop-In: /etc/systemd/system/slurmctld.service.d
└─override.conf
Active: active (running) since Wed 2024-02-07 15:50:56 UTC; 19min ago
Main PID: 51552 (slurmctld)
Tasks: 21 (limit: 9363)
Memory: 10.4M
CPU: 1min 16.088s
CGroup: /system.slice/slurmctld.service
├─51552 /usr/sbin/slurmctld --systemd
└─51553 "slurmctld: slurmscriptd" "" "" "" "" "" ""
Feb 07 15:58:21 cluster-master-2vt2bqh7ahec04c slurmctld[51552]:
slurmctld: sched: _slurm_rpc_allocate_resources JobId=3 NodeList=(null)
usec=959
Feb 07 15:58:23 cluster-master-2vt2bqh7ahec04c slurmctld[51552]:
slurmctld: _job_complete: JobId=3 WTERMSIG 2
Feb 07 15:58:23 cluster-master-2vt2bqh7ahec04c slurmctld[51552]:
slurmctld: _job_complete: JobId=3 cancelled by interactive user
Feb 07 15:58:23 cluster-master-2vt2bqh7ahec04c slurmctld[51552]:
slurmctld: _job_complete: JobId=3 done
Feb 07 15:58:23 cluster-master-2vt2bqh7ahec04c slurmctld[51552]:
slurmctld: _slurm_rpc_complete_job_allocation: JobId=3 error Job/step
already completing or completed
Feb 07 15:58:42 cluster-master-2vt2bqh7ahec04c slurmctld[51552]:
slurmctld: sched: _slurm_rpc_allocate_resources JobId=4
NodeList=cluster-master-2vt2bqh7ahec04c,cluster-worker-2vt2bqh7ahec04c-2
usec=512
Feb 07 16:06:04 cluster-master-2vt2bqh7ahec04c slurmctld[51553]:
slurmctld: error: _run_script: JobId=0 resumeprog exit status 1:0
Feb 07 16:09:33 cluster-master-2vt2bqh7ahec04c slurmctld[51552]:
slurmctld: _job_complete: JobId=4 WTERMSIG 2
Feb 07 16:09:33 cluster-master-2vt2bqh7ahec04c slurmctld[51552]:
slurmctld: _job_complete: JobId=4 done
Feb 07 16:09:33 cluster-master-2vt2bqh7ahec04c slurmctld[51552]:
slurmctld: _slurm_rpc_complete_job_allocation: JobId=4 error Job/step
already completing or completed
I am unsure how to debug this further. It might stem from a previous
problem I tried to fix (a few deprecated keys in the configuration).
I will try to restart the entire cluster with the added changes to rule
out any follow-up errors, but maybe it's something obvious that a fellow
list user can spot.
Best regards,
Xaver
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-leave@lists.schedmd.com