Hi there, SLURM community,
I swear I've done this before, but now it's failing on a new cluster I'm deploying. We have 6 compute nodes with 64 CPUs each (384 CPUs total). When I run `srun -n 500 hostname`, the job just gets queued since there aren't 500 CPUs available.
Wasn't there an option that allows this to run so that the first 384 tasks execute and the remaining tasks execute as resources free up?
Here's my conf:
    # Slurm Cgroup Configs used on controllers and workers
    slurm_cgroup_config:
      CgroupAutomount: yes
      ConstrainCores: yes
      ConstrainRAMSpace: yes
      ConstrainSwapSpace: yes
      ConstrainDevices: yes

    # Slurm conf file settings
    slurm_config:
      AccountingStorageType: "accounting_storage/slurmdbd"
      AccountingStorageEnforce: "limits"
      AuthAltTypes: "auth/jwt"
      ClusterName: "cluster"
      AccountingStorageHost: "{{ hostvars[groups['controller'][0]].ansible_hostname }}"
      DefMemPerCPU: 1024
      InactiveLimit: 120
      JobAcctGatherType: "jobacct_gather/cgroup"
      JobCompType: "jobcomp/none"
      MailProg: "/usr/bin/mail"
      MaxArraySize: 40000
      MaxJobCount: 100000
      MinJobAge: 3600
      ProctrackType: "proctrack/cgroup"
      ReturnToService: 2
      SelectType: "select/cons_tres"
      SelectTypeParameters: "CR_Core_Memory"
      SlurmctldTimeout: 30
      SlurmctldLogFile: "/var/log/slurm/slurmctld.log"
      SlurmdLogFile: "/var/log/slurm/slurmd.log"
      SlurmdSpoolDir: "/var/spool/slurm/d"
      SlurmUser: "{{ slurm_user.name }}"
      SrunPortRange: "60000-61000"
      StateSaveLocation: "/var/spool/slurm/ctld"
      TaskPlugin: "task/affinity,task/cgroup"
      UnkillableStepTimeout: 120
Following up on this in case anyone can provide some insight, please.
-- Thanks,
Daniel Healy
IIUC you can't do that.
You either allow overcommit or you split your job into multiple, smaller jobs that fit.
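Roughly, the overcommit route would be srun's -O/--overcommit flag; a minimal sketch, assuming your 6 nodes (note that this starts all 500 tasks at once, sharing the 384 cores, rather than holding the extra tasks back until cores free up):

    # Allocate the 6 nodes and allow more tasks than allocated CPUs (-O / --overcommit).
    srun -N 6 -n 500 --overcommit hostname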
The resources you're requesting must be available at the same time: if each piece of your job needs 2 CPUs and you want them to run in parallel, just use a job array. If you request 500 CPUs, it means your job cannot run with just 384.
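A minimal sketch of the array route (the script name and the `hostname` payload are placeholders for your real per-element work): each element is an independent one-CPU job, so Slurm starts as many as fit in your 384 cores and launches the rest as cores free up, which is the behaviour you described. 500 elements is well under your MaxArraySize of 40000.

    #!/bin/bash
    # Each array element is an independent 1-CPU job.
    #SBATCH --job-name=hostname-array
    #SBATCH --ntasks=1
    # 500 elements; cap concurrency if desired, e.g. --array=1-500%100
    #SBATCH --array=1-500

    # $SLURM_ARRAY_TASK_ID identifies the element; replace hostname with the real work.
    srun hostname

Submit it with sbatch (e.g. `sbatch hostname_array.sh`) rather than running srun directly.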
Diego