Following up on this in case anyone can provide some insight, please.

On Thu, May 16, 2024 at 8:32 AM Dan Healy <daniel.t.healy@gmail.com> wrote:
Hi there, SLURM community,

I swear I've done this before, but now it's failing on a new cluster I'm deploying. We have 6 compute nodes with 64 CPUs each (384 CPUs total). When I run `srun -n 500 hostname`, the job gets queued since there aren't 500 available CPUs.
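
In case it's useful, here's how I'm sanity-checking the node and CPU counts before submitting (sinfo's %D, %c, and %C fields are node count, CPUs per node, and CPU states):

  # should report 6 nodes with 64 CPUs each (384 total)
  sinfo -o "%D %c %C"

  # this just sits in the queue instead of starting partially
  srun -n 500 hostname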

Wasn't there an option that allows this to run so that the first 384 tasks execute right away and the remaining tasks execute as resources free up?
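
To illustrate the behavior I mean, the closest thing I can picture is the array throttle syntax, though that's separate sbatch jobs rather than tasks inside a single srun, so I may be misremembering (the indices and wrapped command below are just for illustration):

  # 500 array tasks, at most 384 running at any one time
  sbatch --array=0-499%384 --wrap="hostname"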

Here's my conf:

# Slurm Cgroup Configs used on controllers and workers
slurm_cgroup_config:
  CgroupAutomount: yes
  ConstrainCores: yes
  ConstrainRAMSpace: yes
  ConstrainSwapSpace: yes
  ConstrainDevices: yes

# Slurm conf file settings
slurm_config:
  AccountingStorageType: "accounting_storage/slurmdbd"
  AccountingStorageEnforce: "limits"
  AuthAltTypes: "auth/jwt"
  ClusterName: "cluster"
  AccountingStorageHost: "{{ hostvars[groups['controller'][0]].ansible_hostname }}"
  DefMemPerCPU: 1024
  InactiveLimit: 120
  JobAcctGatherType: "jobacct_gather/cgroup"
  JobCompType: "jobcomp/none"
  MailProg: "/usr/bin/mail"
  MaxArraySize: 40000
  MaxJobCount: 100000
  MinJobAge: 3600
  ProctrackType: "proctrack/cgroup"
  ReturnToService: 2
  SelectType: "select/cons_tres"
  SelectTypeParameters: "CR_Core_Memory"
  SlurmctldTimeout: 30
  SlurmctldLogFile: "/var/log/slurm/slurmctld.log"
  SlurmdLogFile: "/var/log/slurm/slurmd.log"
  SlurmdSpoolDir: "/var/spool/slurm/d"
  SlurmUser: "{{ slurm_user.name }}"
  SrunPortRange: "60000-61000"
  StateSaveLocation: "/var/spool/slurm/ctld"
  TaskPlugin: "task/affinity,task/cgroup"
  UnkillableStepTimeout: 120
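
For completeness, the node and partition definitions aren't in the vars above; the rendered slurm.conf has lines roughly like these (the node and partition names are placeholders):

  NodeName=node[01-06] CPUs=64 State=UNKNOWN
  PartitionName=main Nodes=node[01-06] Default=YES MaxTime=INFINITE State=UP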

--
Thanks,

Daniel Healy

