[slurm-users] Job Step Resource Requests are Ignored

Maria Semple maria at rstudio.com
Tue May 5 23:47:12 UTC 2020


Hi!

I'd like to set different resource limits for different steps of my job. A
sample script might look like this (e.g. job.sh):

#!/bin/bash
srun --cpus-per-task=1 --mem=1 echo "Starting..."
srun --cpus-per-task=4 --mem=250 --exclusive <do something complicated>
srun --cpus-per-task=1 --mem=1 echo "Finished."

Then I would run the script from the command line using the following
command: sbatch --ntasks=1 job.sh. I have observed that while none of the
steps appear to have limited memory (which I'm pretty sure has to do with
my proctrack plugin type), the second step runs and scontrol show step
<id>.1 shows the step has having been allocated 4 CPUs, in reality the step
is only able to use 1.

I have also observed the opposite. Running the following command, I can see
that the job step is able to use all CPUs allocated to the job, rather than
the one it was allocated itself:

sbatch --ntasks=1 --cpus-per-task=2 << EOF
#!/bin/bash
srun --cpus-per-task=1 <do something complicated>
EOF

My goal here is to be able to run a single job with 3 steps where the first
and last step are always executed, even if the second would not be run
because too many resources were requested.

Here is my slurm.conf, with commented out lines removed (this is just a
small test cluster with a single node on the same machine as the
controller):

SlurmctldHost=ubuntu
CredType=cred/munge
AuthType=auth/munge
EnforcePartLimits=ALL
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=2
SlurmctldPidFile=/var/spool/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/spool/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm
SwitchType=switch/none
TaskPlugin=task/affinity
TaskPluginParam=Sched
InactiveLimit=0
KillWait=30
MinJobAge=3600
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompHost=localhost
JobCompLoc=slurm_db
JobCompPort=3306
JobCompType=jobcomp/mysql
JobCompUser=slurm
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=info
SlurmctldLogFile=/var/spool/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/spool/slurm/slurmd/slurmd.log
NodeName=ubuntu CPUs=4 RealMemory=500 State=UNKNOWN
PartitionName=main Nodes=ubuntu Default=YES MaxTime=INFINITE State=UP
AllowGroups=maria

Any advice would be greatly appreciated! Thanks in advance!

-- 
Thanks,
Maria
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20200505/51467988/attachment.htm>


More information about the slurm-users mailing list