[slurm-users] srun: job steps and generic resources

Fri Dec 13 14:37:35 UTC 2019

The gres resource is allocated to the first srun, so the second srun waits for 
the gres allocation to become free.

If you were to replace that second srun with 'srun -l --gres=gpu:0 hostname', it 
would complete, but it would not have access to the GPUs.

You can use salloc instead of the srun to create the allocation and then issue 
'srun --gres=gpu:0 --pty bash'. The second srun will not hang, as the gres 
resource is available.
But you will not have access to the GPUs within your shell, as they are not 
allocated to that srun instance.
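
A rough sketch of that salloc variant, using the placeholder options from the 
quoted post. The 'run' helper and the DRY_RUN switch are made up for 
illustration: they only print the command unless DRY_RUN=0 is set. Passing the 
srun as the command argument of salloc is an assumption too; you could equally 
start salloc first and type the srun at its prompt.

```shell
#!/bin/sh
# Sketch only. Partition name, job name and limits are placeholders
# copied from the quoted post, not real values.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# salloc creates the allocation that holds the GPU; the interactive shell
# is started as a step that requests no GPU (--gres=gpu:0).
run salloc -ppartition -N1 -n4 --gres=gpu:1 --time=00:30:00 --mem=1G -Jjobname \
    srun --gres=gpu:0 --pty /bin/bash -il
```

With the default DRY_RUN=1 this only prints the command line, so it can be 
checked before it is run for real.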

A workaround is to 'export SLURM_GRES=gpu:0' in the shell where the srun is 
hanging.
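
One way to apply that workaround to a single step, instead of to the whole 
shell, might look like the following. 'run_step' is a hypothetical helper name, 
not a Slurm command; it relies on srun reading SLURM_GRES as its default for 
--gres, which is what the export depends on as well.

```shell
#!/bin/sh
# Sketch only; meant to be used inside the interactive shell that was
# started by the first srun.

# Hypothetical helper: run one command with SLURM_GRES=gpu:0 in its
# environment, leaving the environment of the calling shell untouched.
run_step() {
    SLURM_GRES=gpu:0 "$@"
}

# Demonstration without Slurm: show what the child process sees.
run_step sh -c 'echo "SLURM_GRES=$SLURM_GRES"'
```

Inside the session, 'run_step srun -l hostname' would then start the step 
without requesting a GPU, so that step will not have access to the GPUs.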

-b

On 12/13/19 6:44 AM, Kraus, Sebastian wrote:
> Dear all,
> I am facing the following nasty problem.
> I usually start interactive batch jobs via:
> srun -ppartition -N1 -n4 --time=00:30:00 --mem=1G -Jjobname --pty /bin/bash -il
> Then, explicitly starting a job step within such a session via:
> srun -l hostname
> works fine.
> But as soon as I add a generic resource to the job allocation, as with:
> srun -ppartition -N1 -n4 --gres=gpu:1 --time=00:30:00 --mem=1G -Jjobname --pty /bin/bash -il
> an explicit job step launched as above via:
> srun -l hostname
> stalls/blocks indefinitely.
> I hope someone out there can explain this behavior to me.
>
> Thanks and best
> Sebastian
>
>
> Sebastian Kraus
> Team IT am Institut für Chemie
>
> Technische Universität Berlin
> Fakultät II
> Institut für Chemie
> Sekretariat C3
> Straße des 17. Juni 135
> 10623 Berlin
>
> Email: sebastian.kraus at tu-berlin.de