[slurm-users] srun: job steps and generic resources

Brian W. Johanson bjohanso at psc.edu
Fri Dec 13 19:27:58 UTC 2019


If those sruns are wrapped in salloc, they work correctly. The first srun can
be eliminated by setting SallocDefaultCommand for salloc (disabled in this
example with --no-shell):

SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --gres=gpu:0 --mpi=none --pty $SHELL"

Note that the default step deliberately requests no memory and no GPUs
(--mem-per-cpu=0, --gres=gpu:0), so the interactive shell itself does not tie
up the allocation's gres and later job steps can claim it.



[user at login005 ~]$ salloc -p GPU --gres=gpu:p100:1 --no-shell
salloc: Good day
salloc: Pending job allocation 7052366
salloc: job 7052366 queued and waiting for resources
salloc: job 7052366 has been allocated resources
salloc: Granted job allocation 7052366
[user at login005 ~]$ srun --jobid 7052366 --gres=gpu:0 --pty bash
[user at gpu045 ~]$ nvidia-smi
No devices were found
[user at gpu045 ~]$ srun nvidia-smi
Fri Dec 13 14:19:45 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:87:00.0 Off |                    0 |
| N/A   31C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[user at gpu045 ~]$ exit
exit
[user at login005 ~]$ scancel 7052366
[user at login005 ~]$
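As a sanity check before launching steps, the gres held by an allocation can
be inspected from the login node with scontrol (a sketch; the exact field
name varies by Slurm version, e.g. TresPerNode or Gres):

[user at login005 ~]$ scontrol show job 7052366 | grep -iE 'tres|gres'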








On 12/13/19 11:48 AM, Kraus, Sebastian wrote:
> Dear Valantis,
> thanks for the explanation. But I have to correct you about the second alternative approach:
> srun -ppartition -N1 -n4 --gres=gpu:0 --time=00:30:00 --mem=1G -Jjobname \
>     --pty /bin/bash -il
> srun --gres=gpu:1 -l hostname
>
> Naturally, this does not work; the "inner" srun job step throws an error because a step cannot request more of a generic resource than the enclosing job allocation holds (here the job was allocated gpu:0):
> user at frontend02#-bash_4.2:~:[2]$ srun -pgpu -N1 -n4 --time=00:30:00 --mem=5G --gres=gpu:0 -Jjobname --pty /bin/bash -il
> user at gpu006#bash_4.2:~:[1]$  srun --gres=gpu:1 hostname
> srun: error: Unable to create step for job 18044554: Invalid generic resource (gres) specification
>
> Test it yourself. ;-)
>
> Best
> Sebastian
>
>
> Sebastian Kraus
> Team IT am Institut für Chemie
> Gebäude C, Straße des 17. Juni 115, Raum C7
>
> Technische Universität Berlin
> Fakultät II
> Institut für Chemie
> Sekretariat C3
> Straße des 17. Juni 135
> 10623 Berlin
>
>
> Tel.: +49 30 314 22263
> Fax: +49 30 314 29309
> Email: sebastian.kraus at tu-berlin.de
>
>
> ________________________________________
> From: Chrysovalantis Paschoulas <c.paschoulas at fz-juelich.de>
> Sent: Friday, December 13, 2019 13:05
> To: Kraus, Sebastian
> Subject: Re: [slurm-users] srun: job steps and generic resources
>
> Hi Sebastian,
>
> The first srun uses the gres you requested, and the second waits for it to
> become available again.
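> A quick way to observe this from a second shell (a sketch, not part of the
> original message; <jobid> is a placeholder):
> ```
> squeue -s -j <jobid>
> ```
> The pty shell itself runs as step 0 and holds the GPU, which is why the
> inner srun waits.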
>
> You have to do either
> ```
> srun -ppartition -N1 -n4 --gres=gpu:1 --time=00:30:00 --mem=1G -Jjobname \
>     --pty /bin/bash -il
>
> srun --gres=gpu:0 -l hostname
> ```
>
> or
> ```
> srun -ppartition -N1 -n4 --gres=gpu:0 --time=00:30:00 --mem=1G -Jjobname \
>     --pty /bin/bash -il
>
> srun --gres=gpu:1 -l hostname
> ```
>
> Best Regards,
> Valantis
>
>
> On 13.12.19 12:44, Kraus, Sebastian wrote:
>> Dear all,
>> I am facing the following nasty problem.
>> I usually start interactive batch jobs via:
>> srun -ppartition -N1 -n4 --time=00:30:00 --mem=1G -Jjobname --pty /bin/bash -il
>> Then, explicitly starting a job step within such a session via:
>> srun -l hostname
>> works fine.
>> But as soon as I add a generic resource to the job allocation, as with:
>> srun -ppartition -N1 -n4 --gres=gpu:1 --time=00:30:00 --mem=1G -Jjobname --pty /bin/bash -il
>> an explicit job step launched as above via:
>> srun -l hostname
>> stalls/blocks indefinitely.
>> I hope someone out there can explain this behavior to me.
>>
>> Thanks and best
>> Sebastian
>>
>>
>> Sebastian Kraus
>> Team IT am Institut für Chemie
>>
>> Technische Universität Berlin
>> Fakultät II
>> Institut für Chemie
>> Sekretariat C3
>> Straße des 17. Juni 135
>> 10623 Berlin
>>
>> Email: sebastian.kraus at tu-berlin.de



