[slurm-users] srun: job steps and generic resources
Kraus, Sebastian
sebastian.kraus at tu-berlin.de
Mon Dec 16 05:25:38 UTC 2019
Dear Brian,
thanks for the detailed explanation. Shame on me that I did not come across the relevant description of SallocDefaultCommand, which sits in the middle of the slurm.conf man page.
@slurm developers: Maybe it would be a good idea to link that paragraph directly from the salloc man page (https://slurm.schedmd.com/salloc.html) AND from the documentation about generic resources (https://slurm.schedmd.com/gres.html); that would greatly increase the chance that readers come across this valuable bit of information.
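
For reference, a minimal sketch of how that setting can look in slurm.conf (illustrative only; the exact srun options are site-specific, the line below simply mirrors the one from Brian's message):

# slurm.conf (sketch): give salloc a default interactive shell step that
# holds no GPU gres, so later job steps can still allocate the GPUs
SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --gres=gpu:0 --mpi=none --pty $SHELL"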
Best and thanks
Sebastian
Sebastian Kraus
Team IT am Institut für Chemie
Technische Universität Berlin
Fakultät II
Institut für Chemie
Sekretariat C3
Straße des 17. Juni 135
10623 Berlin
Email: sebastian.kraus at tu-berlin.de
________________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Brian W. Johanson <bjohanso at psc.edu>
Sent: Friday, December 13, 2019 20:27
To: slurm-users at lists.schedmd.com
Subject: Re: [slurm-users] srun: job steps and generic resources
If those sruns are wrapped in salloc, they work correctly. The first srun can be eliminated by adding SallocDefaultCommand for salloc (disabled in this example with --no-shell):

SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --gres=gpu:0 --mpi=none --pty $SHELL"
[user at login005 ~]$ salloc -p GPU --gres=gpu:p100:1 --no-shell
salloc: Good day
salloc: Pending job allocation 7052366
salloc: job 7052366 queued and waiting for resources
salloc: job 7052366 has been allocated resources
salloc: Granted job allocation 7052366
[user at login005 ~]$ srun --jobid 7052366 --gres=gpu:0 --pty bash
[user at gpu045 ~]$ nvidia-smi
No devices were found
[user at gpu045 ~]$ srun nvidia-smi
Fri Dec 13 14:19:45 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... On | 00000000:87:00.0 Off | 0 |
| N/A 31C P0 26W / 250W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
[user at gpu045 ~]$ exit
exit
[user at login005 ~]$ scancel 7052366
[user at login005 ~]$
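
(For comparison, a hypothetical session, assuming the SallocDefaultCommand above is configured in slurm.conf and salloc is invoked without --no-shell; hostnames and the job id are illustrative. salloc then drops you straight into a shell on the allocated node, and since that default shell step requested --gres=gpu:0, the GPU gres remains free for subsequent job steps:

[user at login005 ~]$ salloc -p GPU --gres=gpu:p100:1
salloc: Granted job allocation 7052400
[user at gpu045 ~]$ srun nvidia-smi
... GPU is visible inside the job step ...
[user at gpu045 ~]$ exit
salloc: Relinquishing job allocation 7052400
)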
On 12/13/19 11:48 AM, Kraus, Sebastian wrote:
> Dear Valantis,
> thanks for the explanation. But I have to correct you about the second alternative approach:
> srun -ppartition -N1 -n4 --gres=gpu:0 --time=00:30:00 --mem=1G -Jjobname
> --pty /bin/bash -il
> srun --gres=gpu:1 -l hostname
>
> Naturally, this does not work, and in consequence the "inner" srun job step throws an error about the generic resource not being available/allocatable:
> user at frontend02#-bash_4.2:~:[2]$ srun -pgpu -N1 -n4 --time=00:30:00 --mem=5G --gres=gpu:0 -Jjobname --pty /bin/bash -il
> user at gpu006#bash_4.2:~:[1]$ srun --gres=gpu:1 hostname
> srun: error: Unable to create step for job 18044554: Invalid generic resource (gres) specification
>
> Test it yourself. ;-)
>
> Best
> Sebastian
>
>
> Sebastian Kraus
> Team IT am Institut für Chemie
>
> Technische Universität Berlin
> Fakultät II
> Institut für Chemie
> Sekretariat C3
> Straße des 17. Juni 135
> 10623 Berlin
>
> Email: sebastian.kraus at tu-berlin.de
>
>
> ________________________________________
> From: Chrysovalantis Paschoulas <c.paschoulas at fz-juelich.de>
> Sent: Friday, December 13, 2019 13:05
> To: Kraus, Sebastian
> Subject: Re: [slurm-users] srun: job steps and generic resources
>
> Hi Sebastian,
>
> the first srun uses the gres you requested and the second waits for it
> to be available again.
>
> You have to do either
> ```
> srun -ppartition -N1 -n4 --gres=gpu:1 --time=00:30:00 --mem=1G -Jjobname
> --pty /bin/bash -il
>
> srun --gres=gpu:0 -l hostname
> ```
>
> or
> ```
> srun -ppartition -N1 -n4 --gres=gpu:0 --time=00:30:00 --mem=1G -Jjobname
> --pty /bin/bash -il
>
> srun --gres=gpu:1 -l hostname
> ```
>
> Best Regards,
> Valantis
>
>
> On 13.12.19 12:44, Kraus, Sebastian wrote:
>> Dear all,
>> I am facing the following nasty problem.
>> I usually start interactive batch jobs via:
>> srun -ppartition -N1 -n4 --time=00:30:00 --mem=1G -Jjobname --pty /bin/bash -il
>> Then, explicitly starting a job step within such a session via:
>> srun -l hostname
>> works fine.
>> But, as soon as I add a generic resource to the job allocation as with:
>> srun -ppartition -N1 -n4 --gres=gpu:1 --time=00:30:00 --mem=1G -Jjobname --pty /bin/bash -il
>> an explict job step lauched as above via:
>> srun -l hostname
>> infinitely stalls/blocks.
>> I hope someone out there is able to explain this behavior to me.
>>
>> Thanks and best
>> Sebastian
>>
>>
>> Sebastian Kraus
>> Team IT am Institut für Chemie
>>
>> Technische Universität Berlin
>> Fakultät II
>> Institut für Chemie
>> Sekretariat C3
>> Straße des 17. Juni 135
>> 10623 Berlin
>>
>> Email: sebastian.kraus at tu-berlin.de