[slurm-users] gres with docker problem
Marcus Wagner
wagner at itc.rwth-aachen.de
Mon Jan 7 07:11:55 MST 2019
But that means the Docker container runs outside the cgroup of the
Slurm job. Thus there is no restriction on the container, so it can
use all resources!
If, e.g., one badly configured job requests one GPU but uses all four,
the following jobs on the node will all crash, because they cannot find
a free GPU. What about the restriction to cores, can the container use
all cores? What about memory cgroups, can every container then use all
the memory of the host?
If this is the case, then in my opinion Docker cannot be used on shared
systems, but only on exclusive nodes.
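For what it's worth, if one really had to run Docker under Slurm on a
shared node, the job would at least have to forward its own limits into
the container by hand. A rough sketch (the cgroup paths assume cgroup v1
and the usual slurm/uid_*/job_* hierarchy, and the image name is just a
placeholder), e.g. in the job script:

# cpuset and memory limit that Slurm's task/cgroup plugin set for this job
CG=/sys/fs/cgroup
JOBCG=slurm/uid_$(id -u)/job_${SLURM_JOB_ID}
CPUS=$(cat $CG/cpuset/$JOBCG/cpuset.cpus)
MEM=$(cat $CG/memory/$JOBCG/memory.limit_in_bytes)

# pass the limits on to the container, which otherwise runs unconstrained
# in Docker's own cgroup
docker run --rm --cpuset-cpus="$CPUS" --memory="$MEM" my_image my_command

That still only covers cores and memory, not the GPU devices.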
Best
Marcus
On 01/07/2019 05:26 AM, 허웅 wrote:
>
> I agree with Chris's opinion.
>
> I was able to figure out the reason.
>
> As Chris said, the problem is the cgroup.
>
> When I submit a job to Slurm that requests 1 gres:gpu, Slurm assigns
> the job to a node that has enough resources.
>
> When Slurm assigns a job to the node, it first creates a cgroup
> environment and then provides the resource information to the node.
>
> But the problem is that Docker uses its own cgroup configuration.
>
> That's why I could only get the right information on the Slurm side,
> not on the Docker side.
>
> Here is my workaround for getting the right information on the Docker
> side:
>
> scontrol show job=$SLURM_JOBID --details | grep GRES_IDX | awk -F "IDX:" '{print $2}' | awk -F ")" '{print $1}'
> scontrol show job with the --details option reports GRES_IDX, the
> indices of the GRES actually allocated to the job.
> So I've used this information in my application.
> Please refer to this command if anyone else is suffering from this problem.
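> For illustration, the index can then be handed to the container via
> the NVIDIA container runtime (a sketch; the image name is a
> placeholder, and this assumes nvidia-docker / nvidia-container-runtime
> is installed; a multi-GPU range such as 0-1 would need expanding into
> a comma-separated list):
>
> GPU_IDX=$(scontrol show job=$SLURM_JOBID --details | grep GRES_IDX | awk -F "IDX:" '{print $2}' | awk -F ")" '{print $1}')
> docker run --rm -e NVIDIA_VISIBLE_DEVICES=$GPU_IDX my_gpu_image nvidia-smi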
>
> -----Original Message-----
> *From:* "Chris Samuel"<chris at csamuel.org>
> *To:* <slurm-users at lists.schedmd.com>;
> *Cc:*
> *Sent:* 2019-01-07 (Mon) 11:59:09
> *Subject:* Re: [slurm-users] gres with docker problem
>
> On 4/1/19 5:48 am, Marcin Stolarek wrote:
>
> > I think that the main reason is the lack of access to some /dev "files"
> > in your docker container. For Singularity the nvidia plugin is required,
> > maybe there is something similar for docker...
>
> That's unlikely; the problem isn't that nvidia-smi isn't working in
> Docker because of a lack of device files, the problem is that it's
> seeing all 4 GPUs and thus is no longer being controlled by the device
> cgroup that Slurm is creating.
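> (A quick way to see this, assuming the usual cgroup v1 layout and the
> NVIDIA container runtime, is to compare the job's device cgroup with
> what a container reports, e.g.:
>
> cat /sys/fs/cgroup/devices/slurm/uid_$(id -u)/job_${SLURM_JOB_ID}/devices.list
> nvidia-smi -L                               # on the host, inside the job
> docker run --rm nvidia/cuda nvidia-smi -L   # inside a container
>
> The first two show only the allocated GPU; the last shows all of them.
> The exact cgroup path depends on the local configuration.)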
>
> --
> Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
>
--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de