[slurm-users] [EXT] job_submit.lua - choice of error on failure / job_desc.gpus?
Sean Crosby
scrosby at unimelb.edu.au
Tue Dec 8 02:46:43 UTC 2020
Hi Loris,
We have a completely separate test system, with a few worker nodes and its
own slurmctld/slurmdbd, so we can test Slurm upgrades etc.
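If you just want to iterate on the Lua logic itself before it ever touches a
cluster, you can also stub out the bits of the plugin environment the script
uses and drive it with plain lua. Rough sketch only; the stubbed slurm table
and the fake job_desc below are assumptions for offline testing, not the real
plugin API (which passes userdata, not plain Lua tables):

-- test_job_submit.lua: offline harness (stubbed slurm table and fake
-- job_desc are assumptions; adjust to whatever fields your filter reads)
slurm = {
   NO_VAL   = 4294967294,   -- how unset numeric fields appear to the script
   SUCCESS  = 0,
   ERROR    = -1,
   log_user = function(...) print("log_user: " .. string.format(...)) end,
   log_info = function(...) print("log_info: " .. string.format(...)) end,
   user_msg = function(...) print("user_msg: " .. string.format(...)) end,
}

dofile("job_submit.lua")    -- your filter; defines slurm_job_submit()

-- a job that should be rejected: GPU partition, no GRES/GPU request
local job_desc = {
   partition = "gpgpu",
   gres      = nil,
   num_tasks = slurm.NO_VAL,
}

print("rc = " .. tostring(slurm_job_submit(job_desc, {}, 1000)))

It obviously doesn't replace a real test cluster, but it catches the silly
Lua errors before they reach slurmctld.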
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Mon, 7 Dec 2020 at 19:01, Loris Bennett <loris.bennett at fu-berlin.de>
wrote:
>
> Hi Sean,
>
> Thanks for the code - looks like you have put a lot more thought into it
> than I have into mine. I'll certainly have to look at handling the
> 'tres-per-*' options.
>
> By the way, how do you do your testing?  As I don't have a test
> cluster, I'm currently doing "open heart" testing, but I really need a
> minimal test cluster, maybe using VMs.
>
> Cheers,
>
> Loris
>
> Sean Crosby <scrosby at unimelb.edu.au> writes:
>
> > Hi Loris,
> >
> > This is our submit filter for what you're asking. It checks for both
> > --gres and --gpus:
> >
> > ESLURM_INVALID_GRES = 2072
> > ESLURM_BAD_TASK_COUNT = 2025
> >
> > if (job_desc.partition ~= slurm.NO_VAL) then
> >    if (job_desc.partition ~= nil) then
> >       if (string.match(job_desc.partition, "gpgpu") or
> >           string.match(job_desc.partition, "gpgputest")) then
> >          --slurm.log_info("slurm_job_submit (lua): detect job for gpgpu partition")
> >          --Alert on invalid gpu count - eg: gpu:0, gpu:p100:0
> >          if (job_desc.gres and string.find(job_desc.gres, "gpu")) then
> >             local numgpu = string.match(job_desc.gres, ":%d+$")
> >             if (numgpu ~= nil) then
> >                numgpu = numgpu:gsub(':', '')
> >                if (tonumber(numgpu) < 1) then
> >                   slurm.log_user("Invalid GPGPU count specified in GRES, must be greater than 0")
> >                   return ESLURM_INVALID_GRES
> >                end
> >             end
> >          else
> >             --Alternative: use the --gpus options in newer versions of Slurm
> >             if (job_desc.tres_per_node == nil) then
> >                if (job_desc.tres_per_socket == nil) then
> >                   if (job_desc.tres_per_task == nil) then
> >                      slurm.log_user("You tried submitting to a GPGPU partition, but you didn't request one with GRES or GPUS")
> >                      return ESLURM_INVALID_GRES
> >                   else
> >                      if (job_desc.num_tasks == slurm.NO_VAL) then
> >                         slurm.user_msg("--gpus-per-task option requires --tasks specification")
> >                         return ESLURM_BAD_TASK_COUNT
> >                      end
> >                   end
> >                end
> >             end
> >          end
> >       end
> >    end
> > end
> >
> > Please let me know if you improve it. We're always on the hunt to fix up
> > some of the logic in the submit filter.
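> >
> > Roughly what we have in mind for the next pass is collapsing those nested
> > tres checks into one helper. Untested sketch; it assumes your release also
> > exposes job_desc.tres_per_job, which is where a plain --gpus request should
> > land:
> >
> > local function requests_gpu(job_desc)
> >    -- gres covers --gres=gpu:...; the tres_per_* fields are where --gpus,
> >    -- --gpus-per-node, --gpus-per-socket and --gpus-per-task end up
> >    local fields = { "gres", "tres_per_job", "tres_per_node",
> >                     "tres_per_socket", "tres_per_task" }
> >    for _, name in ipairs(fields) do
> >       local v = job_desc[name]
> >       if (v ~= nil and string.find(v, "gpu")) then
> >          return true
> >       end
> >    end
> >    return false
> > end
> >
> > The gpgpu branch would then reduce to "if (not requests_gpu(job_desc)) then
> > ... return ESLURM_INVALID_GRES end", keeping the existing --gpus-per-task /
> > num_tasks check on top.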
> >
> > Cheers,
> > Sean
> >
> > --
> > Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
> > Research Computing Services | Business Services
> > The University of Melbourne, Victoria 3010 Australia
> >
> > On Fri, 4 Dec 2020 at 23:58, Loris Bennett <loris.bennett at fu-berlin.de>
> wrote:
> >
> > Hi,
> >
> > I want to reject jobs that don't specify any GPUs when accessing our GPU
> > partition and have the following in job_submit.lua:
> >
> >   if (job_desc.partition == "gpu" and job_desc.gres == nil ) then
> >      slurm.log_user(string.format("Please request GPU resources in the partition 'gpu', " ..
> >                                   "e.g. '#SBATCH --gres=gpu:1' " ..
> >                                   "Please see 'man sbatch' for more details)"))
> >      slurm.log_info(string.format("check_parameters: user '%s' did not request GPUs in partition 'gpu'",
> >                                   username))
> >      return slurm.ERROR
> >   end
> >
> > If GRES is not given for the GPU partition, this produces
> >
> > sbatch: error: Please request GPU resources in the partition 'gpu', e.g. '#SBATCH --gres=gpu:1' Please see 'man sbatch' for more details)
> > sbatch: error: Batch job submission failed: Unspecified error
> >
> > My questions are:
> >
> > 1. Is there a better error to return? The 'slurm.ERROR' produces the
> > generic second error line above (slurm_errno.h just seems to have
> > ESLURM_MISSING_TIME_LIMIT and ESLURM_INVALID_KNL as errors a plugin
> > might raise). This is misleading, since the error is in fact known
> > and specific.
> > 2. Am I right in thinking that 'job_desc' does not, as of 20.02.06, have
> > a 'gpus' field corresponding to the sbatch/srun option '--gpus'?
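> >
> > Regarding (1), my untested guess is that returning one of the numeric
> > slurm_errno values directly would make sbatch report a specific error
> > instead of "Unspecified error", along these lines (the 2072 value is an
> > assumption I would still need to check against slurm_errno.h):
> >
> > local ESLURM_INVALID_GRES = 2072   -- to be verified against slurm_errno.h
> >
> > if (job_desc.partition == "gpu" and job_desc.gres == nil) then
> >    slurm.log_user("Please request GPU resources in the partition 'gpu', " ..
> >                   "e.g. '#SBATCH --gres=gpu:1'. Please see 'man sbatch'.")
> >    return ESLURM_INVALID_GRES
> > end
> >
> > but I have not found this documented anywhere, hence the question.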
> >
> > Cheers,
> >
> > Loris
> >
> > --
> > Dr. Loris Bennett (Hr./Mr.)
> > ZEDAT, Freie Universität Berlin       Email loris.bennett at fu-berlin.de
> >
> --
> Dr. Loris Bennett (Hr./Mr.)
> ZEDAT, Freie Universität Berlin Email loris.bennett at fu-berlin.de
>
>