[slurm-users] [EXT] job_submit.lua - choice of error on failure / job_desc.gpus?

Sean Crosby scrosby at unimelb.edu.au
Fri Dec 4 19:28:20 UTC 2020


Hi Loris,

This is the part of our submit filter that does what you're asking. It
checks for both --gres and --gpus:

  -- Slurm error numbers (see slurm_errno.h)
  ESLURM_INVALID_GRES=2072
  ESLURM_BAD_TASK_COUNT=2025
  if ( job_desc.partition ~= slurm.NO_VAL ) then
    if (job_desc.partition ~= nil) then
      if (string.match(job_desc.partition,"gpgpu") or string.match(job_desc.partition,"gpgputest")) then
        --slurm.log_info("slurm_job_submit (lua): detect job for gpgpu partition")
        --Alert on invalid gpu count - eg: gpu:0 , gpu:p100:0
        if (job_desc.gres and string.find(job_desc.gres, "gpu")) then
          -- Trailing ":<digits>" is the requested count, if one was given
          local numgpu = string.match(job_desc.gres, ":%d+$")
          if (numgpu ~= nil) then
            numgpu = numgpu:gsub(':', '')
            if (tonumber(numgpu) < 1) then
              slurm.log_user("Invalid GPGPU count specified in GRES, must be greater than 0")
              return ESLURM_INVALID_GRES
            end
          end
        else
          --Alternative use gpus in new version of slurm
          if (job_desc.tres_per_node == nil) then
            if (job_desc.tres_per_socket == nil) then
              if (job_desc.tres_per_task == nil) then
                slurm.log_user("You tried submitting to a GPGPU partition, but you didn't request one with GRES or GPUS")
                return ESLURM_INVALID_GRES
              else
                if (job_desc.num_tasks == slurm.NO_VAL) then
                  slurm.user_msg("--gpus-per-task option requires --tasks specification")
                  return ESLURM_BAD_TASK_COUNT
                end
              end
            end
          end
        end
      end
    end
  end

Please let me know if you improve it. We're always on the hunt to fix up
some of the logic in the submit filter.
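
As one possible starting point, the nested checks could probably be
flattened with early returns. The following is only a rough sketch and has
not been run inside slurmctld. The names gres_trailing_count and
check_gpu_request, and the no_val parameter (standing in for slurm.NO_VAL),
are made up for illustration; the messages and error numbers are the ones
from the filter above.

```lua
-- Error numbers copied from the filter above.
ESLURM_INVALID_GRES = 2072
ESLURM_BAD_TASK_COUNT = 2025

-- Hypothetical helper: trailing count of a GRES string, or nil if absent.
function gres_trailing_count(gres)
  -- Capture only the digits after the last colon.
  local digits = string.match(gres, ":(%d+)$")
  if digits == nil then
    return nil
  end
  return tonumber(digits)
end

-- Hypothetical flattened check. Returns an error number plus a message for
-- the user, or nil when the request is accepted.
-- no_val is an assumption: pass slurm.NO_VAL when calling from the plugin.
function check_gpu_request(job_desc, no_val)
  local partition = job_desc.partition
  if partition == nil or partition == no_val then
    return nil
  end
  -- Plain substring test; "gpgpu" also covers "gpgputest".
  if not string.find(partition, "gpgpu", 1, true) then
    return nil
  end

  -- --gres style request: only reject an explicit count below 1.
  if job_desc.gres and string.find(job_desc.gres, "gpu", 1, true) then
    local count = gres_trailing_count(job_desc.gres)
    if count ~= nil and count < 1 then
      return ESLURM_INVALID_GRES,
             "Invalid GPGPU count specified in GRES, must be greater than 0"
    end
    return nil
  end

  -- Per-node or per-socket requests are accepted as they are.
  if job_desc.tres_per_node ~= nil or job_desc.tres_per_socket ~= nil then
    return nil
  end
  if job_desc.tres_per_task == nil then
    return ESLURM_INVALID_GRES,
           "You tried submitting to a GPGPU partition, but you didn't request one with GRES or GPUS"
  end
  if job_desc.num_tasks == no_val then
    return ESLURM_BAD_TASK_COUNT,
           "--gpus-per-task option requires --tasks specification"
  end
  return nil
end

-- Intended use inside slurm_job_submit (not executed here):
--   local rc, msg = check_gpu_request(job_desc, slurm.NO_VAL)
--   if rc then slurm.log_user(msg); return rc end
```

Keeping the decision in a function that only reads job_desc means it can
be exercised with plain Lua tables outside of Slurm, which is the main
reason for splitting it out this way.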

Cheers,
Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia



On Fri, 4 Dec 2020 at 23:58, Loris Bennett <loris.bennett at fu-berlin.de>
wrote:

> Hi,
>
> I want to reject jobs that don't specify any GPUs when accessing our GPU
> partition and have the following in job_submit.lua:
>
>   if (job_desc.partition == "gpu" and job_desc.gres == nil ) then
>      slurm.log_user(string.format("Please request GPU resources in the partition 'gpu', " ..
>                                      "e.g. '#SBATCH --gres=gpu:1' " ..
>                                      "Please see 'man sbatch' for more details)"))
>      slurm.log_info(string.format("check_parameters: user '%s' did not request GPUs in partition 'gpu'",
>                                   username))
>      return slurm.ERROR
>   end
>
> If GRES is not given for the GPU partition, this produces
>
>   sbatch: error: Please request GPU resources in the partition 'gpu', e.g. '#SBATCH --gres=gpu:1' Please see 'man sbatch' for more details)
>   sbatch: error: Batch job submission failed: Unspecified error
>
> My questions are:
>
> 1. Is there a better error to return?  The 'slurm.ERROR' produces the
>    generic second error line above (slurm_errno.h just seems to have
>    ESLURM_MISSING_TIME_LIMIT and ESLURM_INVALID_KNL as errors a plugin
>    might raise).  This is misleading, since the error is in fact known
>    and specific.
> 2. Am I right in thinking that 'job_desc' does not, as of 20.02.06, have
>    a 'gpus' field corresponding to the sbatch/srun option '--gpus'?
>
> Cheers,
>
> Loris
>
> --
> Dr. Loris Bennett (Hr./Mr.)
> ZEDAT, Freie Universität Berlin         Email loris.bennett at fu-berlin.de
>
>