[slurm-users] [EXT] job_submit.lua - choice of error on failure / job_desc.gpus?
Loris Bennett
loris.bennett at fu-berlin.de
Mon Dec 7 08:01:05 UTC 2020
Hi Sean,
Thanks for the code - looks like you have put a lot more thought into it
than I have into mine. I'll certainly have to look at handling the
'tres-per-*' options.
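As a first stab, and assuming the job_desc field names simply mirror
the command-line options (in particular that '--gpus' ends up in
'tres_per_job', which I haven't verified on 20.02), I am thinking of a
helper along these lines:

  -- Untested sketch: true if the job requests GPUs via --gres or any
  -- of the --gpus* options.  The tres_per_* field names, especially
  -- tres_per_job, are my assumption rather than a verified 20.02 API.
  local function requests_gpus(job_desc)
     if job_desc.gres and string.find(job_desc.gres, "gpu") then
        return true
     end
     for _, field in ipairs({ "tres_per_job", "tres_per_node",
                              "tres_per_socket", "tres_per_task" }) do
        local tres = job_desc[field]
        if tres and string.find(tres, "gpu") then
           return true
        end
     end
     return false
  end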
By the way, how do you do your testing? As I don't have a test
cluster, I'm currently doing "open heart" testing, but I really need a
minimal test cluster, maybe using VMs.
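One idea I want to try is unit-testing the filter logic outside Slurm
altogether: stub out the 'slurm' table, load job_submit.lua, and call
slurm_job_submit() with a hand-built job_desc table. A rough sketch
(the stub values and the fake job_desc are of course made up):

  -- test_job_submit.lua: run with 'lua test_job_submit.lua'.
  -- Stub just enough of the 'slurm' table for the filter to load.
  slurm = {
     SUCCESS = 0,
     ERROR   = -1,
     NO_VAL  = 4294967294,
     log_user = function(msg, ...) print("user: " .. string.format(msg, ...)) end,
     log_info = function(msg, ...) print("info: " .. string.format(msg, ...)) end,
  }

  dofile("job_submit.lua")  -- defines slurm_job_submit()

  -- Fake job: GPU partition, but no GPUs requested.
  local job_desc = { partition = "gpu", user_name = "alice" }
  local rc = slurm_job_submit(job_desc, nil, 1000)
  assert(rc ~= slurm.SUCCESS, "job without GPUs should have been rejected")
  print("ok")

This obviously cannot catch problems with the real job_desc userdata,
but it would at least exercise the pure-Lua logic without a cluster.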
Cheers,
Loris
Sean Crosby <scrosby at unimelb.edu.au> writes:
> Hi Loris,
>
> This is our submit filter for what you're asking. It checks for both --gres and --gpus
>
> ESLURM_INVALID_GRES = 2072
> ESLURM_BAD_TASK_COUNT = 2025
>
> if (job_desc.partition ~= slurm.NO_VAL) then
>    if (job_desc.partition ~= nil) then
>       if (string.match(job_desc.partition, "gpgpu") or string.match(job_desc.partition, "gpgputest")) then
>          --slurm.log_info("slurm_job_submit (lua): detect job for gpgpu partition")
>          -- Alert on invalid gpu count, e.g. gpu:0, gpu:p100:0
>          if (job_desc.gres and string.find(job_desc.gres, "gpu")) then
>             local numgpu = string.match(job_desc.gres, ":%d+$")
>             if (numgpu ~= nil) then
>                numgpu = numgpu:gsub(':', '')
>                if (tonumber(numgpu) < 1) then
>                   slurm.log_user("Invalid GPGPU count specified in GRES, must be greater than 0")
>                   return ESLURM_INVALID_GRES
>                end
>             end
>          else
>             -- Alternative: use the --gpus* options in newer versions of Slurm
>             if (job_desc.tres_per_node == nil) then
>                if (job_desc.tres_per_socket == nil) then
>                   if (job_desc.tres_per_task == nil) then
>                      slurm.log_user("You tried submitting to a GPGPU partition, but you didn't request one with GRES or GPUS")
>                      return ESLURM_INVALID_GRES
>                   else
>                      if (job_desc.num_tasks == slurm.NO_VAL) then
>                         slurm.user_msg("--gpus-per-task option requires --tasks specification")
>                         return ESLURM_BAD_TASK_COUNT
>                      end
>                   end
>                end
>             end
>          end
>       end
>    end
> end
>
> Let me know if you improve it, please? We're always on the hunt to fix up some of the logic in the submit filter.
>
> Cheers,
> Sean
>
> --
> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
> Research Computing Services | Business Services
> The University of Melbourne, Victoria 3010 Australia
>
> On Fri, 4 Dec 2020 at 23:58, Loris Bennett <loris.bennett at fu-berlin.de> wrote:
>
> Hi,
>
> I want to reject jobs that don't specify any GPUs when accessing our GPU
> partition and have the following in job_submit.lua:
>
> if (job_desc.partition == "gpu" and job_desc.gres == nil) then
>    slurm.log_user(string.format("Please request GPU resources in the partition 'gpu', " ..
>                                 "e.g. '#SBATCH --gres=gpu:1'. " ..
>                                 "Please see 'man sbatch' for more details."))
>    slurm.log_info(string.format("check_parameters: user '%s' did not request GPUs in partition 'gpu'",
>                                 username))
>    return slurm.ERROR
> end
>
> If GRES is not given for the GPU partition, this produces
>
> sbatch: error: Please request GPU resources in the partition 'gpu', e.g. '#SBATCH --gres=gpu:1'. Please see 'man sbatch' for more details.
> sbatch: error: Batch job submission failed: Unspecified error
>
> My questions are:
>
> 1. Is there a better error to return? The 'slurm.ERROR' produces the
> generic second error line above (slurm_errno.h just seems to have
> ESLURM_MISSING_TIME_LIMIT and ESLURM_INVALID_KNL as errors a plugin
> might raise). This is misleading, since the error is in fact known
> and specific.
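>
>    Presumably one could instead return one of the numeric ESLURM_*
>    values from slurm_errno.h directly, so that sbatch prints that
>    code's specific message rather than 'Unspecified error', e.g.
>    (my guess, untested):
>
>        -- Guess, untested: 2072 is the value of ESLURM_INVALID_GRES in
>        -- slurm_errno.h; sbatch should then report that code's own
>        -- message ("Invalid generic resource (gres) specification").
>        local ESLURM_INVALID_GRES = 2072
>        return ESLURM_INVALID_GRES
>
>    But it is not clear to me which codes a plugin may legitimately
>    return.
>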
> 2. Am I right in thinking that 'job_desc' does not, as of 20.02.06, have
> a 'gpus' field corresponding to the sbatch/srun option '--gpus'?
>
> Cheers,
>
> Loris
>
> --
> Dr. Loris Bennett (Hr./Mr.)
> ZEDAT, Freie Universität Berlin Email loris.bennett at fu-berlin.de
>
--
Dr. Loris Bennett (Hr./Mr.)
ZEDAT, Freie Universität Berlin Email loris.bennett at fu-berlin.de