[slurm-users] [EXT] job_submit.lua - choice of error on failure / job_desc.gpus?
Sean Crosby
scrosby at unimelb.edu.au
Fri Dec 4 19:28:20 UTC 2020
Hi Loris,
This is our submit filter for what you're asking about. It checks for both
--gres and --gpus:
ESLURM_INVALID_GRES=2072
ESLURM_BAD_TASK_COUNT=2025

if (job_desc.partition ~= slurm.NO_VAL) then
  if (job_desc.partition ~= nil) then
    if (string.match(job_desc.partition, "gpgpu") or
        string.match(job_desc.partition, "gpgputest")) then
      --slurm.log_info("slurm_job_submit (lua): detect job for gpgpu partition")
      --Alert on invalid gpu count - eg: gpu:0 , gpu:p100:0
      if (job_desc.gres and string.find(job_desc.gres, "gpu")) then
        local numgpu = string.match(job_desc.gres, ":%d+$")
        if (numgpu ~= nil) then
          numgpu = numgpu:gsub(':', '')
          if (tonumber(numgpu) < 1) then
            slurm.log_user("Invalid GPGPU count specified in GRES, must be greater than 0")
            return ESLURM_INVALID_GRES
          end
        end
      else
        --Alternative use gpus in new version of slurm
        if (job_desc.tres_per_node == nil) then
          if (job_desc.tres_per_socket == nil) then
            if (job_desc.tres_per_task == nil) then
              slurm.log_user("You tried submitting to a GPGPU partition, but you didn't request one with GRES or GPUS")
              return ESLURM_INVALID_GRES
            else
              if (job_desc.num_tasks == slurm.NO_VAL) then
                slurm.user_msg("--gpus-per-task option requires --tasks specification")
                return ESLURM_BAD_TASK_COUNT
              end
            end
          end
        end
      end
    end
  end
end
Please let me know if you improve it. We're always on the hunt to fix up
some of the logic in the submit filter.
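
One possible direction for that cleanup, as a rough sketch only: the same
checks written as a flat function with early returns, so it can be run with
plain tables outside slurmctld. The name check_gpu_request and the no_val
parameter (standing in for slurm.NO_VAL) are made up for illustration and are
not part of Slurm; the error values and messages are copied from the filter
above.

```lua
-- Error codes as given in the filter above.
local ESLURM_INVALID_GRES = 2072
local ESLURM_BAD_TASK_COUNT = 2025

-- Hypothetical helper, not part of Slurm: decide whether a job aimed at a
-- gpgpu partition asked for GPUs. `no_val` stands in for slurm.NO_VAL so the
-- function can be exercised without the slurm table.
-- Returns nil when the job is acceptable, otherwise (error code, message).
function check_gpu_request(job_desc, no_val)
  local partition = job_desc.partition
  if partition == nil or partition == no_val then
    return nil
  end
  -- "gpgpu" is a substring of "gpgputest", so one plain-text search covers both.
  if not string.find(partition, "gpgpu", 1, true) then
    return nil
  end

  local gres = job_desc.gres
  if gres and string.find(gres, "gpu", 1, true) then
    -- Reject an explicit zero count, e.g. "gpu:0" or "gpu:p100:0".
    local count = string.match(gres, ":(%d+)$")
    if count and tonumber(count) < 1 then
      return ESLURM_INVALID_GRES,
        "Invalid GPGPU count specified in GRES, must be greater than 0"
    end
    return nil
  end

  -- No GPU in gres: fall back to the tres_per_* fields, as the filter does.
  if job_desc.tres_per_node or job_desc.tres_per_socket then
    return nil
  end
  if job_desc.tres_per_task == nil then
    return ESLURM_INVALID_GRES,
      "You tried submitting to a GPGPU partition, but you didn't request one with GRES or GPUS"
  end
  if job_desc.num_tasks == no_val then
    return ESLURM_BAD_TASK_COUNT,
      "--gpus-per-task option requires --tasks specification"
  end
  return nil
end

-- Possible use inside slurm_job_submit (untested against a live slurmctld):
--   local rc, msg = check_gpu_request(job_desc, slurm.NO_VAL)
--   if rc then
--     slurm.log_user("%s", msg)
--     return rc
--   end
```

Returning the code and message instead of logging inside the helper keeps it
free of the slurm table, which is what makes it testable with a standalone Lua
interpreter.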
Cheers,
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Fri, 4 Dec 2020 at 23:58, Loris Bennett <loris.bennett at fu-berlin.de>
wrote:
> Hi,
>
> I want to reject jobs that don't specify any GPUs when accessing our GPU
> partition and have the following in job_submit.lua:
>
>   if (job_desc.partition == "gpu" and job_desc.gres == nil ) then
>      slurm.log_user(string.format("Please request GPU resources in the partition 'gpu', " ..
>                                   "e.g. '#SBATCH --gres=gpu:1' " ..
>                                   "Please see 'man sbatch' for more details)"))
>      slurm.log_info(string.format("check_parameters: user '%s' did not request GPUs in partition 'gpu'",
>                                   username))
>      return slurm.ERROR
>   end
>
> If GRES is not given for the GPU partition, this produces
>
>   sbatch: error: Please request GPU resources in the partition 'gpu', e.g. '#SBATCH --gres=gpu:1' Please see 'man sbatch' for more details)
>   sbatch: error: Batch job submission failed: Unspecified error
>
> My questions are:
>
> 1. Is there a better error to return? The 'slurm.ERROR' produces the
> generic second error line above (slurm_errno.h just seems to have
> ESLURM_MISSING_TIME_LIMIT and ESLURM_INVALID_KNL as errors a plugin
> might raise). This is misleading, since the error is in fact known
> and specific.
> 2. Am I right in thinking that 'job_desc' does not, as of 20.02.06, have
> a 'gpus' field corresponding to the sbatch/srun option '--gpus'?
>
> Cheers,
>
> Loris
>
> --
> Dr. Loris Bennett (Hr./Mr.)
> ZEDAT, Freie Universität Berlin Email loris.bennett at fu-berlin.de