[slurm-users] EXTERNAL-Re: [External] scancel gpu jobs when gpu is not requested

Ratnasamy, Fritz fritz.ratnasamy at chicagobooth.edu
Tue Aug 31 05:05:18 UTC 2021


Hi Michael,

Thanks for your message. Does installing the job_submit_lua.so library
require recompiling Slurm, i.e., do I have to build Slurm with the
job_submit_lua.so library to be able to add any plugin? I do not see it
in the yum repo.
Thanks,

*Fritz Ratnasamy*

Data Scientist

Information Technology

The University of Chicago

Booth School of Business

5807 S. Woodlawn

Chicago, Illinois 60637

Phone: +(1) 773-834-4556


On Thu, Aug 26, 2021 at 9:18 AM Michael Robbert <mrobbert at mines.edu> wrote:

> You need to set the following option in slurm.conf
>
> *JobSubmitPlugins*
>
> A comma delimited list of job submission plugins to be used. The specified
> plugins will be executed in the order listed. These are intended to be
> site-specific plugins which can be used to set default job parameters
> and/or logging events. Sample plugins available in the distribution include
> "all_partitions", "defaults", "logging", "lua", and "partition". For
> examples of use, see the Slurm code in "src/plugins/job_submit" and
> "contribs/lua/job_submit*.lua" then modify the code to satisfy your needs.
> Slurm can be configured to use multiple job_submit plugins if desired,
> however the lua plugin will only execute one lua script named
> "job_submit.lua" located in the default script directory (typically the
> subdirectory "etc" of the installation directory). No job submission
> plugins are used by default.
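>
> For example, enabling the lua plugin might be as little as one line in
> slurm.conf (a sketch; slurmctld needs a restart or reconfigure to pick it
> up):
>
> # slurm.conf: run the lua job submit plugin on every submission
> JobSubmitPlugins=lua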
>
>
>
>
>
> Then, as this documentation states, put job_submit.lua into your script
> directory; mine is in /etc/slurm/. You may want to make sure that the
> job_submit_lua.so library is installed with your build of Slurm. I agree
> that complete documentation for this feature is a little hard to find.
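>
> As a rough sketch of what such a script could look like for this case
> (rejecting gpu-partition jobs that request no GPUs), something like the
> following might work. Note that the job_desc field holding the GRES
> request varies by Slurm version (tres_per_node in recent releases, gres in
> older ones), so treat this as a starting point rather than a drop-in:
>
> -- /etc/slurm/job_submit.lua (sketch, not production code)
> function slurm_job_submit(job_desc, part_list, submit_uid)
>     if job_desc.partition == "gpu" then
>         -- GRES request string, e.g. "gpu:1"; field name varies by version
>         local gres = job_desc.tres_per_node or job_desc.gres
>         if gres == nil or string.find(gres, "gpu", 1, true) == nil then
>             slurm.log_user("gpu partition jobs must request a GPU, e.g. --gres=gpu:1")
>             return slurm.ESLURM_INVALID_GRES
>         end
>     end
>     return slurm.SUCCESS
> end
>
> -- job_submit.lua must also define this hook, even if it does nothing
> function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
>     return slurm.SUCCESS
> end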
>
>
>
> Mike
>
>
>
> *From: *slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
> Ratnasamy, Fritz <fritz.ratnasamy at chicagobooth.edu>
> *Date: *Wednesday, August 25, 2021 at 23:13
> *To: *Slurm User Community List <slurm-users at lists.schedmd.com>
> *Subject: *Re: [slurm-users] EXTERNAL-Re: [External] scancel gpu jobs
> when gpu is not requested
>
> Hi Michael,
>
> Thanks for your message. Yes, interactive sessions were cancelled quickly
> when I tried other partitions and deactivated the prolog. I read your
> example and I understand how it could work (in the example, maybe instead
> of checking whether a GPU model is passed, we could check the number of
> GPUs passed?), but where do I set up that function and where do I call it?
> Thanks,
>
> *Fritz Ratnasamy*
>
> Data Scientist
>
> Information Technology
>
> The University of Chicago
>
> Booth School of Business
>
> 5807 S. Woodlawn
>
> Chicago, Illinois 60637
>
> Phone: +(1) 773-834-4556
>
>
>
>
>
> On Wed, Aug 25, 2021 at 9:54 AM Michael Robbert <mrobbert at mines.edu>
> wrote:
>
> I doubt that it is a problem with your script and suspect that there is
> some weird interaction with scancel on interactive jobs. If you wanted to
> get to the bottom of that I’d suggest disabling the prolog and test by
> manually cancelling some interactive jobs.
>
> Another suggestion is to try a completely different approach to solve your
> problem. Why wait until the job starts to do the check? You can use a
> submit filter and it will alert the user as soon as they try to submit.
> That will prevent them from potentially having to wait in the queue if the
> cluster is busy and gets around having to cancel a running job. There is a
> description and simple example at the bottom of this page:
> https://slurm.schedmd.com/resource_limits.html
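>
> As a rough illustration, a filter could count the GPUs requested in the
> GRES string rather than match a model; the pattern below is only an
> assumption about the GRES format ("gpu:2", "gpu:tesla:2"), so adjust it to
> what your site actually sees:
>
> -- hypothetical helper for a job_submit.lua filter
> local function gpu_count(gres)
>     if gres == nil then return 0 end
>     -- a trailing count, e.g. "gpu:2" or "gpu:tesla:2"
>     local count = string.match(gres, "gpu[:%w]-:(%d+)")
>     if count then return tonumber(count) end
>     -- a bare "gpu" request with no count means one GPU
>     if string.find(gres, "gpu", 1, true) then return 1 end
>     return 0
> end
>
> print(gpu_count("gpu:2"))       --> 2
> print(gpu_count("gpu:tesla:4")) --> 4
> print(gpu_count("cpu:8"))       --> 0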
>
>
>
> Mike
>
>
>
> *From: *slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
> Ratnasamy, Fritz <fritz.ratnasamy at chicagobooth.edu>
> *Date: *Tuesday, August 24, 2021 at 21:00
> *To: *slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
> *Subject: *[External] [slurm-users] scancel gpu jobs when gpu is not
> requested
>
>
>
>
> Hello,
>
> I have written a script in my prolog.sh that cancels any Slurm job when
> the --gres=gpu parameter is not present. This is the script I added to my
> prolog.sh:
>
> if [ "$SLURM_JOB_PARTITION" == "gpu" ]; then
>         if [ -n "${GPU_DEVICE_ORDINAL}" ]; then
>                 echo "GPU ID used is ID: ${GPU_DEVICE_ORDINAL}"
>                 # strip the commas, then count the remaining characters
>                 # (one per single-digit GPU index)
>                 list_gpu=$(echo "$GPU_DEVICE_ORDINAL" | sed -e "s/,//g")
>                 Ngpu=${#list_gpu}
>         else
>                 echo "No GPU selected"
>                 Ngpu=0
>         fi
>
>         # if 0 GPUs were allocated, cancel the job
>         if [ "$Ngpu" -eq 0 ]; then
>                 scancel "${SLURM_JOB_ID}"
>         fi
> fi
>
> The code looks at the number of GPUs allocated and, if it is 0, cancels
> the job. It works fine when a user runs sbatch submit.sh (and submit.sh
> does not contain --gres=gpu:1). However, when a user requests an
> interactive session without GPUs, the job is cancelled but hangs for 5-6
> minutes before actually being killed.
>
> jlo at mfe01:~ $ srun --partition=gpu --pty bash --login
>
> srun: job 4631872 queued and waiting for resources
>
> srun: job 4631872 has been allocated resources
>
> srun: Force Terminated job 4631872    (the termination then hangs for 5-6 minutes)
>
> Is there anything wrong with my script? Why do I see this hang only when
> scancel is run on an interactive session? I would like to get rid of the
> hang.
>
> Thanks
>
> *Fritz Ratnasamy*
>
> Data Scientist
>
> Information Technology
>
> The University of Chicago
>
> Booth School of Business
>
> 5807 S. Woodlawn
>
> Chicago, Illinois 60637
>
> Phone: +(1) 773-834-4556
>
>
>
>
>