[slurm-users] EXTERNAL-Re: [External] scancel gpu jobs when gpu is not requested
Ratnasamy, Fritz
fritz.ratnasamy at chicagobooth.edu
Thu Aug 26 05:10:16 UTC 2021
Hi Michael,
Thanks for your message. Yes, I was able to get all interactive sessions
killed quickly when trying other partitions with the prolog deactivated. I
read your example and I understand how it could work (in the example,
instead of checking whether the GPU model is passed, maybe we could check
the number of GPUs requested?), but where do I set up that function, and
where do I call it?
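
To illustrate the idea of checking the number of GPUs rather than the model,
here is a minimal bash sketch of the counting logic only. It is not the
submit filter itself (the example on the linked page is a Lua script); the
function name count_requested_gpus is hypothetical, and it assumes the
request arrives as a --gres style string such as "gpu", "gpu:2",
"gpu:tesla:2", or a comma-separated list of such entries.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the GPUs named in a --gres style string.
# Assumed input forms: "gpu", "gpu:2", "gpu:tesla:2", or a comma-separated
# list mixing GPU and non-GPU entries.
count_requested_gpus() {
    local gres="$1" total=0 entry last
    local -a entries
    # Split the request on commas into individual gres entries.
    IFS=',' read -r -a entries <<< "$gres"
    for entry in "${entries[@]}"; do
        case "$entry" in
            gpu|gpu:*) ;;        # only GPU entries are counted
            *) continue ;;       # skip any other generic resource
        esac
        last="${entry##*:}"      # text after the final colon, if any
        if [[ "$last" =~ ^[0-9]+$ ]]; then
            total=$(( total + 10#$last ))   # explicit count, forced base 10
        else
            total=$(( total + 1 ))          # "gpu" or "gpu:<model>" means one GPU
        fi
    done
    echo "$total"
}
```

With this sketch, count_requested_gpus "gpu:tesla:2" prints 2 and an empty
string prints 0, so a filter could reject the job whenever the result is 0
on the gpu partition.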
Thanks,
*Fritz Ratnasamy*
Data Scientist
Information Technology
The University of Chicago
Booth School of Business
5807 S. Woodlawn
Chicago, Illinois 60637
Phone: +(1) 773-834-4556
On Wed, Aug 25, 2021 at 9:54 AM Michael Robbert <mrobbert at mines.edu> wrote:
> I doubt that it is a problem with your script and suspect that there is
> some weird interaction with scancel on interactive jobs. If you want to
> get to the bottom of that, I’d suggest disabling the prolog and testing by
> manually cancelling some interactive jobs.
>
> Another suggestion is to try a completely different approach to solve your
> problem. Why wait until the job starts to do the check? You can use a
> submit filter and it will alert the user as soon as they try to submit.
> That will prevent them from potentially having to wait in the queue if the
> cluster is busy and gets around having to cancel a running job. There is a
> description and simple example at the bottom of this page:
> https://slurm.schedmd.com/resource_limits.html
>
>
>
> Mike
>
>
>
> *From: *slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
> Ratnasamy, Fritz <fritz.ratnasamy at chicagobooth.edu>
> *Date: *Tuesday, August 24, 2021 at 21:00
> *To: *slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
> *Subject: *[External] [slurm-users] scancel gpu jobs when gpu is not
> requested
>
> Hello,
>
> I have written a script in my prolog.sh that cancels any Slurm job if the
> parameter gres=gpu is not present. This is the script I added to my
> prolog.sh:
>
> if [ "$SLURM_JOB_PARTITION" = "gpu" ]; then
>     if [ -n "${GPU_DEVICE_ORDINAL}" ]; then
>         echo "GPU ID used is ID: $GPU_DEVICE_ORDINAL"
>         # count the comma-separated GPU IDs (also correct for IDs >= 10)
>         Ngpu=$(echo "$GPU_DEVICE_ORDINAL" | awk -F',' '{print NF}')
>     else
>         echo "No GPU selected"
>         Ngpu=0
>     fi
>
>     # if 0 gpus were allocated, cancel the job
>     if [ "$Ngpu" -eq 0 ]; then
>         scancel "${SLURM_JOB_ID}"
>     fi
> fi
>
> What the code does is look at the number of GPUs allocated and, if it is
> 0, cancel the job. It works fine when a user runs sbatch submit.sh (and
> submit.sh does not contain --gres=gpu:1). However, when a user requests
> an interactive session without GPUs, the job is cancelled but hangs for
> 5-6 minutes before it is actually killed.
>
> jlo at mfe01:~ $ srun --partition=gpu --pty bash --login
>
> srun: job 4631872 queued and waiting for resources
>
> srun: job 4631872 has been allocated resources
>
> srun: Force Terminated job 4631872   (the kill then hangs for 5-6 minutes)
>
> Is there anything wrong with my script? Why do I see this hang only when
> scancel is applied to an interactive session? I would like to get rid of
> the hang.
>
> Thanks
>
> *Fritz Ratnasamy*
>
> Data Scientist
>
> Information Technology
>
> The University of Chicago
>
> Booth School of Business
>
> 5807 S. Woodlawn
>
> Chicago, Illinois 60637
>
> Phone: +(1) 773-834-4556