[slurm-users] scancel gpu jobs when gpu is not requested

Ratnasamy, Fritz fritz.ratnasamy at chicagobooth.edu
Wed Aug 25 02:57:58 UTC 2021


Hello,

I have added a check to my prolog.sh that cancels any Slurm job in the gpu
partition that was submitted without the --gres=gpu parameter. This is the
script I added to prolog.sh:

if [ "$SLURM_JOB_PARTITION" = "gpu" ]; then
        if [ -n "${GPU_DEVICE_ORDINAL}" ]; then
                echo "GPU ID used is ID: $GPU_DEVICE_ORDINAL"
                # Strip the commas and count the remaining characters
                list_gpu=$(echo "$GPU_DEVICE_ORDINAL" | sed -e "s/,//g")
                Ngpu=$(expr length "$list_gpu")
        else
                echo "No GPU selected"
                Ngpu=0
        fi

        # If 0 GPUs were allocated, cancel the job
        if [ "$Ngpu" -eq 0 ]; then
                scancel "${SLURM_JOB_ID}"
        fi
fi

The code looks at the number of GPUs allocated and, if it is 0, cancels the
job. It works fine when a user runs sbatch submit.sh (and submit.sh does not
contain --gres=gpu:1). However, when a user requests an interactive session
without GPUs, the job hangs for 5-6 minutes before it is finally killed.

jlo@mfe01:~ $ srun --partition=gpu --pty bash --login
srun: job 4631872 queued and waiting for resources
srun: job 4631872 has been allocated resources
srun: Force Terminated job 4631872
(the termination then hangs for 5-6 minutes)
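
A side note on the GPU count in the prolog script: it strips the commas from
GPU_DEVICE_ORDINAL and counts the remaining characters, so a two-digit device
ID such as 10 would be counted as two GPUs. That does not affect the zero
check, but a count of the comma-separated entries may be more robust. Below is
a minimal sketch of that idea; count_gpus is a hypothetical helper name, not
something Slurm provides, and it assumes the variable holds a plain
comma-separated list of IDs.

```bash
#!/bin/bash
# count_gpus is a hypothetical helper, not part of Slurm.
# It prints how many comma-separated entries its first argument contains.
count_gpus() {
    local list="${1:-}"
    # Empty or missing input means no GPU was allocated.
    if [ -z "$list" ]; then
        echo 0
        return
    fi
    local -a ids
    # Split on commas into an array; each element is one device ID.
    IFS=',' read -r -a ids <<< "$list"
    echo "${#ids[@]}"
}

# Three devices, one of them with a two-digit ID.
count_gpus "0,1,10"   # prints 3
```

In the prolog this could be used as Ngpu=$(count_gpus "${GPU_DEVICE_ORDINAL:-}"),
which leaves the rest of the check unchanged.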

Is there anything wrong with my script? Why do I see this hang only when an
interactive session is cancelled with scancel? I would like to get rid of the
hang.

Thanks

Fritz Ratnasamy
Data Scientist
Information Technology
The University of Chicago
Booth School of Business
5807 S. Woodlawn
Chicago, Illinois 60637
Phone: +(1) 773-834-4556