<div dir="ltr"><p class="MsoNormal"><span lang="EN-US">Hello ~<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p><p class="MsoNormal"><span lang="EN-US">Please help me.<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p><p class="MsoNormal"><span lang="EN-US">Total GPU : 4<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">Large qos : 3 (max 3 gpus)<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">Base qos  : 2 (max 2 gpus)<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p><p class="MsoNormal"><span lang="EN-US">I have a total of four GPUs.<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">When a job in the large QoS is using three GPUs and a job in the base QoS is submitted,<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">I want the large QoS job to keep running for a grace period before it is preempted so that the base QoS job can start.<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">However, as soon as the base QoS job is submitted, the large QoS job is canceled immediately, with no grace period.<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p><p class="MsoNormal"><span lang="EN-US">Yet the slurmctld log shows that a grace time was set:<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">[2023-11-02T11:37:36.589] debug:  setting 3600 sec preemption grace time for JobId=153 to reclaim resources for JobId=154<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p><p class="MsoNormal"><span lang="EN-US">Could you help me understand what might be going wrong?<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p><p class="MsoNormal"><span lang="EN-US">Here are my Slurm configuration details.<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">If you need any more 
information, please feel free to reply at any time.<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p><p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p><p class="MsoNormal"><b><span lang="EN-US">### /etc/slurm/slurm.conf ###<u></u><u></u></span></b></p><p class="MsoNormal"><span lang="EN-US"># cat /etc/slurm/slurm.conf<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US"># slurm.conf file generated by configurator.html.<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US"># Put this file on all nodes of your cluster.<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US"># See the slurm.conf man page for more information.<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">#<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US"># Global Configuration<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">ClusterName=cluster<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">SlurmctldHost=master01<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">SlurmUser=slurm<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">GresTypes=gpu<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">JobRequeue=1<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">ProctrackType=proctrack/cgroup<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">ReturnToService=2<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">StateSaveLocation=/NFS/slurm/ctld<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">SwitchType=switch/none<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">TaskPlugin=task/cgroup,task/affinity<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p><p class="MsoNormal"><span lang="EN-US"># SLURMCTLD<u></u><u></u></span></p><p class="MsoNormal"><span 
lang="EN-US">SlurmctldPidFile=/var/spool/slurm/slurmctld.pid<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">SlurmctldLogFile=/var/log/slurm//slurmctld.log<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">SlurmctldTimeout=30<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">SlurmctldDebug=debug5<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p><p class="MsoNormal"><span lang="EN-US"># SLURMD<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">SlurmdLogFile=/var/log/slurm/slurmd.log<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">SlurmdPidFile=/var/spool/slurm/slurmd.pid<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">SlurmdSpoolDir=/var/spool/slurm/<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">SlurmdTimeout=30<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">SlurmdDebug=debug5<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p><p class="MsoNormal"><span lang="EN-US"># SCHEDULING<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">SchedulerType=sched/backfill<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p><p class="MsoNormal"><span lang="EN-US"># JOB PRIORITY<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">PriorityType=priority/multifactor<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">PriorityWeightQOS=10000<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p><p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p><p class="MsoNormal"><span lang="EN-US"># Select Resource<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">SelectType=select/cons_tres<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">SelectTypeParameters=CR_CPU<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US"><u></u> 
<u></u></span></p><p class="MsoNormal"><span lang="EN-US"># Job<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">JobAcctGatherType=jobacct_gather/cgroup<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">JobCompUser=slurm<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">JobCompType=jobcomp/filetxt<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">JobCompLoc=/NFS/slurm/job-comp/slurm_jobcomp.log<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">MinJobAge=3600<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p><p class="MsoNormal"><span lang="EN-US"># Account<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">AccountingStoreFlags=job_comment<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">AccountingStorageType=accounting_storage/slurmdbd<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">AccountingStorageHost=master01<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">AccountingStoragePass=/var/run/munge/munge.socket.2<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">AccountingStorageUser=slurm<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">AccountingStorageTRES=gres/gpu<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">AccountingStorageEnforce=limits,qos<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p><p class="MsoNormal"><span lang="EN-US"># COMPUTE NODES<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">NodeName=compute01 CPUs=8 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=15731 State=UNKNOWN<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">NodeName=compute02 CPUs=8 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=7679 State=UNKNOWN<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">NodeName=compute03 CPUs=8 Boards=1 
SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=7679 State=UNKNOWN<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">PartitionName=cpu Nodes=compute0[1-3] Default=NO MaxTime=INFINITE State=UP<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p><p class="MsoNormal"><span lang="EN-US">NodeName=gpu01 Gres=gpu:2 CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=15731<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">NodeName=gpu02 Gres=gpu:1 CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=15731<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">NodeName=gpu03 Gres=gpu:1 CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=15731<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">PartitionName=gpu Nodes=gpu0[1-3] Default=YES MaxTime=INFINITE State=UP OverSubscribe=FORCE:4<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p><p class="MsoNormal"><span lang="EN-US"># Preemption<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">PreemptMode=CANCEL<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">PreemptType=preempt/qos<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p><p class="MsoNormal"><b><span lang="EN-US">### Slurmdbd ###<u></u><u></u></span></b></p><p class="MsoNormal"><span lang="EN-US"># sacctmgr show qos<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">      Name   Priority  GraceTime    Preempt   PreemptExemptTime PreemptMode                                    Flags UsageThres UsageFactor       GrpTRES   GrpTRESMins GrpTRESRunMin GrpJobs GrpSubmit     GrpWall       MaxTRES MaxTRESPerNode   MaxTRESMins     MaxWall     MaxTRESPU MaxJobsPU MaxSubmitPU     MaxTRESPA MaxJobsPA MaxSubmitPA       MinTRES<u></u><u></u></span></p><p class="MsoNormal"><span 
lang="EN-US">---------- ---------- ---------- ---------- ------------------- ----------- ---------------------------------------- ---------- ----------- ------------- ------------- ------------- ------- --------- ----------- ------------- -------------- ------------- ----------- ------------- --------- ----------- ------------- --------- ----------- -------------<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">    normal          0   00:00:00                                    cluster                                                        1.000000                                                                                                                                                                                                             <u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">      base       1000   00:00:00      large                         cluster                                                        1.000000                                                                            gres/gpu=2                                                                                                                       <u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">     large        100   01:00:00                                    cluster                                                        1.000000                                                                            gres/gpu=3                                                                                                                       <u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">     small        500   00:00:00                                    cluster                                                        1.000000                                                                            gres/gpu=2                                                                                    <u></u><u></u></span></p><p 
class="MsoNormal"><span lang="EN-US">            <u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US"># sacctmgr show assoc<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">   Cluster    Account       User  Partition     Share   Priority GrpJobs       GrpTRES GrpSubmit     GrpWall   GrpTRESMins MaxJobs       MaxTRES MaxTRESPerNode MaxSubmit     MaxWall   MaxTRESMins                  QOS   Def QOS GrpTRESRunMin<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">---------- ---------- ---------- ---------- --------- ---------- ------- ------------- --------- ----------- ------------- ------- ------------- -------------- --------- ----------- ------------- -------------------- --------- -------------<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">   cluster       root                               1                                                                                                                       <u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">   cluster       root       root                    1                                                                                                                      <u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">   cluster    suser01                               1                                                                                                                      <u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">   cluster    suser01    suser01                    1                                                                                                                                                   base,large,small      base<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">   cluster    suser02                               1                                                                                                                       
<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">   cluster    suser02    suser02                    1                                                                                                                                                         base,large      base<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">   cluster    suser03                               1                                                                                                                      <u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">   cluster    suser03    suser03                    1                                                                                                                                                         base,large      base<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">   cluster    suser04                               1                                                                                                                       <u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">   cluster    suser04    suser04                    1                                                                                                                                                         base,large      base<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">   cluster      susol                               1                                                                                                                      <u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">   cluster      susol      susol                    1 <u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">   <u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">   <u></u><u></u></span></p><p class="MsoNormal"><b><span lang="EN-US">### Sample Job ###<u></u><u></u></span></b></p><p class="MsoNormal"><span lang="EN-US">suser01 
$ cat 4-suser01-large-qos-srun_gpu-burn.sh<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">#!/bin/bash -l<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">#SBATCH -J 4-suser01-large-qos-srun_gpu-burn.sh<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">#SBATCH -G 3<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">#SBATCH -q large<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p><p class="MsoNormal"><span lang="EN-US">cd /NFS/gpu-burn<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">srun ./gpu_burn -d 120<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p><p class="MsoNormal"><span lang="EN-US">suser01 $ cat 4-suser01-base-qos-srun_gpu-burn.sh<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">#!/bin/bash -l<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">#SBATCH -J 4-suser01-base-qos-srun_gpu-burn<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">#SBATCH -G 2<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p><p class="MsoNormal"><span lang="EN-US">cd /NFS/gpu-burn<u></u><u></u></span></p><p class="MsoNormal"><span lang="EN-US">srun ./gpu_burn -d 120</span></p></div>