[slurm-users] Jobs waiting while plenty of cpu and memory available
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Tue Jul 9 06:35:48 UTC 2019
Hi Edward,
The squeue command tells you about job status. You can get extra
information using format options (see the squeue man-page). I like to
set this environment variable for squeue:
export SQUEUE_FORMAT="%.18i %.9P %.6q %.8j %.8u %.8a %.10T %.9Q %.10M %.10V %.9l %.6D %.6C %m %R"
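For example (assuming a bash-like shell), you can put that export in your
shell startup file, or pass a format string directly with squeue's
-o/--format option to list just the pending jobs:
squeue --states=PENDING -o "%.18i %.9P %.8u %.10T %.9Q %r"
Here %Q prints the job priority and %r the pending reason.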
When some jobs are pending with Reason=Priority, it means that other
jobs with a higher priority are waiting for the same resources (CPUs) to
become available; those higher-priority jobs will show Reason=Resources
in the squeue output.
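To see which higher-priority jobs are ahead in the queue, one possibility
(just a sketch, adjust the format to taste) is to sort the pending jobs by
descending priority:
squeue --states=PENDING --sort=-p,i -o "%.18i %.9P %.8u %.9Q %.6C %m %r"
If the priority/multifactor plugin is configured, "sprio -j <jobid>"
additionally breaks a job's priority down into its individual factors.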
When you have idle nodes yet jobs remain pending, it probably means that
your Slurm partitions are defined with inappropriate limits or
resources - it's hard to guess more precisely without seeing the
configuration. Use "scontrol show partitions" to display the partition
settings.
/Ole
On 7/9/19 2:37 AM, Edward Ned Harvey (slurm) wrote:
> I have a cluster where I submit a bunch (600) of jobs, but the cluster
> only runs about 20 at a time. Using pestat, I can see there are a
> bunch of systems with plenty of available CPU and memory.
>
> Hostname   Partition   Node   Num_CPU   CPUload   Memsize   Freemem
>                       State   Use/Tot               (MB)      (MB)
> pcomp13    batch*       idle    0  72     8.19*    258207    202456
> pcomp14    batch*       idle    0  72     0.00     258207    206558
> pcomp16    batch*       idle    0  72     0.05     258207    230609
> pcomp17    batch*       idle    0  72     8.51*    258207    184492
> pcomp18    batch         mix   14  72     0.29*    258207    230575
> pcomp19    batch*       idle    0  72    10.11*    258207    179604
> pcomp20    batch*       idle    0  72     9.56*    258207    180961
> pcomp21    batch*       idle    0  72     0.10     258207    227255
> pcomp25    batch*       idle    0  72     0.07     258207    218035
> pcomp26    batch*       idle    0  72     0.03     258207    226489
> pcomp27    batch*       idle    0  72     0.25     258207    228580
> pcomp28    batch*       idle    0  72     8.15*    258207    184306
> pcomp29    batch         mix    2  72     0.01*    258207    226256
>
> How can I tell why jobs aren't running? "scontrol show job 123456" shows
> "JobState=PENDING Reason=Priority" which doesn't shed any light on the
> situation for me. The pending jobs have requested 1 cpu each and 2G of
> memory.
>
> Should I just restart slurm daemons? Or is there some way for me to see
> why these systems aren't running jobs?
>