[slurm-users] Jobs waiting while plenty of cpu and memory available
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Tue Jul 9 06:35:48 UTC 2019
Hi Edward,
The squeue command tells you about job status. You can get extra
information using format options (see the squeue man-page). I like to
set this environment variable for squeue:
export SQUEUE_FORMAT="%.18i %.9P %.6q %.8j %.8u %.8a %.10T %.9Q %.10M %.10V %.9l %.6D %.6C %m %R"
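For example (assuming a bash-like shell), you can put that export in your
shell startup file, or pass a format string directly with squeue's
-o/--format option to list just the pending jobs:
squeue --states=PENDING -o "%.18i %.9P %.8u %.10T %.9Q %r"
Here %Q prints the job priority and %r the pending reason.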
When some jobs are pending with Reason=Priority, it means that other
jobs with a higher priority are waiting for the same resources (CPUs) to
become available; those higher-priority jobs will show Reason=Resources
in the squeue output.
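To see which higher-priority jobs are ahead in the queue, one possibility
(just a sketch, adjust the format to taste) is to sort the pending jobs by
descending priority:
squeue --states=PENDING --sort=-p,i -o "%.18i %.9P %.8u %.9Q %.6C %m %r"
If the priority/multifactor plugin is configured, "sprio -j <jobid>"
additionally breaks a job's priority down into its individual factors.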
When you have idle nodes yet jobs remain pending, it probably means that
your Slurm partitions are defined with inappropriate limits or
resources - it's hard to guess more precisely without seeing the
configuration. Use "scontrol show partitions" to display the partition
settings.
/Ole
On 7/9/19 2:37 AM, Edward Ned Harvey (slurm) wrote:
> I have a cluster where I submit a bunch (600) of jobs, but the cluster
> only runs about 20 at a time. Using pestat, I can see there are a
> bunch of systems with plenty of available CPU and memory.
>
> Hostname   Partition   Node   Num_CPU   CPUload   Memsize   Freemem
>                       State   Use/Tot               (MB)      (MB)
> pcomp13    batch*       idle    0  72     8.19*    258207    202456
> pcomp14    batch*       idle    0  72     0.00     258207    206558
> pcomp16    batch*       idle    0  72     0.05     258207    230609
> pcomp17    batch*       idle    0  72     8.51*    258207    184492
> pcomp18    batch         mix   14  72     0.29*    258207    230575
> pcomp19    batch*       idle    0  72    10.11*    258207    179604
> pcomp20    batch*       idle    0  72     9.56*    258207    180961
> pcomp21    batch*       idle    0  72     0.10     258207    227255
> pcomp25    batch*       idle    0  72     0.07     258207    218035
> pcomp26    batch*       idle    0  72     0.03     258207    226489
> pcomp27    batch*       idle    0  72     0.25     258207    228580
> pcomp28    batch*       idle    0  72     8.15*    258207    184306
> pcomp29    batch         mix    2  72     0.01*    258207    226256
>
> How can I tell why jobs aren't running? "scontrol show job 123456" shows
> "JobState=PENDING Reason=Priority" which doesn't shed any light on the
> situation for me. The pending jobs have requested 1 cpu each and 2G of
> memory.
>
> Should I just restart slurm daemons? Or is there some way for me to see
> why these systems aren't running jobs?
>