[slurm-users] Pending with resource problems

Mahmood Naderan mahmood.nt at gmail.com
Wed Apr 17 17:10:42 UTC 2019


Yes. It seems that Slurm reserves whatever the user specifies. The real-time
memory usage of the other jobs is less than what their users requested. I
thought Slurm would handle that dynamically in order to put more jobs into
the running state.
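
(As a rough cross-check, a job's requested and actually used memory can be
compared with something like

$ sacct -j <jobid> --units=G --format=JobID,ReqMem,MaxRSS,State

where <jobid> is a placeholder for one of the running jobs and MaxRSS is
reported per job step; sstat gives the live value for steps that are still
running. Either way, Slurm allocates against the requested --mem, not the
observed usage.)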

Regards,
Mahmood




On Wed, Apr 17, 2019 at 7:54 PM Prentice Bisbal <pbisbal at pppl.gov> wrote:

> Mahmood,
>
> What do you see as the problem here? To me, there is no problem and the
> scheduler is working exactly as it should. The reason "Resources" means
> that there are not enough computing resources available for your job to run
> right now, so the job is sitting in the queue in the pending state, waiting
> for the necessary resources to become available. This is exactly what
> schedulers are supposed to do.
>
> As Andreas pointed out, looking at the output of 'scontrol show node
> compute-0-0' that you provided, compute-0-0 has 32 cores and 63 GB of RAM.
> Of those, 9 cores and 55 GB of RAM have already been allocated, leaving
> 23 cores but only 8 GB of RAM available for other jobs. The job you
> submitted requested 20 cores (tasks, technically) and 40 GB of RAM. Since
> compute-0-0 doesn't have enough RAM available, Slurm is keeping your job in
> the queue until enough RAM is available for it to run. This is exactly what
> Slurm should be doing.
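>
> To put numbers on it (using the scontrol output quoted below): RealMemory=64261 MB
> minus AllocMem=56320 MB leaves 7941 MB (roughly 8 GB) that can still be allocated,
> while --mem=40GB asks for 40960 MB.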
>
> Prentice
>
> On 4/17/19 11:00 AM, Henkel, Andreas wrote:
>
> I think there isn’t enough memory.
> AllocTRES shows mem=55G,
> and your job wants another 40G, although the node only has 63G in total.
> Best,
> Andreas
>
> Am 17.04.2019 um 16:45 schrieb Mahmood Naderan <mahmood.nt at gmail.com>:
>
> Hi,
> Although it was fine for previous job runs, the following script is now stuck
> as PD with "Resources" given as the reason.
>
> $ cat slurm_script.sh
> #!/bin/bash
> #SBATCH --output=test.out
> #SBATCH --job-name=g09-test
> #SBATCH --ntasks=20
> #SBATCH --nodelist=compute-0-0
> #SBATCH --mem=40GB
> #SBATCH --account=z7
> #SBATCH --partition=EMERALD
> g09 test.gjf
> $ sbatch slurm_script.sh
> Submitted batch job 878
> $ squeue
>              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>                878   EMERALD g09-test shakerza PD       0:00      1 (Resources)
>
>
>
> However, all things look good.
>
> $ sacctmgr list association format=user,account,partition,grptres%20 | grep shaker
> shakerzad+      local
> shakerzad+         z7    emerald       cpu=20,mem=40G
> $ scontrol show node compute-0-0
> NodeName=compute-0-0 Arch=x86_64 CoresPerSocket=1
>    CPUAlloc=9 CPUTot=32 CPULoad=8.89
>    AvailableFeatures=rack-0,32CPUs
>    ActiveFeatures=rack-0,32CPUs
>    Gres=(null)
>    NodeAddr=10.1.1.254 NodeHostName=compute-0-0 Version=18.08
>    OS=Linux 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017
>    RealMemory=64261 AllocMem=56320 FreeMem=37715 Sockets=32 Boards=1
>    State=MIXED ThreadsPerCore=1 TmpDisk=444124 Weight=20511900 Owner=N/A
> MCS_label=N/A
>    Partitions=CLUSTER,WHEEL,EMERALD,QUARTZ
>    BootTime=2019-04-06T10:03:47 SlurmdStartTime=2019-04-06T10:05:54
>    CfgTRES=cpu=32,mem=64261M,billing=47
>    AllocTRES=cpu=9,mem=55G
>    CapWatts=n/a
>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
>
> Any idea?
>
> Regards,
> Mahmood
>
>
>