[slurm-users] Pending with resource problems

Prentice Bisbal pbisbal at pppl.gov
Wed Apr 17 15:21:59 UTC 2019


Mahmood,

What do you see as the problem here? To me, there is no problem and the 
scheduler is working exactly as it should. The reason "Resources" means 
that there are not enough computing resources available for your job to 
run right now, so the job is sitting in the queue in the pending state, 
waiting for the necessary resources to become available. This is exactly 
what schedulers are supposed to do.
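If you want to confirm the reason and see when Slurm expects the job to 
start, you can query the job directly (using job ID 878 from the output 
below):

$ scontrol show job 878 | grep -i reason
$ squeue -j 878 --start

The first command prints the JobState/Reason line; the second shows 
Slurm's estimated start time for the pending job, once the backfill 
scheduler has computed one.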

As Andreas pointed out, looking at the output of 'scontrol show node 
compute-0-0' that you provided, compute-0-0 has 32 cores and 63 GB of 
RAM. Of those, 9 cores and 55 GB of RAM have already been allocated, 
leaving 23 cores but only about 8 GB of RAM available for other jobs. The 
job you submitted requested 20 cores (tasks, technically) and 40 GB of 
RAM. Since compute-0-0 doesn't have enough RAM available, Slurm is 
keeping your job in the queue until enough RAM is free for it to run. 
This is exactly what Slurm should be doing.
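For example, you can pull the two memory figures straight out of scontrol 
and do the subtraction yourself (the values here are taken from the 
output you posted):

$ scontrol show node compute-0-0 | grep -oE '(RealMemory|AllocMem)=[0-9]+'
RealMemory=64261
AllocMem=56320

64261 - 56320 = 7941 MB, so only about 8 GB is left to allocate, well 
short of the 40 GB your job requests. You can also see which jobs are 
currently holding those resources with 'squeue -w compute-0-0'.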

Prentice

On 4/17/19 11:00 AM, Henkel, Andreas wrote:
> I think there isn’t enough memory.
> AllocTres Shows mem=55G
> And your job wants another 40G although the node only has 63G in total.
> Best,
> Andreas
>
> On 17.04.2019 at 16:45, Mahmood Naderan <mahmood.nt at gmail.com> wrote:
>
>> Hi,
>> Although it was fine for previous job runs, the following script is now 
>> stuck as PD (pending) with the reason "Resources".
>>
>> $ cat slurm_script.sh
>> #!/bin/bash
>> #SBATCH --output=test.out
>> #SBATCH --job-name=g09-test
>> #SBATCH --ntasks=20
>> #SBATCH --nodelist=compute-0-0
>> #SBATCH --mem=40GB
>> #SBATCH --account=z7
>> #SBATCH --partition=EMERALD
>> g09 test.gjf
>> $ sbatch slurm_script.sh
>> Submitted batch job 878
>> $ squeue
>>              JOBID PARTITION     NAME     USER ST       TIME  NODES 
>> NODELIST(REASON)
>>                878   EMERALD g09-test shakerza PD       0:00      1 
>> (Resources)
>>
>>
>>
>> However, everything looks fine.
>>
>> $ sacctmgr list association format=user,account,partition,grptres%20 
>> | grep shaker
>> shakerzad+      local
>> shakerzad+         z7    emerald cpu=20,mem=40G
>> $ scontrol show node compute-0-0
>> NodeName=compute-0-0 Arch=x86_64 CoresPerSocket=1
>>    CPUAlloc=9 CPUTot=32 CPULoad=8.89
>>    AvailableFeatures=rack-0,32CPUs
>>    ActiveFeatures=rack-0,32CPUs
>>    Gres=(null)
>>    NodeAddr=10.1.1.254 NodeHostName=compute-0-0 Version=18.08
>>    OS=Linux 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017
>>    RealMemory=64261 AllocMem=56320 FreeMem=37715 Sockets=32 Boards=1
>>    State=MIXED ThreadsPerCore=1 TmpDisk=444124 Weight=20511900 
>> Owner=N/A MCS_label=N/A
>>    Partitions=CLUSTER,WHEEL,EMERALD,QUARTZ
>>    BootTime=2019-04-06T10:03:47 SlurmdStartTime=2019-04-06T10:05:54
>>    CfgTRES=cpu=32,mem=64261M,billing=47
>>    AllocTRES=cpu=9,mem=55G
>>    CapWatts=n/a
>>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>
>>
>> Any idea?
>>
>> Regards,
>> Mahmood
>>
>>