[slurm-users] Pending with resource problems
Prentice Bisbal
pbisbal at pppl.gov
Wed Apr 17 15:21:59 UTC 2019
Mahmood,
What do you see as the problem here? To me, there is no problem, and the
scheduler is working exactly as it should. The reason "Resources" means
that there are not enough computing resources available for your job to
run right now, so the job is sitting in the queue in the pending state,
waiting for the necessary resources to become available. This is exactly
what schedulers are supposed to do.
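If you want to confirm what Slurm itself reports as the pending reason,
'scontrol show job' prints it on the JobState line. A quick check, using
job ID 878 from your squeue output below; for a pending job it should
show something like:

$ scontrol show job 878 | grep Reason
   JobState=PENDING Reason=Resources Dependency=(null)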
As Andreas pointed out, looking at the output of 'scontrol show node
compute-0-0' that you provided, compute-0-0 has 32 cores and 63 GB of
RAM. Of those, 9 cores and 55 GB of RAM have already been allocated,
leaving 23 cores but only about 8 GB of RAM available for other jobs.
The job you submitted requested 20 cores (tasks, technically) and 40 GB
of RAM. Since compute-0-0 doesn't have enough RAM available, Slurm is
keeping your job in the queue until enough RAM is free for it to run.
This is exactly what Slurm should be doing.
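You can check this yourself before submitting. A rough sketch (the field
names come straight from the 'scontrol show node' output you posted,
both values are in MB; this assumes GNU grep for the -P flag, which el7
has):

#!/bin/bash
# Estimate how much memory is still allocatable on a node by
# subtracting AllocMem from RealMemory in 'scontrol show node' output.
node=compute-0-0
real=$(scontrol show node "$node" | grep -oP 'RealMemory=\K[0-9]+')
alloc=$(scontrol show node "$node" | grep -oP 'AllocMem=\K[0-9]+')
echo "$node: $(( real - alloc )) MB still allocatable"

For your node that works out to 64261 - 56320 = 7941 MB, just under
8 GB, so a 40 GB request cannot start there until other jobs finish.
(Note that FreeMem=37715 is what the OS reports as free memory; Slurm
schedules against AllocMem vs. RealMemory, not against FreeMem.)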
Prentice
On 4/17/19 11:00 AM, Henkel, Andreas wrote:
> I think there isn’t enough memory.
> AllocTRES shows mem=55G,
> and your job wants another 40G, although the node only has 63G in total.
> Best,
> Andreas
>
> On 17.04.2019 at 16:45, Mahmood Naderan <mahmood.nt at gmail.com> wrote:
>
>> Hi,
>> Although it was fine for previous job runs, the following script now
>> stuck as PD with the reason about resources.
>>
>> $ cat slurm_script.sh
>> #!/bin/bash
>> #SBATCH --output=test.out
>> #SBATCH --job-name=g09-test
>> #SBATCH --ntasks=20
>> #SBATCH --nodelist=compute-0-0
>> #SBATCH --mem=40GB
>> #SBATCH --account=z7
>> #SBATCH --partition=EMERALD
>> g09 test.gjf
>> $ sbatch slurm_script.sh
>> Submitted batch job 878
>> $ squeue
>>  JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
>>    878   EMERALD g09-test shakerza PD  0:00     1 (Resources)
>>
>>
>>
>> However, all things look good.
>>
>> $ sacctmgr list association format=user,account,partition,grptres%20 | grep shaker
>> shakerzad+      local
>> shakerzad+         z7    emerald        cpu=20,mem=40G
>> $ scontrol show node compute-0-0
>> NodeName=compute-0-0 Arch=x86_64 CoresPerSocket=1
>> CPUAlloc=9 CPUTot=32 CPULoad=8.89
>> AvailableFeatures=rack-0,32CPUs
>> ActiveFeatures=rack-0,32CPUs
>> Gres=(null)
>> NodeAddr=10.1.1.254 NodeHostName=compute-0-0 Version=18.08
>> OS=Linux 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017
>> RealMemory=64261 AllocMem=56320 FreeMem=37715 Sockets=32 Boards=1
>> State=MIXED ThreadsPerCore=1 TmpDisk=444124 Weight=20511900
>> Owner=N/A MCS_label=N/A
>> Partitions=CLUSTER,WHEEL,EMERALD,QUARTZ
>> BootTime=2019-04-06T10:03:47 SlurmdStartTime=2019-04-06T10:05:54
>> CfgTRES=cpu=32,mem=64261M,billing=47
>> AllocTRES=cpu=9,mem=55G
>> CapWatts=n/a
>> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>
>>
>> Any idea?
>>
>> Regards,
>> Mahmood
>>
>>