[slurm-users] Pending with resource problems
Mahmood Naderan
mahmood.nt at gmail.com
Wed Apr 17 17:10:42 UTC 2019
Yes. It seems that Slurm reserves whatever the user specifies. The real-time
memory usage of the other jobs is less than what the users specified. I thought
that Slurm would handle that dynamically in order to put more jobs into the
running state.
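If it helps, this is a rough sketch of how I compare the requested memory with
the peak memory actually used by the running jobs on that node (the job ID is
just a placeholder):

$ squeue -w compute-0-0 -t RUNNING -o "%i %u %m"       # memory requested per running job
$ sacct -j <jobid> --format=JobID,ReqMem,MaxRSS,State   # requested vs. peak used memory

As far as I understand, only the requested amount (what shows up in AllocMem)
counts when Slurm decides whether a new job fits, not the real-time usage.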
Regards,
Mahmood
On Wed, Apr 17, 2019 at 7:54 PM Prentice Bisbal <pbisbal at pppl.gov> wrote:
> Mahmood,
>
> What do you see as the problem here? To me, there is no problem and the
> scheduler is working exactly as it should. The reason "Resources" means
> that there are not enough computing resources available for your job to run
> right now, so the job is sitting in the queue in the pending state waiting
> for the necessary resources to become available. This is exactly what
> schedulers are supposed to do.
>
> As Andreas pointed out, looking at the output of 'scontrol show node
> compute-0-0' that you provided, compute-0-0 has 32 cores and 63 GB of RAM.
> Out of that, 9 cores and 55 GB of RAM have already been allocated, leaving
> 23 cores and only 8 GB of RAM available for other jobs. The job you
> submitted requested 20 cores (tasks, technically) and 40 GB of RAM. Since
> compute-0-0 doesn't have enough RAM available, Slurm is keeping your job in
> the queue until enough RAM is available for it to run. This is exactly what
> Slurm should be doing.
>
> Prentice
>
> On 4/17/19 11:00 AM, Henkel, Andreas wrote:
>
> I think there isn’t enough memory.
> AllocTRES shows mem=55G,
> and your job wants another 40G, although the node only has 63G in total.
> Best,
> Andreas
>
> On 17.04.2019, at 16:45, Mahmood Naderan <mahmood.nt at gmail.com> wrote:
>
> Hi,
> Although it was fine for previous job runs, the following script is now stuck
> as PD with the reason "Resources".
>
> $ cat slurm_script.sh
> #!/bin/bash
> #SBATCH --output=test.out
> #SBATCH --job-name=g09-test
> #SBATCH --ntasks=20
> #SBATCH --nodelist=compute-0-0
> #SBATCH --mem=40GB
> #SBATCH --account=z7
> #SBATCH --partition=EMERALD
> g09 test.gjf
> $ sbatch slurm_script.sh
> Submitted batch job 878
> $ squeue
> JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
>   878   EMERALD g09-test shakerza PD  0:00     1 (Resources)
>
>
>
> However, all things look good.
>
> $ sacctmgr list association format=user,account,partition,grptres%20 | grep shaker
> shakerzad+ local
> shakerzad+ z7 emerald cpu=20,mem=40G
> $ scontrol show node compute-0-0
> NodeName=compute-0-0 Arch=x86_64 CoresPerSocket=1
> CPUAlloc=9 CPUTot=32 CPULoad=8.89
> AvailableFeatures=rack-0,32CPUs
> ActiveFeatures=rack-0,32CPUs
> Gres=(null)
> NodeAddr=10.1.1.254 NodeHostName=compute-0-0 Version=18.08
> OS=Linux 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017
> RealMemory=64261 AllocMem=56320 FreeMem=37715 Sockets=32 Boards=1
> State=MIXED ThreadsPerCore=1 TmpDisk=444124 Weight=20511900 Owner=N/A
> MCS_label=N/A
> Partitions=CLUSTER,WHEEL,EMERALD,QUARTZ
> BootTime=2019-04-06T10:03:47 SlurmdStartTime=2019-04-06T10:05:54
> CfgTRES=cpu=32,mem=64261M,billing=47
> AllocTRES=cpu=9,mem=55G
> CapWatts=n/a
> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
>
> Any idea?
>
> Regards,
> Mahmood
>
>
>
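P.S. A quick check of the memory arithmetic on the node itself (a rough sketch
based on the RealMemory/AllocMem fields from the scontrol output above; values
are in MB):

$ scontrol show node compute-0-0 | grep -oE 'RealMemory=[0-9]+|AllocMem=[0-9]+'
RealMemory=64261
AllocMem=56320

64261 - 56320 = 7941 MB (about 7.8 GB) unallocated, so a 40 GB request cannot
start there right now.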