[slurm-users] sbatch: error: memory allocation failure

Yap, Mike M.Yap at massey.ac.nz
Mon Jun 7 23:46:04 UTC 2021


Hi All

Can anyone advise what might cause the error message below when submitting a job?
sbatch: error: memory allocation failure
The same script worked perfectly fine until I added  #SBATCH --nodelist=(compute[015-046])  (once that line is removed, it works as it should).
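A minimal sketch of a submission script carrying this directive, for illustration only. The job name is a placeholder and not taken from the original script. The sbatch documentation describes --nodelist as taking a comma-separated list of hosts, a range such as host[1-5,7], or a filename, so the range is written here without the surrounding parentheses. Whether the parentheses are what triggers the error is an assumption to test, not a confirmed diagnosis.

```shell
#!/bin/bash
#SBATCH --job-name=nodelist-test       # placeholder name (assumption)
#SBATCH --nodelist=compute[015-046]    # same range as above, without parentheses

# Record the nodes Slurm allocated; fall back to a note when run outside Slurm.
allocated="${SLURM_JOB_NODELIST:-none (not running under Slurm)}"
echo "Allocated nodes: $allocated"
```

Submitting both variants of the directive with the rest of the script unchanged should show whether the syntax alone makes the difference.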

Details of the setup and the issue:

  1.  In the current setup, each group of compute nodes has its own resource definition:
     a.   (NodeName=compute[007-014] Procs=36 CoresPerSocket=18 RealMemory=384000 ThreadsPerCore=1 Boards=1 SocketsPerBoard=2) - newer model
     b.   (NodeName=compute[001-006] Procs=16 CoresPerSocket=18 RealMemory=128000 ThreadsPerCore=1 Boards=1 SocketsPerBoard=2)
  2.  The same resources are shared between multiple queues (this works fine).
  3.  A parallel job runs correctly when it is assigned to a single node category (i.e. exclusively on 1a or exclusively on 1b).
  4.  When the exact same job is spread across 1a and 1b, it runs on the 1b nodes but there is no activity on the 1a nodes.
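One way to make the behaviour in item 4 visible is sketched below, under stated assumptions: the two node names are examples picked from the ranges in item 1 (one from each category), and the job is otherwise empty. It counts how many tasks report in from each host, so a node that was allocated but ran nothing is simply missing from the output.

```shell
#!/bin/bash
#SBATCH --nodes=2                           # assumption: one node from each category
#SBATCH --nodelist=compute001,compute007    # assumption: example names from the ranges above

# Count the tasks that report in from each host.
if command -v srun >/dev/null 2>&1; then
    report="$(srun hostname | sort | uniq -c)"
else
    # Outside a Slurm installation there is nothing to launch.
    report="srun not found; submit this with sbatch on the cluster"
fi
echo "$report"
```

If only one hostname appears in the job output, the other node received no task, which would match the symptom described.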

Any suggestions?

Thanks
Mike