[slurm-users] Slurm and MPICH don't play well together (salloc)

Mccall, Kurt E. (MSFC-EV41) kurt.e.mccall at nasa.gov
Tue Dec 28 23:00:03 UTC 2021


Hi,

My MPICH jobs are being launched and the desired number of processes are created, but when one of those processes tries to spawn a new process using MPI_Comm_spawn(), the call just spins in the polling code deep within the MPICH library. See the Slurm error message below. This all works without problems on other clusters that use Torque as the process manager. We are using Slurm 20.02.3 on Red Hat (kernel 4.18.0) and MPICH 4.0b1.

salloc: defined options
salloc: -------------------- --------------------
salloc: cpus-per-task       : 24
salloc: ntasks              : 2
salloc: verbose             : 1
salloc: -------------------- --------------------
salloc: end of defined options
salloc: Linear node selection plugin loaded with argument 4
salloc: select/cons_res loaded with argument 4
salloc: Cray/Aries node selection plugin loaded
salloc: select/cons_tres loaded with argument 4
salloc: Granted job allocation 34330
srun: error: Unable to create step for job 34330: Requested node configuration is not available
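
For reference, the spawning side of my code is essentially the standard MPI_Comm_spawn pattern; a minimal sketch of the kind of call that hangs is below (the "./worker" executable name and the count of 1 are placeholders, not my real values):

#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    MPI_Comm intercomm;

    /* Ask the process manager for one new process running "./worker".
       Under Slurm this call never returns for me; the parent just spins
       in MPICH's internal polling loop. */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}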

I'm wondering if the salloc command I am using is correct. I intend for it to launch 2 processes, one per node, while reserving 24 cores on each node so that the 2 launched processes can spawn new processes with MPI_Comm_spawn. Could reserving all 24 cores make Slurm or MPICH think that there are no more cores available?

salloc --ntasks=2 --cpus-per-task=24 --verbose runscript.bash ...
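
In case it is relevant, one variant I could try is to request whole nodes explicitly instead of relying on the task/CPU counts alone (untested on this cluster; the runscript.bash arguments are omitted as above):

salloc --nodes=2 --ntasks-per-node=1 --cpus-per-task=24 --verbose runscript.bash ...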


I think that our cluster's compute nodes are configured correctly -

$ scontrol show node=n001

NodeName=n001 Arch=x86_64 CoresPerSocket=6
   CPUAlloc=0 CPUTot=24 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=n001 NodeHostName=n001 Version=20.02.3
   OS=Linux 4.18.0-348.el8.x86_64 #1 SMP Mon Oct 4 12:17:22 EDT 2021
   RealMemory=128351 AllocMem=0 FreeMem=126160 Sockets=4 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=normal,low,high
   BootTime=2021-12-21T14:25:05 SlurmdStartTime=2021-12-21T14:25:52
   CfgTRES=cpu=24,mem=128351M,billing=24
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Thanks for any help.