I have a new Slurm setup in AWS GovCloud that is not quite working. I'll list a few facts and maybe someone can suggest where to look next. The Troubleshooting page really has nothing relevant for elastic cloud deployments.

The nodes are getting set to DOWN+CLOUD+POWERED_DOWN. Submitting a job does not launch a node in this state. I can force the nodes to launch with scontrol POWER_UP. The jobs then claim to run, get requeued, but never complete. When the nodes boot I see them appear in slurmctld, but it soon reports the connection lost. The slurmd on each node claims to be healthy, yet the controller eventually just terminates them. I can ping in both directions by hostname.

I've built four clusters before. The first was Torque/Maui and the rest were Slurm, but all were bare metal; this is my first attempt at cloud. We have ITAR data, so I can't use AWS Parallel Computing Service because it is not offered in GovCloud.

https://cluster-in-the-cloud.readthedocs.io/en/latest/running.html

I had to fork this project because so much of it is obsolete, but it's mostly working for me now.

https://github.com/mntbighker
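To be concrete about what I mean by forcing a node up, and about the power-save knobs in play, here is a rough sketch; the program paths and timeouts below are illustrative placeholders rather than my exact config:

  # Force a cloud node to boot and clear its DOWN state by hand:
  scontrol update NodeName=many-antelope-c5n-2xlarge-0001 State=POWER_UP
  scontrol update NodeName=many-antelope-c5n-2xlarge-0001 State=RESUME

  # slurm.conf power-save settings that drive the automatic behavior
  # (values and paths here are placeholders, not my working config):
  #   ResumeProgram=/opt/cloud_sbin/startnode    # hypothetical path
  #   SuspendProgram=/opt/cloud_sbin/stopnode    # hypothetical path
  #   ResumeTimeout=600
  #   SuspendTime=300
  #   ReturnToService=2
  #   SlurmctldParameters=cloud_dns,idle_on_node_suspend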
[root@mgmt ~]# journalctl -fu slurmctld
Jan 29 19:16:15 mgmt.many-antelope.citc.local slurmctld[3403]: slurmctld: debug: sched/backfill: _attempt_backfill: 1 jobs to backfill
Jan 29 19:16:45 mgmt.many-antelope.citc.local slurmctld[3403]: slurmctld: debug: Updating partition uid access list
Jan 29 19:16:45 mgmt.many-antelope.citc.local slurmctld[3403]: slurmctld: debug: purge_old_job: job file deletion is falling behind, 1 left to remove
Jan 29 19:16:45 mgmt.many-antelope.citc.local slurmctld[3403]: slurmctld: debug: sched: Running job scheduler for full queue.
Jan 29 19:16:45 mgmt.many-antelope.citc.local slurmctld[3403]: slurmctld: debug: sched/backfill: _attempt_backfill: beginning
Jan 29 19:16:45 mgmt.many-antelope.citc.local slurmctld[3403]: slurmctld: debug: sched/backfill: _attempt_backfill: 1 jobs to backfill
Jan 29 19:16:50 mgmt.many-antelope.citc.local slurmctld[3403]: slurmctld: debug: sackd_mgr_dump_state: saved state of 0 nodes
Jan 29 19:17:15 mgmt.many-antelope.citc.local slurmctld[3403]: slurmctld: debug: sched/backfill: _attempt_backfill: beginning
Jan 29 19:17:15 mgmt.many-antelope.citc.local slurmctld[3403]: slurmctld: debug: sched/backfill: _attempt_backfill: 1 jobs to backfill
Jan 29 19:17:26 mgmt.many-antelope.citc.local slurmctld[3403]: slurmctld: POWER: Power save mode: 4 nodes
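If it's useful, the controller can be made much chattier about the power-save machinery without a restart; these are stock scontrol calls:

  # Raise slurmctld verbosity and enable the power-management debug flag
  scontrol setdebug debug2
  scontrol setdebugflags +Power

  # Put things back afterwards
  scontrol setdebug info
  scontrol setdebugflags -Power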
[root@mgmt ~]# scontrol show node many-antelope-c5n-2xlarge-0001
NodeName=many-antelope-c5n-2xlarge-0001 CoresPerSocket=4
   CPUAlloc=0 CPUEfctv=8 CPUTot=8 CPULoad=0.00
   AvailableFeatures=shape=c5n.2xlarge,ad=None,arch=x86_64
   ActiveFeatures=shape=c5n.2xlarge,ad=None,arch=x86_64
   Gres=(null)
   NodeAddr=many-antelope-c5n-2xlarge-0001 NodeHostName=many-antelope-c5n-2xlarge-0001
   RealMemory=20034 AllocMem=0 FreeMem=N/A Sockets=1 Boards=1
   State=DOWN+CLOUD+POWERED_DOWN ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=production,debug,batch,long
   BootTime=None SlurmdStartTime=None LastBusyTime=Unknown ResumeAfterTime=None
   CfgTRES=cpu=8,mem=20034M,billing=8
   AllocTRES=
   CurrentWatts=0 AveWatts=0
Reason=Not responding [slurm@2025-01-29T17:49:26]
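When a node lands in "Not responding" like this, the basic checks I know of (a generic sketch, not specific to this cluster) are whether the controller can reach slurmd and vice versa:

  # From the controller: is slurmd reachable on its default port (6818)?
  nc -zv many-antelope-c5n-2xlarge-0001 6818

  # From the compute node: is slurmctld reachable on its default port (6817)?
  nc -zv mgmt.many-antelope.citc.local 6817

  # On the compute node: what hardware does slurmd itself detect?
  slurmd -C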
[root@mgmt ~]# scontrol show jobs
JobId=30 JobName=test.sl
   UserId=mwmoorcroft(1106) GroupId=nssam(1101) MCS_label=N/A
   Priority=1 Nice=0 Account=nssam QOS=(null)
   JobState=PENDING Reason=Nodes_required_for_job_are_DOWN,_DRAINED_or_reserved_for_jobs_in_higher_priority_partitions Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2025-01-29T18:58:26 EligibleTime=2025-01-29T18:58:26
   AccrueTime=2025-01-29T18:58:26
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-01-29T19:19:15 Scheduler=Backfill:*
   Partition=production AllocNode:Sid=ip-172-16-2-14.us-gov-east-1.compute.internal:7090
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=20034M,node=1,billing=1
   AllocTRES=(null)
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/mnt/shared/home/mwmoorcroft/test.sl
   WorkDir=/mnt/shared/home/mwmoorcroft
   StdErr=/mnt/shared/home/mwmoorcroft/slurm-30.out
   StdIn=/dev/null
   StdOut=/mnt/shared/home/mwmoorcroft/slurm-30.out
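A compact way to see why the scheduler refuses to start the job is to list node states and the recorded reasons (plain sinfo usage, nothing special):

  # Every down/drained node and the reason slurmctld recorded
  sinfo -R

  # State and reason per node in the job's partition
  sinfo -p production -o "%N %T %E"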
On Jan 29, 2025, at 16:49, mark.w.moorcroft--- via slurm-users <slurm-users@lists.schedmd.com> wrote:
It helps to unblock port 6818 on the node image. #eyeroll
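For anyone who hits the same wall: slurmd listens on 6818/tcp by default (SlurmdPort), so the compute image and its security group both need it open. A rough sketch, assuming firewalld on the image and a hypothetical security group ID and CIDR:

  # On the compute node image (firewalld):
  firewall-cmd --permanent --add-port=6818/tcp
  firewall-cmd --reload

  # Or via the node security group (group ID and CIDR are placeholders):
  aws ec2 authorize-security-group-ingress \
      --group-id sg-0123456789abcdef0 \
      --protocol tcp --port 6818 \
      --cidr 172.16.0.0/16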
Bear in mind there are port requirements on the login node too if you plan to run interactive jobs (they will otherwise hang when executed).
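The hang happens because srun on the login node listens on ephemeral ports that the compute nodes must be able to reach back to. One way to make that firewall-friendly is Slurm's SrunPortRange parameter plus a matching rule on the login node; the range below is just an example:

  # slurm.conf: pin srun's listening ports to a fixed range
  SrunPortRange=60001-63000

  # Login node firewall (firewalld example):
  firewall-cmd --permanent --add-port=60001-63000/tcp
  firewall-cmd --reload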
--
#BlackLivesMatter
 ____
|| \UTGERS,     |---------------------------*O*---------------------------
||_// the State |         Ryan Novosielski - novosirj@rutgers.edu
|| \ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \    of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
     `'