I have a new Slurm setup in AWS GovCloud that is not quite working. I'll list a few facts and maybe someone can suggest where to look next. The Troubleshooting page really has nothing relevant for elastic cloud deployments.

The nodes are getting set to DOWN+CLOUD+POWERED_DOWN. Submitting a job does not launch a node in this state. I can force the nodes to launch with scontrol POWER_UP. The jobs then claim to run, get requeued, but never complete. When the nodes boot I see them appear in slurmctld, but it soon reports the connection lost. The slurmd on each node claims to be healthy, yet the controller eventually just terminates them. I can ping in both directions by hostname.

I've built four clusters before. The first was Torque/Maui and the rest were Slurm, but all were bare metal; this is my first attempt at cloud. We have ITAR data, so I can't use AWS Parallel Computing Service because it is not offered in GovCloud.

https://cluster-in-the-cloud.readthedocs.io/en/latest/running.html

I had to fork this project because so much of it is obsolete, but it's mostly working for me now.

https://github.com/mntbighker
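To be concrete about what I mean by forcing a node up, and about the power-save knobs in play, here is a rough sketch; the program paths and timeouts below are illustrative placeholders rather than my exact config:

  # Force a cloud node to boot and clear its DOWN state by hand:
  scontrol update NodeName=many-antelope-c5n-2xlarge-0001 State=POWER_UP
  scontrol update NodeName=many-antelope-c5n-2xlarge-0001 State=RESUME

  # slurm.conf power-save settings that drive the automatic behavior
  # (values and paths here are placeholders, not my working config):
  #   ResumeProgram=/opt/cloud_sbin/startnode    # hypothetical path
  #   SuspendProgram=/opt/cloud_sbin/stopnode    # hypothetical path
  #   ResumeTimeout=600
  #   SuspendTime=300
  #   ReturnToService=2
  #   SlurmctldParameters=cloud_dns,idle_on_node_suspend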
[root@mgmt ~]# journalctl -fu slurmctld
Jan 29 19:16:15 mgmt.many-antelope.citc.local slurmctld[3403]: slurmctld: debug: sched/backfill: _attempt_backfill: 1 jobs to backfill
Jan 29 19:16:45 mgmt.many-antelope.citc.local slurmctld[3403]: slurmctld: debug: Updating partition uid access list
Jan 29 19:16:45 mgmt.many-antelope.citc.local slurmctld[3403]: slurmctld: debug: purge_old_job: job file deletion is falling behind, 1 left to remove
Jan 29 19:16:45 mgmt.many-antelope.citc.local slurmctld[3403]: slurmctld: debug: sched: Running job scheduler for full queue.
Jan 29 19:16:45 mgmt.many-antelope.citc.local slurmctld[3403]: slurmctld: debug: sched/backfill: _attempt_backfill: beginning
Jan 29 19:16:45 mgmt.many-antelope.citc.local slurmctld[3403]: slurmctld: debug: sched/backfill: _attempt_backfill: 1 jobs to backfill
Jan 29 19:16:50 mgmt.many-antelope.citc.local slurmctld[3403]: slurmctld: debug: sackd_mgr_dump_state: saved state of 0 nodes
Jan 29 19:17:15 mgmt.many-antelope.citc.local slurmctld[3403]: slurmctld: debug: sched/backfill: _attempt_backfill: beginning
Jan 29 19:17:15 mgmt.many-antelope.citc.local slurmctld[3403]: slurmctld: debug: sched/backfill: _attempt_backfill: 1 jobs to backfill
Jan 29 19:17:26 mgmt.many-antelope.citc.local slurmctld[3403]: slurmctld: POWER: Power save mode: 4 nodes
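If it's useful, the controller can be made much chattier about the power-save machinery without a restart; these are stock scontrol calls:

  # Raise slurmctld verbosity and enable the power-management debug flag
  scontrol setdebug debug2
  scontrol setdebugflags +Power

  # Put things back afterwards
  scontrol setdebug info
  scontrol setdebugflags -Power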
[root@mgmt ~]# scontrol show node many-antelope-c5n-2xlarge-0001
NodeName=many-antelope-c5n-2xlarge-0001 CoresPerSocket=4
   CPUAlloc=0 CPUEfctv=8 CPUTot=8 CPULoad=0.00
   AvailableFeatures=shape=c5n.2xlarge,ad=None,arch=x86_64
   ActiveFeatures=shape=c5n.2xlarge,ad=None,arch=x86_64
   Gres=(null)
   NodeAddr=many-antelope-c5n-2xlarge-0001 NodeHostName=many-antelope-c5n-2xlarge-0001
   RealMemory=20034 AllocMem=0 FreeMem=N/A Sockets=1 Boards=1
   State=DOWN+CLOUD+POWERED_DOWN ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=production,debug,batch,long
   BootTime=None SlurmdStartTime=None LastBusyTime=Unknown ResumeAfterTime=None
   CfgTRES=cpu=8,mem=20034M,billing=8
   AllocTRES=
   CurrentWatts=0 AveWatts=0
Reason=Not responding [slurm@2025-01-29T17:49:26]
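When a node lands in "Not responding" like this, the basic checks I know of (a generic sketch, not specific to this cluster) are whether the controller can reach slurmd and vice versa:

  # From the controller: is slurmd reachable on its default port (6818)?
  nc -zv many-antelope-c5n-2xlarge-0001 6818

  # From the compute node: is slurmctld reachable on its default port (6817)?
  nc -zv mgmt.many-antelope.citc.local 6817

  # On the compute node: what hardware does slurmd itself detect?
  slurmd -C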
[root@mgmt ~]# scontrol show jobs
JobId=30 JobName=test.sl
   UserId=mwmoorcroft(1106) GroupId=nssam(1101) MCS_label=N/A
   Priority=1 Nice=0 Account=nssam QOS=(null)
   JobState=PENDING Reason=Nodes_required_for_job_are_DOWN,_DRAINED_or_reserved_for_jobs_in_higher_priority_partitions Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2025-01-29T18:58:26 EligibleTime=2025-01-29T18:58:26
   AccrueTime=2025-01-29T18:58:26
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-01-29T19:19:15 Scheduler=Backfill:*
   Partition=production AllocNode:Sid=ip-172-16-2-14.us-gov-east-1.compute.internal:7090
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=20034M,node=1,billing=1
   AllocTRES=(null)
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/mnt/shared/home/mwmoorcroft/test.sl
   WorkDir=/mnt/shared/home/mwmoorcroft
   StdErr=/mnt/shared/home/mwmoorcroft/slurm-30.out
   StdIn=/dev/null
   StdOut=/mnt/shared/home/mwmoorcroft/slurm-30.out
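A compact way to see why the scheduler refuses to start the job is to list node states and the recorded reasons (plain sinfo usage, nothing special):

  # Every down/drained node and the reason slurmctld recorded
  sinfo -R

  # State and reason per node in the job's partition
  sinfo -p production -o "%N %T %E"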
On Jan 29, 2025, at 16:49, mark.w.moorcroft--- via slurm-users <slurm-users@lists.schedmd.com> wrote:
It helps to unblock port 6818 on the node image. #eyeroll
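For anyone who hits the same wall: slurmd listens on 6818/tcp by default (SlurmdPort), so the compute image and its security group both need it open. A rough sketch, assuming firewalld on the image and a hypothetical security group ID and CIDR:

  # On the compute node image (firewalld):
  firewall-cmd --permanent --add-port=6818/tcp
  firewall-cmd --reload

  # Or via the node security group (group ID and CIDR are placeholders):
  aws ec2 authorize-security-group-ingress \
      --group-id sg-0123456789abcdef0 \
      --protocol tcp --port 6818 \
      --cidr 172.16.0.0/16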
Bear in mind there are port requirements on the login node too if you plan to run interactive jobs (they will otherwise hang when executed).
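The hang happens because srun on the login node listens on ephemeral ports that the compute nodes must be able to reach back to. One way to make that firewall-friendly is Slurm's SrunPortRange parameter plus a matching rule on the login node; the range below is just an example:

  # slurm.conf: pin srun's listening ports to a fixed range
  SrunPortRange=60001-63000

  # Login node firewall (firewalld example):
  firewall-cmd --permanent --add-port=60001-63000/tcp
  firewall-cmd --reload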
--
#BlackLivesMatter
 ____
|| \UTGERS,     |---------------------------*O*---------------------------
||_// the State |         Ryan Novosielski - novosirj@rutgers.edu
|| \ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \    of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
     `'