<div dir="ltr">Dear Robbert,<br><br>Thank you so much for your response. I was so focused on syncing the time that I missed the date on one of the nodes, which was off by one day, as you said. I have corrected it, and now I get the following output from the status command.<br><br><b>(base) [nousheen@nousheen slurm]$ systemctl status slurmctld.service -l</b><br>● slurmctld.service - Slurm controller daemon<br>   Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)<br>   Active: active (running) since Thu 2022-12-01 21:37:34 PKT; 20min ago<br> Main PID: 19475 (slurmctld)<br>    Tasks: 10<br>   Memory: 4.5M<br>   CGroup: /system.slice/slurmctld.service<br>           ├─19475 /usr/sbin/slurmctld -D -s<br>           └─19538 slurmctld: slurmscriptd  <br><br>Dec 01 21:47:08 nousheen slurmctld[19475]: slurmctld: sched/backfill: _start_job: Started JobId=106 in debug on 101<br>Dec 01 21:47:09 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=106 WEXITSTATUS 1<br>Dec 01 21:47:09 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=106 done<br>Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate JobId=107 NodeList=101 #CPUs=8 Partition=debug<br>Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate JobId=108 NodeList=105 #CPUs=8 Partition=debug<br>Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate JobId=109 NodeList=nousheen #CPUs=8 Partition=debug<br>Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=107 WEXITSTATUS 1<br>Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=107 done<br>Dec 01 21:47:12 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=108 WEXITSTATUS 1<br>Dec 01 21:47:12 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=108 done<br><br>I have four nodes in total, one of which is the server node. After submitting a job, the job runs only on my server node, while all the other nodes are IDLE, DOWN, or not responding. 
The details are given below:<br><br><b>(base) [nousheen@nousheen slurm]$ scontrol show nodes</b><br>NodeName=101 Arch=x86_64 CoresPerSocket=6 <br>   CPUAlloc=0 CPUTot=12 CPULoad=0.01<br>   AvailableFeatures=(null)<br>   ActiveFeatures=(null)<br>   Gres=(null)<br>   NodeAddr=192.168.60.101 NodeHostName=101 Version=21.08.4<br>   OS=Linux 3.10.0-1160.59.1.el7.x86_64 #1 SMP Wed Feb 23 16:47:03 UTC 2022 <br>   RealMemory=1 AllocMem=0 FreeMem=641 Sockets=1 Boards=1<br>   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A<br>   Partitions=debug <br>   BootTime=2022-11-24T11:18:28 SlurmdStartTime=2022-12-01T21:34:57<br>   LastBusyTime=2022-12-02T00:58:31<br>   CfgTRES=cpu=12,mem=1M,billing=12<br>   AllocTRES=<br>   CapWatts=n/a<br>   CurrentWatts=0 AveWatts=0<br>   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s<br><br>NodeName=104 CoresPerSocket=6 <br>   CPUAlloc=0 CPUTot=12 CPULoad=N/A<br>   AvailableFeatures=(null)<br>   ActiveFeatures=(null)<br>   Gres=(null)<br>   NodeAddr=192.168.60.114 NodeHostName=104 <br>   RealMemory=1 AllocMem=0 FreeMem=N/A Sockets=1 Boards=1<br>   State=DOWN+NOT_RESPONDING ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A<br>   Partitions=debug <br>   BootTime=None SlurmdStartTime=None<br>   LastBusyTime=2022-12-01T21:37:35<br>   CfgTRES=cpu=12,mem=1M,billing=12<br>   AllocTRES=<br>   CapWatts=n/a<br>   CurrentWatts=0 AveWatts=0<br>   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s<br>   Reason=Not responding [slurm@2022-12-01T16:22:28]<br><br>NodeName=105 Arch=x86_64 CoresPerSocket=6 <br>   CPUAlloc=0 CPUTot=12 CPULoad=1.08<br>   AvailableFeatures=(null)<br>   ActiveFeatures=(null)<br>   Gres=(null)<br>   NodeAddr=192.168.60.115 NodeHostName=105 Version=21.08.4<br>   OS=Linux 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022 <br>   RealMemory=1 AllocMem=0 FreeMem=20723 Sockets=1 Boards=1<br>   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A<br>   
Partitions=debug <br>   BootTime=2022-11-24T11:15:37 SlurmdStartTime=2022-12-01T16:15:30<br>   LastBusyTime=2022-12-01T21:47:11<br>   CfgTRES=cpu=12,mem=1M,billing=12<br>   AllocTRES=<br>   CapWatts=n/a<br>   CurrentWatts=0 AveWatts=0<br>   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s<br><br>NodeName=nousheen Arch=x86_64 CoresPerSocket=6 <br>   CPUAlloc=8 CPUTot=12 CPULoad=6.73<br>   AvailableFeatures=(null)<br>   ActiveFeatures=(null)<br>   Gres=(null)<br>   NodeAddr=192.168.60.149 NodeHostName=nousheen Version=21.08.5<br>   OS=Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021 <br>   RealMemory=1 AllocMem=0 FreeMem=22736 Sockets=1 Boards=1<br>   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A<br>   Partitions=debug <br>   BootTime=2022-12-01T12:00:08 SlurmdStartTime=2022-12-01T12:00:42<br>   LastBusyTime=2022-12-01T21:37:39<br>   CfgTRES=cpu=12,mem=1M,billing=12<br>   AllocTRES=cpu=8<br>   CapWatts=n/a<br>   CurrentWatts=0 AveWatts=0<br>   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s<br><br>However, this command shows only one node with a running job:<br><br><b>(base) [nousheen@nousheen slurm]$ squeue -j</b><br>             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)<br>               109     debug   SRBD-4 nousheen  R    3:17:48      1 nousheen<br><br>Can you please help me understand why my compute nodes are down and not responding?<br><br>Thank you for your time.<br clear="all"><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><br></div><div dir="ltr"><br></div><div dir="ltr">Best Regards,</div><div dir="ltr"><span style="font-family:arial;font-size:small">Nousheen Parvaiz</span><br style="font-family:arial;font-size:small"><br></div></div></div></div></div></div><br></div>
<br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Dec 1, 2022 at 8:55 PM Michael Robbert <<a href="mailto:mrobbert@mines.edu">mrobbert@mines.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="msg-3781780940283637406"><div lang="EN-US" style="overflow-wrap: break-word;"><div class="m_-3781780940283637406WordSection1"><p class="MsoNormal"><span style="font-size:11pt">I believe that the error you need to pay attention to for this issue is this line:<u></u><u></u></span></p><p class="MsoNormal"><span style="font-size:11pt"><u></u> <u></u></span></p><p class="MsoNormal"><span style="font-size:13.5pt;color:black">Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for out of sync clocks</span><span style="font-size:11pt"><u></u><u></u></span></p><p class="MsoNormal"><span style="font-size:11pt"><u></u> <u></u></span></p><p class="MsoNormal"><span style="font-size:11pt"><u></u> <u></u></span></p><p class="MsoNormal"><span style="font-size:11pt">It looks like your compute node's clock is a full day ahead of your controller node: Dec. 2 instead of Dec. 1. 
The clocks need to be in sync for munge to work.<u></u><u></u></span></p><p class="MsoNormal"><span style="font-size:11pt"><u></u> <u></u></span></p><div><div><p class="MsoNormal"><b><span style="font-size:11pt;color:rgb(0,32,96)">Mike Robbert</span></b><span style="font-size:11pt;color:black"><u></u><u></u></span></p><p class="MsoNormal"><b><span style="font-size:11pt;color:rgb(0,32,96)">Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced Research Computing</span></b><span style="font-size:11pt;color:black"><u></u><u></u></span></p><p class="MsoNormal"><span style="font-size:11pt;color:rgb(118,113,113)">Information and Technology Solutions (ITS)</span><span style="font-size:11pt;color:black"><u></u><u></u></span></p><p class="MsoNormal"><span style="font-size:11pt;color:rgb(118,113,113)">303-273-3786 | </span><span style="font-size:11pt;color:black"><a href="mailto:mrobbert@mines.edu" target="_blank"><span style="color:rgb(5,99,193)">mrobbert@mines.edu</span></a></span><span style="font-size:11pt;color:rgb(118,113,113)"> </span><span style="font-size:12pt;color:rgb(118,113,113)"> </span><span style="font-size:11pt;color:black"><u></u><u></u></span></p>
<p class="MsoNormal"><b><span style="font-size:11pt;color:rgb(43,65,96)">Our values:</span></b><span style="font-size:11pt;color:rgb(43,65,96)"> </span><span style="font-size:11pt;color:rgb(118,113,113)">Trust | Integrity | Respect | Responsibility</span><span style="font-size:11pt;color:black"><u></u><u></u></span></p></div></div><p class="MsoNormal"><span style="font-size:11pt"><u></u> <u></u></span></p><p class="MsoNormal"><span style="font-size:11pt"><u></u> <u></u></span></p><div style="border-right:none;border-bottom:none;border-left:none;border-top:1pt solid rgb(181,196,223);padding:3pt 0in 0in"><p class="MsoNormal" style="margin-bottom:12pt"><b><span style="font-size:12pt;color:black">From: </span></b><span style="font-size:12pt;color:black">slurm-users <<a href="mailto:slurm-users-bounces@lists.schedmd.com" target="_blank">slurm-users-bounces@lists.schedmd.com</a>> on behalf of Nousheen <<a href="mailto:nousheenparvaiz@gmail.com" target="_blank">nousheenparvaiz@gmail.com</a>><br><b>Date: </b>Thursday, December 1, 2022 at 06:19<br><b>To: </b>Slurm User Community List <<a href="mailto:slurm-users@lists.schedmd.com" target="_blank">slurm-users@lists.schedmd.com</a>><br><b>Subject: </b>[External] [slurm-users] ERROR: slurmctld: auth/munge: _print_cred: DECODED<u></u><u></u></span></p></div>
<p class="MsoNormal"><span style="font-size:11pt"><u></u> <u></u></span></p><div><div><div><p class="MsoNormal"><span style="font-size:11pt"><u></u> <u></u></span></p></div><div><p class="MsoNormal"><span style="font-size:11pt"><u></u> <u></u></span></p></div><div><p class="MsoNormal"><span style="font-size:11pt">Hello Everyone,<u></u><u></u></span></p></div><div><p class="MsoNormal"><span style="font-size:11pt"><u></u> <u></u></span></p></div><div><p class="MsoNormal"><span style="font-size:11pt">I am using Slurm version 21.08.5 and CentOS 7.<u></u><u></u></span></p></div><div><p class="MsoNormal"><span style="font-size:11pt"><u></u> <u></u></span></p></div><div><p class="MsoNormal"><span style="font-size:11pt">I can start slurmd successfully on all compute nodes, but when I start slurmctld on the server node, it gives the following error:<u></u><u></u></span></p></div><div><p class="MsoNormal"><span style="font-size:11pt"><u></u> <u></u></span></p></div><div><p class="MsoNormal"><b><span style="font-size:11pt">(base) [nousheen@nousheen ~]$ systemctl status slurmctld.service -l</span></b><span style="font-size:11pt"><br>● slurmctld.service - Slurm controller daemon<br>   Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)<br>   Active: active (running) since Thu 2022-12-01 12:00:42 PKT; 4h 16min ago<br> Main PID: 1631 (slurmctld)<br>    Tasks: 10<br>   Memory: 4.0M<br>   CGroup: /system.slice/slurmctld.service<br>           </span><span style="font-size:11pt;font-family:"MS Gothic"">├</span><span style="font-size:11pt">─1631 /usr/sbin/slurmctld -D -s<br>           └─1818 slurmctld: slurmscriptd  <br><br>Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: auth/munge: _print_cred: DECODED: Thu Dec 01 16:17:19 2022<br>Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for out of sync 
clocks<br>Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: error: Munge decode failed: Rewound credential<br>Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: auth/munge: _print_cred: ENCODED: Fri Dec 02 16:16:55 2022<br>Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: auth/munge: _print_cred: DECODED: Thu Dec 01 16:17:20 2022<br>Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: error: Check for out of sync clocks<br>Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: error: Munge decode failed: Rewound credential<br>Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: auth/munge: _print_cred: ENCODED: Fri Dec 02 16:16:56 2022<br>Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: auth/munge: _print_cred: DECODED: Thu Dec 01 16:17:21 2022<br>Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: error: Check for out of sync clocks<u></u><u></u></span></p></div><div><p class="MsoNormal"><span style="font-size:11pt"><u></u> <u></u></span></p></div><div><p class="MsoNormal"><span style="font-size:11pt">When I run the following command on compute nodes I get the following output:<u></u><u></u></span></p></div><div><p class="MsoNormal"><span style="font-size:11pt"><u></u> <u></u></span></p></div><div><p class="MsoNormal"><span style="font-size:11pt"> [gpu101@101 ~]$<b> munge -n | unmunge</b><u></u><u></u></span></p></div><p class="MsoNormal"><span style="font-size:11pt">STATUS:           Success (0)<br>ENCODE_HOST:      ??? 
(0.0.0.101)<br>ENCODE_TIME:      2022-12-02 16:33:38 +0500 (1669980818)<br>DECODE_TIME:      2022-12-02 16:33:38 +0500 (1669980818)<br>TTL:              300<br>CIPHER:           aes128 (4)<br>MAC:              sha1 (3)<br>ZIP:              none (0)<br>UID:              gpu101 (1000)<br>GID:              gpu101 (1000)<br>LENGTH:           0<u></u><u></u></span></p><div><p class="MsoNormal"><span style="font-size:11pt"><u></u> <u></u></span></p></div><div><p class="MsoNormal"><span style="font-size:11pt">Is this error caused by the ENCODE_HOST name showing question marks and munge not picking up the IP address correctly? How can I correct this? All the nodes stop responding when I run a job, even though I have synced the clocks across the cluster. <u></u><u></u></span></p></div><div><p class="MsoNormal"><span style="font-size:11pt"><u></u> <u></u></span></p></div><div><p class="MsoNormal"><span style="font-size:11pt">I am new to Slurm. Kindly guide me in this matter.<u></u><u></u></span></p></div><div><p class="MsoNormal"><span style="font-size:11pt"> <u></u><u></u></span></p></div><div><p class="MsoNormal"><span style="font-size:11pt"><u></u> <u></u></span></p></div><p class="MsoNormal"><span style="font-size:11pt"><br clear="all"><u></u><u></u></span></p><div><div><div><div><div><div><p class="MsoNormal"><span style="font-size:11pt">Best Regards,<u></u><u></u></span></p></div><div><p class="MsoNormal"><span style="font-size:12pt;font-family:Arial,sans-serif">Nousheen Parvaiz<br>Ph.D. 
Scholar</span><span style="font-size:11pt"> <u></u><u></u></span></p><div><p class="MsoNormal"><span style="font-size:12pt;font-family:Arial,sans-serif"><u></u> <u></u></span></p></div></div></div></div></div></div></div></div></div></div></div></div></blockquote></div>