<div dir="ltr"><br clear="all"><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr">Dear Ole,<br><br>Thank you so much for your response. I have now adjusted the RealMemory in the slurm.conf which was set by default previously. Your insight was really helpful. Now, when I submit the job, it is running on three nodes but one node (104) is not responding. The details of some commands are given below.<br><br><br><b>[root@nousheen ~]# squeue -j</b><br>             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)<br>               120     debug   SRBD-1 nousheen  R       0:54      1 101<br>               121     debug   SRBD-2 nousheen  R       0:54      1 105<br>               122     debug   SRBD-3 nousheen  R       0:54      1 nousheen<br>               123     debug   SRBD-4 nousheen  R       0:54      2 105,nousheen<br>                          <br>                     <b><br>[root@nousheen ~]# scontrol show nodes</b><br>NodeName=101 Arch=x86_64 CoresPerSocket=6 <br>   CPUAlloc=8 CPUTot=12 CPULoad=0.01<br>   AvailableFeatures=(null)<br>   ActiveFeatures=(null)<br>   Gres=(null)<br>   NodeAddr=192.168.60.118 NodeHostName=101 Version=21.08.4<br>   OS=Linux 3.10.0-1160.59.1.el7.x86_64 #1 SMP Wed Feb 23 16:47:03 UTC 2022 <br>   RealMemory=31919 AllocMem=0 FreeMem=293 Sockets=1 Boards=1<br>   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A<br>   Partitions=debug <br>   BootTime=2022-11-24T11:18:28 SlurmdStartTime=2022-12-02T19:56:01<br>   LastBusyTime=2022-12-02T19:58:14<br>   CfgTRES=cpu=12,mem=31919M,billing=12<br>   AllocTRES=cpu=8<br>   CapWatts=n/a<br>   CurrentWatts=0 AveWatts=0<br>   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s<br><br>NodeName=104 Arch=x86_64 CoresPerSocket=6 <br>   CPUAlloc=0 CPUTot=12 CPULoad=0.01<br>   AvailableFeatures=(null)<br>   ActiveFeatures=(null)<br>   Gres=(null)<br>   NodeAddr=192.168.60.104 NodeHostName=104 Version=21.08.4<br>   OS=Linux 3.10.0-1160.71.1.el7.x86_64 #1 SMP Tue Jun 28 15:37:28 UTC 2022 <br>   RealMemory=31889 AllocMem=0 FreeMem=30433 Sockets=1 Boards=1<br>   State=IDLE+NOT_RESPONDING ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A<br>   Partitions=debug <br>   BootTime=2022-11-24T11:15:43 SlurmdStartTime=2022-12-02T19:57:29<br>   LastBusyTime=2022-12-02T19:58:14<br>   CfgTRES=cpu=12,mem=31889M,billing=12<br>   AllocTRES=<br>   CapWatts=n/a<br>   CurrentWatts=0 AveWatts=0<br>   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s<br><br>NodeName=105 Arch=x86_64 CoresPerSocket=6 <br>   CPUAlloc=12 CPUTot=12 CPULoad=1.03<br>   AvailableFeatures=(null)<br>   ActiveFeatures=(null)<br>   Gres=(null)<br>   NodeAddr=192.168.60.105 NodeHostName=105 Version=21.08.4<br>   OS=Linux 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022 <br>   RealMemory=32051 AllocMem=0 FreeMem=14874 Sockets=1 Boards=1<br>   State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A<br>   Partitions=debug <br>   BootTime=2022-11-24T11:15:37 SlurmdStartTime=2022-12-02T19:56:57<br>   LastBusyTime=2022-12-02T19:58:14<br>   CfgTRES=cpu=12,mem=32051M,billing=12<br>   AllocTRES=cpu=12<br>   CapWatts=n/a<br>   CurrentWatts=0 AveWatts=0<br>   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s<br><br>NodeName=nousheen Arch=x86_64 CoresPerSocket=6 <br>   CPUAlloc=12 CPUTot=12 CPULoad=0.32<br>   AvailableFeatures=(null)<br>   ActiveFeatures=(null)<br>   Gres=(null)<br>   
Thank you for your time.

Best regards,

Nousheen

On Fri, Dec 2, 2022 at 11:56 AM Ole Holm Nielsen <Ole.H.Nielsen@fysik.dtu.dk> wrote:

Hi Nousheen,
<br>
It seems that you have configured the nodes incorrectly in slurm.conf.  I <br>
notice this:<br>
<br>
   RealMemory=1<br>
<br>
This means 1 Megabyte of RAM; we only had that with IBM PCs back in the 1980s :-)<br>
<br>
See how to configure nodes in <br>
<a href="https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#compute-node-configuration" rel="noreferrer" target="_blank">https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#compute-node-configuration</a><br>
<br>
You must run "slurmd -C" on each node to determine its actual hardware.<br>
<br>
I hope this helps.<br>
<br>
/Ole<br>
<br>
On 12/1/22 21:08, Nousheen wrote:<br>
> Dear Robbert,<br>
> <br>
> Thank you so much for your response. I was so focused on syncing the time that <br>
> I missed the date on one of the nodes, which was one day behind, as you said. <br>
> I have corrected it and now I get the following output in the status.<br>
> <br>
> *(base) [nousheen@nousheen slurm]$ systemctl status slurmctld.service -l*<br>
> ● slurmctld.service - Slurm controller daemon<br>
>     Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor <br>
> preset: disabled)<br>
>     Active: active (running) since Thu 2022-12-01 21:37:34 PKT; 20min ago<br>
>   Main PID: 19475 (slurmctld)<br>
>      Tasks: 10<br>
>     Memory: 4.5M<br>
>     CGroup: /system.slice/slurmctld.service<br>
>             ├─19475 /usr/sbin/slurmctld -D -s<br>
>             └─19538 slurmctld: slurmscriptd<br>
> <br>
> Dec 01 21:47:08 nousheen slurmctld[19475]: slurmctld: sched/backfill: <br>
> _start_job: Started JobId=106 in debug on 101<br>
> Dec 01 21:47:09 nousheen slurmctld[19475]: slurmctld: _job_complete: <br>
> JobId=106 WEXITSTATUS 1<br>
> Dec 01 21:47:09 nousheen slurmctld[19475]: slurmctld: _job_complete: <br>
> JobId=106 done<br>
> Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate <br>
> JobId=107 NodeList=101 #CPUs=8 Partition=debug<br>
> Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate <br>
> JobId=108 NodeList=105 #CPUs=8 Partition=debug<br>
> Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate <br>
> JobId=109 NodeList=nousheen #CPUs=8 Partition=debug<br>
> Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: _job_complete: <br>
> JobId=107 WEXITSTATUS 1<br>
> Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: _job_complete: <br>
> JobId=107 done<br>
> Dec 01 21:47:12 nousheen slurmctld[19475]: slurmctld: _job_complete: <br>
> JobId=108 WEXITSTATUS 1<br>
> Dec 01 21:47:12 nousheen slurmctld[19475]: slurmctld: _job_complete: <br>
> JobId=108 done<br>
> <br>
> I have a total of four nodes, one of which is the server node. After submitting <br>
> a job, the job only runs on my server compute node while all the other <br>
> nodes are IDLE, DOWN, or not responding. The details are given below:<br>
> <br>
> *(base) [nousheen@nousheen slurm]$ scontrol show nodes*<br>
> NodeName=101 Arch=x86_64 CoresPerSocket=6<br>
>     CPUAlloc=0 CPUTot=12 CPULoad=0.01<br>
>     AvailableFeatures=(null)<br>
>     ActiveFeatures=(null)<br>
>     Gres=(null)<br>
>     NodeAddr=192.168.60.101 NodeHostName=101 Version=21.08.4<br>
>     OS=Linux 3.10.0-1160.59.1.el7.x86_64 #1 SMP Wed Feb 23 16:47:03 UTC 2022<br>
>     RealMemory=1 AllocMem=0 FreeMem=641 Sockets=1 Boards=1<br>
>     State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A<br>
>     Partitions=debug<br>
>     BootTime=2022-11-24T11:18:28 SlurmdStartTime=2022-12-01T21:34:57<br>
>     LastBusyTime=2022-12-02T00:58:31<br>
>     CfgTRES=cpu=12,mem=1M,billing=12<br>
>     AllocTRES=<br>
>     CapWatts=n/a<br>
>     CurrentWatts=0 AveWatts=0<br>
>     ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s<br>
> <br>
> NodeName=104 CoresPerSocket=6<br>
>     CPUAlloc=0 CPUTot=12 CPULoad=N/A<br>
>     AvailableFeatures=(null)<br>
>     ActiveFeatures=(null)<br>
>     Gres=(null)<br>
>     NodeAddr=192.168.60.114 NodeHostName=104<br>
>     RealMemory=1 AllocMem=0 FreeMem=N/A Sockets=1 Boards=1<br>
>     State=DOWN+NOT_RESPONDING ThreadsPerCore=2 TmpDisk=0 Weight=1 <br>
> Owner=N/A MCS_label=N/A<br>
>     Partitions=debug<br>
>     BootTime=None SlurmdStartTime=None<br>
>     LastBusyTime=2022-12-01T21:37:35<br>
>     CfgTRES=cpu=12,mem=1M,billing=12<br>
>     AllocTRES=<br>
>     CapWatts=n/a<br>
>     CurrentWatts=0 AveWatts=0<br>
>     ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s<br>
>     Reason=Not responding [slurm@2022-12-01T16:22:28]<br>
> <br>
> NodeName=105 Arch=x86_64 CoresPerSocket=6<br>
>     CPUAlloc=0 CPUTot=12 CPULoad=1.08<br>
>     AvailableFeatures=(null)<br>
>     ActiveFeatures=(null)<br>
>     Gres=(null)<br>
>     NodeAddr=192.168.60.115 NodeHostName=105 Version=21.08.4<br>
>     OS=Linux 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022<br>
>     RealMemory=1 AllocMem=0 FreeMem=20723 Sockets=1 Boards=1<br>
>     State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A<br>
>     Partitions=debug<br>
>     BootTime=2022-11-24T11:15:37 SlurmdStartTime=2022-12-01T16:15:30<br>
>     LastBusyTime=2022-12-01T21:47:11<br>
>     CfgTRES=cpu=12,mem=1M,billing=12<br>
>     AllocTRES=<br>
>     CapWatts=n/a<br>
>     CurrentWatts=0 AveWatts=0<br>
>     ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s<br>
> <br>
> NodeName=nousheen Arch=x86_64 CoresPerSocket=6<br>
>     CPUAlloc=8 CPUTot=12 CPULoad=6.73<br>
>     AvailableFeatures=(null)<br>
>     ActiveFeatures=(null)<br>
>     Gres=(null)<br>
>     NodeAddr=192.168.60.149 NodeHostName=nousheen Version=21.08.5<br>
>     OS=Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021<br>
>     RealMemory=1 AllocMem=0 FreeMem=22736 Sockets=1 Boards=1<br>
>     State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A<br>
>     Partitions=debug<br>
>     BootTime=2022-12-01T12:00:08 SlurmdStartTime=2022-12-01T12:00:42<br>
>     LastBusyTime=2022-12-01T21:37:39<br>
>     CfgTRES=cpu=12,mem=1M,billing=12<br>
>     AllocTRES=cpu=8<br>
>     CapWatts=n/a<br>
>     CurrentWatts=0 AveWatts=0<br>
>     ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s<br>
> <br>
> Whereas this command shows only the one node on which a job is running:<br>
> <br>
> *(base) [nousheen@nousheen slurm]$ squeue -j*<br>
>               JOBID PARTITION     NAME     USER ST       TIME  NODES <br>
> NODELIST(REASON)<br>
>                 109     debug   SRBD-4 nousheen  R    3:17:48      1 nousheen<br>
> <br>
> Can you please guide me as to why my compute nodes are down and not working?<br>
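> Would checks along these lines be a sensible next step for node 104 (just a
> sketch of what I have in mind)?
>
>     # on node 104 itself: is slurmd actually running, and what does its log say?
>     systemctl status slurmd
>     tail -n 50 /var/log/slurmd.log
>
>     # from the server node, once 104 responds again, clear the DOWN state:
>     scontrol update NodeName=104 State=RESUME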
> <br>
> Thank you for your time.<br>
> <br>
> <br>
> Best Regards,<br>
> Nousheen Parvaiz<br>
> <br>
> <br>
> <br>
> On Thu, Dec 1, 2022 at 8:55 PM Michael Robbert <<a href="mailto:mrobbert@mines.edu" target="_blank">mrobbert@mines.edu</a> <br>
> <mailto:<a href="mailto:mrobbert@mines.edu" target="_blank">mrobbert@mines.edu</a>>> wrote:<br>
> <br>
>     I believe that the error you need to pay attention to for this issue<br>
>     is this line:<br>
> <br>
>     Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for<br>
>     out of sync clocks<br>
> <br>
>     It looks like your compute node's clock is a full day ahead of your<br>
>     controller node: Dec. 2 instead of Dec. 1. The clocks need to be in<br>
>     sync for munge to work.<br>
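>     For example, the offsets can be compared quickly like this (a sketch;
>     chrony is assumed, since it is the CentOS 7 default):
>
>         # run on the controller and on every compute node, then compare the output
>         date
>         timedatectl | grep -E 'Local time|NTP'
>         chronyc tracking | head -n 3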
> <br>
>     *Mike Robbert*<br>
>     *Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced<br>
>     Research Computing*<br>
>     Information and Technology Solutions (ITS)<br>
>     303-273-3786 | mrobbert@mines.edu<br>
>     *Our values:* Trust | Integrity | Respect | Responsibility<br>
> <br>
>     *From: *slurm-users <slurm-users-bounces@lists.schedmd.com> on behalf of<br>
>     Nousheen <nousheenparvaiz@gmail.com><br>
>     *Date: *Thursday, December 1, 2022 at 06:19<br>
>     *To: *Slurm User Community List <slurm-users@lists.schedmd.com><br>
>     *Subject: *[External] [slurm-users] ERROR: slurmctld: auth/munge:<br>
>     _print_cred: DECODED<br>
> <br>
> <br>
>     Hello Everyone,<br>
> <br>
>     I am using Slurm version 21.08.5 and CentOS 7.<br>
> <br>
>     I successfully start slurmd on all the compute nodes, but when I start<br>
>     slurmctld on the server node, it gives the following error:<br>
> <br>
>     *(base) [nousheen@nousheen ~]$ systemctl status slurmctld.service -l*<br>
>     ● slurmctld.service - Slurm controller daemon<br>
>         Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled;<br>
>     vendor preset: disabled)<br>
>         Active: active (running) since Thu 2022-12-01 12:00:42 PKT; 4h<br>
>     16min ago<br>
>       Main PID: 1631 (slurmctld)<br>
>          Tasks: 10<br>
>         Memory: 4.0M<br>
>         CGroup: /system.slice/slurmctld.service<br>
>                 ├─1631 /usr/sbin/slurmctld -D -s<br>
>                 └─1818 slurmctld: slurmscriptd<br>
> <br>
>     Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: auth/munge:<br>
>     _print_cred: DECODED: Thu Dec 01 16:17:19 2022<br>
>     Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for<br>
>     out of sync clocks<br>
>     Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: error: Munge<br>
>     decode failed: Rewound credential<br>
>     Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: auth/munge:<br>
>     _print_cred: ENCODED: Fri Dec 02 16:16:55 2022<br>
>     Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: auth/munge:<br>
>     _print_cred: DECODED: Thu Dec 01 16:17:20 2022<br>
>     Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: error: Check for<br>
>     out of sync clocks<br>
>     Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: error: Munge<br>
>     decode failed: Rewound credential<br>
>     Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: auth/munge:<br>
>     _print_cred: ENCODED: Fri Dec 02 16:16:56 2022<br>
>     Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: auth/munge:<br>
>     _print_cred: DECODED: Thu Dec 01 16:17:21 2022<br>
>     Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: error: Check for<br>
>     out of sync clocks<br>
> <br>
>     When I run the following command on the compute nodes, I get the following<br>
>     output:<br>
> <br>
>     [gpu101@101 ~]$ *munge -n | unmunge*<br>
> <br>
>     STATUS:           Success (0)<br>
>     ENCODE_HOST:      ??? (0.0.0.101)<br>
>     ENCODE_TIME:      2022-12-02 16:33:38 +0500 (1669980818)<br>
>     DECODE_TIME:      2022-12-02 16:33:38 +0500 (1669980818)<br>
>     TTL:              300<br>
>     CIPHER:           aes128 (4)<br>
>     MAC:              sha1 (3)<br>
>     ZIP:              none (0)<br>
>     UID:              gpu101 (1000)<br>
>     GID:              gpu101 (1000)<br>
>     LENGTH:           0<br>
> <br>
>     Is this error occurring because the ENCODE_HOST name shows question marks<br>
>     and the IP is also not picked up correctly by munge? How can I correct<br>
>     this? All the nodes stop responding when I run a job, even though I have<br>
>     all the clocks synced across the cluster.<br>
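>     Would testing a credential between two nodes, roughly like this (taken
>     from the MUNGE installation guide, as far as I understand it), help
>     narrow it down?
>
>         # encode on one node, decode on another (e.g. from 101 to 105)
>         munge -n | ssh 105 unmunge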
> <br>
>     I am new to Slurm. Kindly guide me in this matter.<br>
<br>