<div dir="ltr"><br clear="all"><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr">Dear Ole,<br><br>Thank you so much for your response. I have now adjusted the RealMemory in the slurm.conf which was set by default previously. Your insight was really helpful. Now, when I submit the job, it is running on three nodes but one node (104) is not responding. The details of some commands are given below.<br><br><br><b>[root@nousheen ~]# squeue -j</b><br> JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)<br> 120 debug SRBD-1 nousheen R 0:54 1 101<br> 121 debug SRBD-2 nousheen R 0:54 1 105<br> 122 debug SRBD-3 nousheen R 0:54 1 nousheen<br> 123 debug SRBD-4 nousheen R 0:54 2 105,nousheen<br> <br> <b><br>[root@nousheen ~]# scontrol show nodes</b><br>NodeName=101 Arch=x86_64 CoresPerSocket=6 <br> CPUAlloc=8 CPUTot=12 CPULoad=0.01<br> AvailableFeatures=(null)<br> ActiveFeatures=(null)<br> Gres=(null)<br> NodeAddr=192.168.60.118 NodeHostName=101 Version=21.08.4<br> OS=Linux 3.10.0-1160.59.1.el7.x86_64 #1 SMP Wed Feb 23 16:47:03 UTC 2022 <br> RealMemory=31919 AllocMem=0 FreeMem=293 Sockets=1 Boards=1<br> State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A<br> Partitions=debug <br> BootTime=2022-11-24T11:18:28 SlurmdStartTime=2022-12-02T19:56:01<br> LastBusyTime=2022-12-02T19:58:14<br> CfgTRES=cpu=12,mem=31919M,billing=12<br> AllocTRES=cpu=8<br> CapWatts=n/a<br> CurrentWatts=0 AveWatts=0<br> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s<br><br>NodeName=104 Arch=x86_64 CoresPerSocket=6 <br> CPUAlloc=0 CPUTot=12 CPULoad=0.01<br> AvailableFeatures=(null)<br> ActiveFeatures=(null)<br> Gres=(null)<br> NodeAddr=192.168.60.104 NodeHostName=104 Version=21.08.4<br> OS=Linux 3.10.0-1160.71.1.el7.x86_64 #1 SMP Tue Jun 28 15:37:28 UTC 2022 <br> RealMemory=31889 AllocMem=0 FreeMem=30433 Sockets=1 Boards=1<br> State=IDLE+NOT_RESPONDING ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A<br> Partitions=debug <br> BootTime=2022-11-24T11:15:43 SlurmdStartTime=2022-12-02T19:57:29<br> LastBusyTime=2022-12-02T19:58:14<br> CfgTRES=cpu=12,mem=31889M,billing=12<br> AllocTRES=<br> CapWatts=n/a<br> CurrentWatts=0 AveWatts=0<br> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s<br><br>NodeName=105 Arch=x86_64 CoresPerSocket=6 <br> CPUAlloc=12 CPUTot=12 CPULoad=1.03<br> AvailableFeatures=(null)<br> ActiveFeatures=(null)<br> Gres=(null)<br> NodeAddr=192.168.60.105 NodeHostName=105 Version=21.08.4<br> OS=Linux 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022 <br> RealMemory=32051 AllocMem=0 FreeMem=14874 Sockets=1 Boards=1<br> State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A<br> Partitions=debug <br> BootTime=2022-11-24T11:15:37 SlurmdStartTime=2022-12-02T19:56:57<br> LastBusyTime=2022-12-02T19:58:14<br> CfgTRES=cpu=12,mem=32051M,billing=12<br> AllocTRES=cpu=12<br> CapWatts=n/a<br> CurrentWatts=0 AveWatts=0<br> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s<br><br>NodeName=nousheen Arch=x86_64 CoresPerSocket=6 <br> CPUAlloc=12 CPUTot=12 CPULoad=0.32<br> AvailableFeatures=(null)<br> ActiveFeatures=(null)<br> Gres=(null)<br> NodeAddr=192.168.60.194 NodeHostName=nousheen Version=21.08.5<br> OS=Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021 <br> RealMemory=31889 AllocMem=0 FreeMem=16666 Sockets=1 Boards=1<br> State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A<br> Partitions=debug <br> 
BootTime=2022-12-01T12:00:18 SlurmdStartTime=2022-12-02T19:56:36<br> LastBusyTime=2022-12-02T19:58:15<br> CfgTRES=cpu=12,mem=31889M,billing=12<br> AllocTRES=cpu=12<br> CapWatts=n/a<br> CurrentWatts=0 AveWatts=0<br> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s<br><br><br><b>[root@104 ~]# scontrol show slurmd</b><br>Active Steps = NONE<br>Actual CPUs = 12<br>Actual Boards = 1<br>Actual sockets = 1<br>Actual cores = 6<br>Actual threads per core = 2<br>Actual real memory = 31889 MB<br>Actual temp disk space = 106648 MB<br>Boot time = 2022-12-02T19:57:29<br>Hostname = 104<br>Last slurmctld msg time = NONE<br>Slurmd PID = 16906<br>Slurmd Debug = 3<br>Slurmd Logfile = /var/log/slurmd.log<br>Version = 21.08.4<br><br><br>If you could give me a hint as to what might be the reason behind the one node not responding, or which files or problems I should focus on, I would be highly grateful. Thank you for your time.<br><br>Best regards,<br><br>Nousheen <br></div></div></div></div></div></div><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Dec 2, 2022 at 11:56 AM Ole Holm Nielsen <<a href="mailto:Ole.H.Nielsen@fysik.dtu.dk">Ole.H.Nielsen@fysik.dtu.dk</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi Nousheen,<br>
<br>
It seems that you have configured the nodes incorrectly in slurm.conf. I <br>
notice this:<br>
<br>
RealMemory=1<br>
<br>
This means 1 Megabyte of RAM; we only had that with IBM PCs back in <br>
the 1980s :-)<br>
<br>
See how to configure nodes in <br>
<a href="https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#compute-node-configuration" rel="noreferrer" target="_blank">https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#compute-node-configuration</a><br>
<br>
You must run "slurmd -C" on each node to determine its actual hardware.<br>
<br>
I hope this helps.<br>
<br>
/Ole<br>
<br>
On 12/1/22 21:08, Nousheen wrote:<br>
> Dear Robbert,<br>
> <br>
> Thank you so much for your response. I was so focused on syncing the time that <br>
> I missed the date on one of the nodes, which was one day behind, as you said. <br>
> I have corrected it and now I get the following status output.<br>
> <br>
> *(base) [nousheen@nousheen slurm]$ systemctl status slurmctld.service -l*<br>
> ● slurmctld.service - Slurm controller daemon<br>
> Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor <br>
> preset: disabled)<br>
> Active: active (running) since Thu 2022-12-01 21:37:34 PKT; 20min ago<br>
> Main PID: 19475 (slurmctld)<br>
> Tasks: 10<br>
> Memory: 4.5M<br>
> CGroup: /system.slice/slurmctld.service<br>
> ├─19475 /usr/sbin/slurmctld -D -s<br>
> └─19538 slurmctld: slurmscriptd<br>
> <br>
> Dec 01 21:47:08 nousheen slurmctld[19475]: slurmctld: sched/backfill: <br>
> _start_job: Started JobId=106 in debug on 101<br>
> Dec 01 21:47:09 nousheen slurmctld[19475]: slurmctld: _job_complete: <br>
> JobId=106 WEXITSTATUS 1<br>
> Dec 01 21:47:09 nousheen slurmctld[19475]: slurmctld: _job_complete: <br>
> JobId=106 done<br>
> Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate <br>
> JobId=107 NodeList=101 #CPUs=8 Partition=debug<br>
> Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate <br>
> JobId=108 NodeList=105 #CPUs=8 Partition=debug<br>
> Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate <br>
> JobId=109 NodeList=nousheen #CPUs=8 Partition=debug<br>
> Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: _job_complete: <br>
> JobId=107 WEXITSTATUS 1<br>
> Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: _job_complete: <br>
> JobId=107 done<br>
> Dec 01 21:47:12 nousheen slurmctld[19475]: slurmctld: _job_complete: <br>
> JobId=108 WEXITSTATUS 1<br>
> Dec 01 21:47:12 nousheen slurmctld[19475]: slurmctld: _job_complete: <br>
> JobId=108 done<br>
> <br>
> I have a total of four nodes, one of which is the server node. After submitting <br>
> a job, the job only runs on my server node, while all the other <br>
> nodes are IDLE, DOWN, or not responding. The details are given below:<br>
> <br>
> *(base) [nousheen@nousheen slurm]$ scontrol show nodes*<br>
> NodeName=101 Arch=x86_64 CoresPerSocket=6<br>
> CPUAlloc=0 CPUTot=12 CPULoad=0.01<br>
> AvailableFeatures=(null)<br>
> ActiveFeatures=(null)<br>
> Gres=(null)<br>
> NodeAddr=192.168.60.101 NodeHostName=101 Version=21.08.4<br>
> OS=Linux 3.10.0-1160.59.1.el7.x86_64 #1 SMP Wed Feb 23 16:47:03 UTC 2022<br>
> RealMemory=1 AllocMem=0 FreeMem=641 Sockets=1 Boards=1<br>
> State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A<br>
> Partitions=debug<br>
> BootTime=2022-11-24T11:18:28 SlurmdStartTime=2022-12-01T21:34:57<br>
> LastBusyTime=2022-12-02T00:58:31<br>
> CfgTRES=cpu=12,mem=1M,billing=12<br>
> AllocTRES=<br>
> CapWatts=n/a<br>
> CurrentWatts=0 AveWatts=0<br>
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s<br>
> <br>
> NodeName=104 CoresPerSocket=6<br>
> CPUAlloc=0 CPUTot=12 CPULoad=N/A<br>
> AvailableFeatures=(null)<br>
> ActiveFeatures=(null)<br>
> Gres=(null)<br>
> NodeAddr=192.168.60.114 NodeHostName=104<br>
> RealMemory=1 AllocMem=0 FreeMem=N/A Sockets=1 Boards=1<br>
> State=DOWN+NOT_RESPONDING ThreadsPerCore=2 TmpDisk=0 Weight=1 <br>
> Owner=N/A MCS_label=N/A<br>
> Partitions=debug<br>
> BootTime=None SlurmdStartTime=None<br>
> LastBusyTime=2022-12-01T21:37:35<br>
> CfgTRES=cpu=12,mem=1M,billing=12<br>
> AllocTRES=<br>
> CapWatts=n/a<br>
> CurrentWatts=0 AveWatts=0<br>
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s<br>
> Reason=Not responding [slurm@2022-12-01T16:22:28]<br>
> <br>
> NodeName=105 Arch=x86_64 CoresPerSocket=6<br>
> CPUAlloc=0 CPUTot=12 CPULoad=1.08<br>
> AvailableFeatures=(null)<br>
> ActiveFeatures=(null)<br>
> Gres=(null)<br>
> NodeAddr=192.168.60.115 NodeHostName=105 Version=21.08.4<br>
> OS=Linux 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022<br>
> RealMemory=1 AllocMem=0 FreeMem=20723 Sockets=1 Boards=1<br>
> State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A<br>
> Partitions=debug<br>
> BootTime=2022-11-24T11:15:37 SlurmdStartTime=2022-12-01T16:15:30<br>
> LastBusyTime=2022-12-01T21:47:11<br>
> CfgTRES=cpu=12,mem=1M,billing=12<br>
> AllocTRES=<br>
> CapWatts=n/a<br>
> CurrentWatts=0 AveWatts=0<br>
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s<br>
> <br>
> NodeName=nousheen Arch=x86_64 CoresPerSocket=6<br>
> CPUAlloc=8 CPUTot=12 CPULoad=6.73<br>
> AvailableFeatures=(null)<br>
> ActiveFeatures=(null)<br>
> Gres=(null)<br>
> NodeAddr=192.168.60.149 NodeHostName=nousheen Version=21.08.5<br>
> OS=Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021<br>
> RealMemory=1 AllocMem=0 FreeMem=22736 Sockets=1 Boards=1<br>
> State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A<br>
> Partitions=debug<br>
> BootTime=2022-12-01T12:00:08 SlurmdStartTime=2022-12-01T12:00:42<br>
> LastBusyTime=2022-12-01T21:37:39<br>
> CfgTRES=cpu=12,mem=1M,billing=12<br>
> AllocTRES=cpu=8<br>
> CapWatts=n/a<br>
> CurrentWatts=0 AveWatts=0<br>
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s<br>
> <br>
> Whereas this command shows only one node, on which the job is running:<br>
> <br>
> *(base) [nousheen@nousheen slurm]$ squeue -j*<br>
> JOBID PARTITION NAME USER ST TIME NODES <br>
> NODELIST(REASON)<br>
> 109 debug SRBD-4 nousheen R 3:17:48 1 nousheen<br>
> <br>
> Can you please guide me as to why my compute nodes are down and not working?<br>
> <br>
> Thank you for your time.<br>
> <br>
> <br>
> Best Regards,<br>
> Nousheen Parvaiz<br>
> <br>
> <br>
> <br>
> On Thu, Dec 1, 2022 at 8:55 PM Michael Robbert <<a href="mailto:mrobbert@mines.edu" target="_blank">mrobbert@mines.edu</a> <br>
> <mailto:<a href="mailto:mrobbert@mines.edu" target="_blank">mrobbert@mines.edu</a>>> wrote:<br>
> <br>
> I believe that the error you need to pay attention to for this issue<br>
> is this line:<br>
> <br>
> Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for<br>
> out of sync clocks<br>
> <br>
> It looks like your compute node's clock is a full day ahead of your<br>
> controller node. Dec. 2 instead of Dec. 1. The clocks need to be in<br>
> sync for munge to work.<br>
> <br>
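> A quick way to compare the clocks (a sketch, assuming the controller can ssh<br>
> to each node) is to print the date on every machine in one go:<br>
> <br>
> for h in 101 104 105 nousheen; do echo -n "$h: "; ssh $h date; done<br>
> <br>
> On CentOS 7 the usual fix is to point every machine at the same time source<br>
> and keep chronyd (or ntpd) running, e.g. "systemctl enable --now chronyd".<br>
> <br>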
> *Mike Robbert*<br>
> <br>
> *Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced<br>
> Research Computing*<br>
> <br>
> Information and Technology Solutions (ITS)<br>
> <br>
> 303-273-3786 | <a href="mailto:mrobbert@mines.edu" target="_blank">mrobbert@mines.edu</a> <mailto:<a href="mailto:mrobbert@mines.edu" target="_blank">mrobbert@mines.edu</a>><br>
> <br>
> *Our values:* Trust | Integrity | Respect | Responsibility<br>
> <br>
> *From: *slurm-users <<a href="mailto:slurm-users-bounces@lists.schedmd.com" target="_blank">slurm-users-bounces@lists.schedmd.com</a><br>
> <mailto:<a href="mailto:slurm-users-bounces@lists.schedmd.com" target="_blank">slurm-users-bounces@lists.schedmd.com</a>>> on behalf of Nousheen<br>
> <<a href="mailto:nousheenparvaiz@gmail.com" target="_blank">nousheenparvaiz@gmail.com</a> <mailto:<a href="mailto:nousheenparvaiz@gmail.com" target="_blank">nousheenparvaiz@gmail.com</a>>><br>
> *Date: *Thursday, December 1, 2022 at 06:19<br>
> *To: *Slurm User Community List <<a href="mailto:slurm-users@lists.schedmd.com" target="_blank">slurm-users@lists.schedmd.com</a><br>
> <mailto:<a href="mailto:slurm-users@lists.schedmd.com" target="_blank">slurm-users@lists.schedmd.com</a>>><br>
> *Subject: *[External] [slurm-users] ERROR: slurmctld: auth/munge:<br>
> _print_cred: DECODED<br>
> <br>
> Hello Everyone,<br>
> <br>
> I am using Slurm version 21.08.5 and CentOS 7.<br>
> <br>
> I successfully start slurmd on all compute nodes, but when I start<br>
> slurmctld on the server node it gives the following error:<br>
> <br>
> *(base) [nousheen@nousheen ~]$ systemctl status slurmctld.service -l*<br>
> ● slurmctld.service - Slurm controller daemon<br>
> Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled;<br>
> vendor preset: disabled)<br>
> Active: active (running) since Thu 2022-12-01 12:00:42 PKT; 4h<br>
> 16min ago<br>
> Main PID: 1631 (slurmctld)<br>
> Tasks: 10<br>
> Memory: 4.0M<br>
> CGroup: /system.slice/slurmctld.service<br>
> ├─1631 /usr/sbin/slurmctld -D -s<br>
> └─1818 slurmctld: slurmscriptd<br>
> <br>
> Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: auth/munge:<br>
> _print_cred: DECODED: Thu Dec 01 16:17:19 2022<br>
> Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for<br>
> out of sync clocks<br>
> Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: error: Munge<br>
> decode failed: Rewound credential<br>
> Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: auth/munge:<br>
> _print_cred: ENCODED: Fri Dec 02 16:16:55 2022<br>
> Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: auth/munge:<br>
> _print_cred: DECODED: Thu Dec 01 16:17:20 2022<br>
> Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: error: Check for<br>
> out of sync clocks<br>
> Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: error: Munge<br>
> decode failed: Rewound credential<br>
> Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: auth/munge:<br>
> _print_cred: ENCODED: Fri Dec 02 16:16:56 2022<br>
> Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: auth/munge:<br>
> _print_cred: DECODED: Thu Dec 01 16:17:21 2022<br>
> Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: error: Check for<br>
> out of sync clocks<br>
> <br>
> When I run the following command on the compute nodes, I get the following<br>
> output:<br>
> <br>
> [gpu101@101 ~]$ *munge -n | unmunge*<br>
> <br>
> STATUS: Success (0)<br>
> ENCODE_HOST: ??? (0.0.0.101)<br>
> ENCODE_TIME: 2022-12-02 16:33:38 +0500 (1669980818)<br>
> DECODE_TIME: 2022-12-02 16:33:38 +0500 (1669980818)<br>
> TTL: 300<br>
> CIPHER: aes128 (4)<br>
> MAC: sha1 (3)<br>
> ZIP: none (0)<br>
> UID: gpu101 (1000)<br>
> GID: gpu101 (1000)<br>
> LENGTH: 0<br>
> <br>
> Is this error occurring because the ENCODE_HOST name shows question marks and the<br>
> IP is also not picked up correctly by munge? How can I correct this? All<br>
> the nodes stop responding when I run a job. However, I have all<br>
> the clocks synced across the cluster.<br>
> <br>
> I am new to Slurm. Kindly guide me in this matter.<br>
<br>
</blockquote></div>