[slurm-users] [External] ERROR: slurmctld: auth/munge: _print_cred: DECODED

Thu Dec 1 20:08:54 UTC 2022

Dear Robbert,

Thank you so much for your response. I was so focused on syncing the time that I
missed the date on one of the nodes, which was a day out of sync, as you said. I
have corrected it, and now I get the following status output.

*(base) [nousheen at nousheen slurm]$ systemctl status slurmctld.service -l*
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-12-01 21:37:34 PKT; 20min ago
 Main PID: 19475 (slurmctld)
    Tasks: 10
   Memory: 4.5M
   CGroup: /system.slice/slurmctld.service
           ├─19475 /usr/sbin/slurmctld -D -s
           └─19538 slurmctld: slurmscriptd

Dec 01 21:47:08 nousheen slurmctld[19475]: slurmctld: sched/backfill: _start_job: Started JobId=106 in debug on 101
Dec 01 21:47:09 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=106 WEXITSTATUS 1
Dec 01 21:47:09 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=106 done
Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate JobId=107 NodeList=101 #CPUs=8 Partition=debug
Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate JobId=108 NodeList=105 #CPUs=8 Partition=debug
Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate JobId=109 NodeList=nousheen #CPUs=8 Partition=debug
Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=107 WEXITSTATUS 1
Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=107 done
Dec 01 21:47:12 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=108 WEXITSTATUS 1
Dec 01 21:47:12 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=108 done
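
For reference, the size of the earlier clock offset can be worked out from the ENCODED and DECODED timestamps in the slurmctld log quoted further down. The following is only a minimal sketch: it assumes both timestamps are in the same time zone, that ENCODED reflects the sending node's clock and DECODED the controller's clock, and it takes the TTL of 300 seconds from the unmunge output. The variable names are made up for illustration.

```python
from datetime import datetime

# Timestamps copied from one ENCODED/DECODED pair in the quoted slurmctld log.
# Assumption: neither carries a time zone, so both are treated as the same zone.
FMT = "%a %b %d %H:%M:%S %Y"
encoded = datetime.strptime("Fri Dec 02 16:16:55 2022", FMT)  # sender's clock (assumed)
decoded = datetime.strptime("Thu Dec 01 16:17:20 2022", FMT)  # controller's clock (assumed)

MUNGE_TTL = 300  # seconds, the TTL value shown by unmunge

# How far the encode time lies ahead of the decode time.
skew = (encoded - decoded).total_seconds()
print(f"skew = {skew:.0f} s, TTL = {MUNGE_TTL} s")

if skew > MUNGE_TTL:
    # The credential looks as if it was created in the future, which seems
    # consistent with the "Rewound credential" error in the log.
    print("encode time is further in the future than the TTL allows")
```

With these values the offset is just under one full day, far larger than the TTL, which matches the "Check for out of sync clocks" message.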

I have four nodes in total, one of which is the server node. After submitting a
job, it runs only on my server node, while the other nodes are IDLE, DOWN, or
not responding. The details are given below:

*(base) [nousheen at nousheen slurm]$ scontrol show nodes*
NodeName=101 Arch=x86_64 CoresPerSocket=6
   CPUAlloc=0 CPUTot=12 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.60.101 NodeHostName=101 Version=21.08.4
   OS=Linux 3.10.0-1160.59.1.el7.x86_64 #1 SMP Wed Feb 23 16:47:03 UTC 2022
   RealMemory=1 AllocMem=0 FreeMem=641 Sockets=1 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2022-11-24T11:18:28 SlurmdStartTime=2022-12-01T21:34:57
   LastBusyTime=2022-12-02T00:58:31
   CfgTRES=cpu=12,mem=1M,billing=12
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=104 CoresPerSocket=6
   CPUAlloc=0 CPUTot=12 CPULoad=N/A
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.60.114 NodeHostName=104
   RealMemory=1 AllocMem=0 FreeMem=N/A Sockets=1 Boards=1
   State=DOWN+NOT_RESPONDING ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=None SlurmdStartTime=None
   LastBusyTime=2022-12-01T21:37:35
   CfgTRES=cpu=12,mem=1M,billing=12
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Not responding [slurm at 2022-12-01T16:22:28]

NodeName=105 Arch=x86_64 CoresPerSocket=6
   CPUAlloc=0 CPUTot=12 CPULoad=1.08
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.60.115 NodeHostName=105 Version=21.08.4
   OS=Linux 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022
   RealMemory=1 AllocMem=0 FreeMem=20723 Sockets=1 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2022-11-24T11:15:37 SlurmdStartTime=2022-12-01T16:15:30
   LastBusyTime=2022-12-01T21:47:11
   CfgTRES=cpu=12,mem=1M,billing=12
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=nousheen Arch=x86_64 CoresPerSocket=6
   CPUAlloc=8 CPUTot=12 CPULoad=6.73
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.60.149 NodeHostName=nousheen Version=21.08.5
   OS=Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021
   RealMemory=1 AllocMem=0 FreeMem=22736 Sockets=1 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2022-12-01T12:00:08 SlurmdStartTime=2022-12-01T12:00:42
   LastBusyTime=2022-12-01T21:37:39
   CfgTRES=cpu=12,mem=1M,billing=12
   AllocTRES=cpu=8
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
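
To make the per-node state easier to read, the output above can be reduced to a NodeName-to-State mapping. This is a minimal sketch, not part of Slurm: `node_states` is a hypothetical helper, and it assumes the output keeps the whitespace-separated `Key=Value` layout shown above. The sample text is an abridged copy of that output.

```python
def node_states(scontrol_output: str) -> dict:
    """Map each NodeName to its State from `scontrol show nodes` text.

    Hypothetical helper: assumes whitespace-separated Key=Value tokens.
    """
    states = {}
    current = None
    for token in scontrol_output.split():
        key, sep, value = token.partition("=")
        if not sep:
            continue  # not a Key=Value token, e.g. part of the OS string
        if key == "NodeName":
            current = value
        elif key == "State" and current is not None:
            states[current] = value
    return states


# Abridged from the scontrol output above (only the relevant lines).
sample = """
NodeName=101 Arch=x86_64 CoresPerSocket=6
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
NodeName=104 CoresPerSocket=6
   State=DOWN+NOT_RESPONDING ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A
NodeName=105 Arch=x86_64 CoresPerSocket=6
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
NodeName=nousheen Arch=x86_64 CoresPerSocket=6
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
"""

states = node_states(sample)
down = [name for name, state in states.items() if "DOWN" in state]
print(states)
print("down nodes:", down)
```

On this sample only node 104 is reported as DOWN; 101 and 105 are IDLE and nousheen is MIXED.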

Whereas this command shows only one node on which a job is running:

*(base) [nousheen at nousheen slurm]$ squeue -j*
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               109     debug   SRBD-4 nousheen  R    3:17:48      1 nousheen

Can you please guide me as to why my compute nodes are down and not working?

Thank you for your time.

Best Regards,
Nousheen Parvaiz

On Thu, Dec 1, 2022 at 8:55 PM Michael Robbert <mrobbert at mines.edu> wrote:

> I believe that the error you need to pay attention to for this issue is
> this line:
>
>
>
> Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for out of sync clocks
>
>
>
>
>
> It looks like your compute nodes' clocks are a full day ahead of your
> controller node's: Dec. 2 instead of Dec. 1. The clocks need to be in sync
> for munge to work.
>
>
>
> *Mike Robbert*
>
> *Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced Research
> Computing*
>
> Information and Technology Solutions (ITS)
>
> 303-273-3786 | mrobbert at mines.edu
>
> *Our values:* Trust | Integrity | Respect | Responsibility
>
>
>
>
>
> *From: *slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
> Nousheen <nousheenparvaiz at gmail.com>
> *Date: *Thursday, December 1, 2022 at 06:19
> *To: *Slurm User Community List <slurm-users at lists.schedmd.com>
> *Subject: *[External] [slurm-users] ERROR: slurmctld: auth/munge:
> _print_cred: DECODED
>
>
>
>
>
>
>
> Hello Everyone,
>
>
>
> I am using Slurm version 21.08.5 and CentOS 7.
>
>
>
> I can successfully start slurmd on all compute nodes, but when I start
> slurmctld on the server node it gives the following error:
>
>
>
> *(base) [nousheen at nousheen ~]$ systemctl status slurmctld.service -l*
> ● slurmctld.service - Slurm controller daemon
>    Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
>    Active: active (running) since Thu 2022-12-01 12:00:42 PKT; 4h 16min ago
>  Main PID: 1631 (slurmctld)
>     Tasks: 10
>    Memory: 4.0M
>    CGroup: /system.slice/slurmctld.service
>            ├─1631 /usr/sbin/slurmctld -D -s
>            └─1818 slurmctld: slurmscriptd
>
> Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: auth/munge: _print_cred: DECODED: Thu Dec 01 16:17:19 2022
> Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for out of sync clocks
> Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: error: Munge decode failed: Rewound credential
> Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: auth/munge: _print_cred: ENCODED: Fri Dec 02 16:16:55 2022
> Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: auth/munge: _print_cred: DECODED: Thu Dec 01 16:17:20 2022
> Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: error: Check for out of sync clocks
> Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: error: Munge decode failed: Rewound credential
> Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: auth/munge: _print_cred: ENCODED: Fri Dec 02 16:16:56 2022
> Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: auth/munge: _print_cred: DECODED: Thu Dec 01 16:17:21 2022
> Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: error: Check for out of sync clocks
>
>
>
> When I run the following command on compute nodes I get the following
> output:
>
>
>
>  [gpu101 at 101 ~]$* munge -n | unmunge*
>
> STATUS:           Success (0)
> ENCODE_HOST:      ??? (0.0.0.101)
> ENCODE_TIME:      2022-12-02 16:33:38 +0500 (1669980818)
> DECODE_TIME:      2022-12-02 16:33:38 +0500 (1669980818)
> TTL:              300
> CIPHER:           aes128 (4)
> MAC:              sha1 (3)
> ZIP:              none (0)
> UID:              gpu101 (1000)
> GID:              gpu101 (1000)
> LENGTH:           0
>
>
>
> Is this error occurring because the ENCODE_HOST name shows question marks
> and the IP is also not picked up correctly by munge? How can I correct this?
> All the nodes stop responding when I run a job, even though I have all the
> clocks synced across the cluster.
>
>
>
> I am new to Slurm. Kindly guide me in this matter.
>
>
>
>
>
>
> Best Regards,
>
> Nousheen Parvaiz
> Ph.D. Scholar
>
>
>