[slurm-users] [External] ERROR: slurmctld: auth/munge: _print_cred: DECODED

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Fri Dec 2 06:52:44 UTC 2022


Hi Nousheen,

It seems that you have configured the nodes incorrectly in slurm.conf.  I 
notice this:

   RealMemory=1

This means 1 megabyte of RAM; we only had that on IBM PCs back in the 
1980s :-)

See how to configure nodes in 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#compute-node-configuration

You must run "slurmd -C" on each node to determine its actual hardware.
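
For example, here is a made-up illustration (use the values that "slurmd -C" 
actually reports on your nodes, not these numbers):

   # On the compute node:
   $ slurmd -C
   NodeName=101 CPUs=12 Boards=1 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=31906

   # Matching definition in slurm.conf; set RealMemory slightly below the
   # reported value to leave headroom for the OS:
   NodeName=101 NodeAddr=192.168.60.101 CPUs=12 Sockets=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=31000

After editing slurm.conf (it must be identical on all nodes), restart 
slurmctld and the slurmd daemons so the new node definition takes effect.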

I hope this helps.

/Ole

On 12/1/22 21:08, Nousheen wrote:
> Dear Robbert,
> 
> Thank you so much for your response. I was so focused on the time sync that 
> I missed the date on one of the nodes, which was one day behind, as you 
> said. I have corrected it and now I get the following output in status.
> 
> *(base) [nousheen at nousheen slurm]$ systemctl status slurmctld.service -l*
> ● slurmctld.service - Slurm controller daemon
>     Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor 
> preset: disabled)
>     Active: active (running) since Thu 2022-12-01 21:37:34 PKT; 20min ago
>   Main PID: 19475 (slurmctld)
>      Tasks: 10
>     Memory: 4.5M
>     CGroup: /system.slice/slurmctld.service
>             ├─19475 /usr/sbin/slurmctld -D -s
>             └─19538 slurmctld: slurmscriptd
> 
> Dec 01 21:47:08 nousheen slurmctld[19475]: slurmctld: sched/backfill: 
> _start_job: Started JobId=106 in debug on 101
> Dec 01 21:47:09 nousheen slurmctld[19475]: slurmctld: _job_complete: 
> JobId=106 WEXITSTATUS 1
> Dec 01 21:47:09 nousheen slurmctld[19475]: slurmctld: _job_complete: 
> JobId=106 done
> Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate 
> JobId=107 NodeList=101 #CPUs=8 Partition=debug
> Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate 
> JobId=108 NodeList=105 #CPUs=8 Partition=debug
> Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate 
> JobId=109 NodeList=nousheen #CPUs=8 Partition=debug
> Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: _job_complete: 
> JobId=107 WEXITSTATUS 1
> Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: _job_complete: 
> JobId=107 done
> Dec 01 21:47:12 nousheen slurmctld[19475]: slurmctld: _job_complete: 
> JobId=108 WEXITSTATUS 1
> Dec 01 21:47:12 nousheen slurmctld[19475]: slurmctld: _job_complete: 
> JobId=108 done
> 
> I have four nodes in total, one of which is the server node. After 
> submitting a job, the job only runs on my server node (which is also a 
> compute node) while all the other nodes are IDLE, DOWN, or non-responding. 
> The details are given below:
> 
> *(base) [nousheen at nousheen slurm]$ scontrol show nodes*
> NodeName=101 Arch=x86_64 CoresPerSocket=6
>     CPUAlloc=0 CPUTot=12 CPULoad=0.01
>     AvailableFeatures=(null)
>     ActiveFeatures=(null)
>     Gres=(null)
>     NodeAddr=192.168.60.101 NodeHostName=101 Version=21.08.4
>     OS=Linux 3.10.0-1160.59.1.el7.x86_64 #1 SMP Wed Feb 23 16:47:03 UTC 2022
>     RealMemory=1 AllocMem=0 FreeMem=641 Sockets=1 Boards=1
>     State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>     Partitions=debug
>     BootTime=2022-11-24T11:18:28 SlurmdStartTime=2022-12-01T21:34:57
>     LastBusyTime=2022-12-02T00:58:31
>     CfgTRES=cpu=12,mem=1M,billing=12
>     AllocTRES=
>     CapWatts=n/a
>     CurrentWatts=0 AveWatts=0
>     ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> 
> NodeName=104 CoresPerSocket=6
>     CPUAlloc=0 CPUTot=12 CPULoad=N/A
>     AvailableFeatures=(null)
>     ActiveFeatures=(null)
>     Gres=(null)
>     NodeAddr=192.168.60.114 NodeHostName=104
>     RealMemory=1 AllocMem=0 FreeMem=N/A Sockets=1 Boards=1
>     State=DOWN+NOT_RESPONDING ThreadsPerCore=2 TmpDisk=0 Weight=1 
> Owner=N/A MCS_label=N/A
>     Partitions=debug
>     BootTime=None SlurmdStartTime=None
>     LastBusyTime=2022-12-01T21:37:35
>     CfgTRES=cpu=12,mem=1M,billing=12
>     AllocTRES=
>     CapWatts=n/a
>     CurrentWatts=0 AveWatts=0
>     ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>     Reason=Not responding [slurm at 2022-12-01T16:22:28]
> 
> NodeName=105 Arch=x86_64 CoresPerSocket=6
>     CPUAlloc=0 CPUTot=12 CPULoad=1.08
>     AvailableFeatures=(null)
>     ActiveFeatures=(null)
>     Gres=(null)
>     NodeAddr=192.168.60.115 NodeHostName=105 Version=21.08.4
>     OS=Linux 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022
>     RealMemory=1 AllocMem=0 FreeMem=20723 Sockets=1 Boards=1
>     State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>     Partitions=debug
>     BootTime=2022-11-24T11:15:37 SlurmdStartTime=2022-12-01T16:15:30
>     LastBusyTime=2022-12-01T21:47:11
>     CfgTRES=cpu=12,mem=1M,billing=12
>     AllocTRES=
>     CapWatts=n/a
>     CurrentWatts=0 AveWatts=0
>     ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> 
> NodeName=nousheen Arch=x86_64 CoresPerSocket=6
>     CPUAlloc=8 CPUTot=12 CPULoad=6.73
>     AvailableFeatures=(null)
>     ActiveFeatures=(null)
>     Gres=(null)
>     NodeAddr=192.168.60.149 NodeHostName=nousheen Version=21.08.5
>     OS=Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021
>     RealMemory=1 AllocMem=0 FreeMem=22736 Sockets=1 Boards=1
>     State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>     Partitions=debug
>     BootTime=2022-12-01T12:00:08 SlurmdStartTime=2022-12-01T12:00:42
>     LastBusyTime=2022-12-01T21:37:39
>     CfgTRES=cpu=12,mem=1M,billing=12
>     AllocTRES=cpu=8
>     CapWatts=n/a
>     CurrentWatts=0 AveWatts=0
>     ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> 
> Whereas this command shows only the one node on which the job is running:
> 
> *(base) [nousheen at nousheen slurm]$ squeue -j*
>               JOBID PARTITION     NAME     USER ST       TIME  NODES 
> NODELIST(REASON)
>                 109     debug   SRBD-4 nousheen  R    3:17:48      1 nousheen
> 
> Can you please guide me as to why my compute nodes are down and not working?
> 
> Thank you for your time.
> 
> 
> Best Regards,
> Nousheen Parvaiz
> 
> 
> On Thu, Dec 1, 2022 at 8:55 PM Michael Robbert <mrobbert at mines.edu 
> <mailto:mrobbert at mines.edu>> wrote:
> 
>     I believe that the error you need to pay attention to for this issue
>     is this line:
> 
> 
>     Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for
>     out of sync clocks
> 
>     It looks like your compute node's clock is a full day ahead of your
>     controller node's: Dec. 2 instead of Dec. 1. The clocks need to be in
>     sync for munge to work.
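> 
>     For example, on CentOS 7 you can compare the clocks on each node like
>     this (assuming chronyd is the time daemon in use; ntpd setups differ):
> 
>         # Check local time, time zone, and NTP sync status
>         timedatectl
>         # With chrony, show the current offset from the NTP source
>         chronyc tracking
>         # Step the clock immediately if it is far off (run as root)
>         chronyc makestep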
> 
>     *Mike Robbert*
> 
>     *Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced
>     Research Computing*
> 
>     Information and Technology Solutions (ITS)
> 
>     303-273-3786 | mrobbert at mines.edu <mailto:mrobbert at mines.edu>
> 
>     *Our values:* Trust | Integrity | Respect | Responsibility
> 
>     *From: *slurm-users <slurm-users-bounces at lists.schedmd.com
>     <mailto:slurm-users-bounces at lists.schedmd.com>> on behalf of Nousheen
>     <nousheenparvaiz at gmail.com <mailto:nousheenparvaiz at gmail.com>>
>     *Date: *Thursday, December 1, 2022 at 06:19
>     *To: *Slurm User Community List <slurm-users at lists.schedmd.com
>     <mailto:slurm-users at lists.schedmd.com>>
>     *Subject: *[External] [slurm-users] ERROR: slurmctld: auth/munge:
>     _print_cred: DECODED
> 
>     Hello Everyone,
> 
>     I am using Slurm version 21.08.5 and CentOS 7.
> 
>     I successfully started slurmd on all compute nodes, but when I start
>     slurmctld on the server node it gives the following error:
> 
>     *(base) [nousheen at nousheen ~]$ systemctl status slurmctld.service -l*
>     ● slurmctld.service - Slurm controller daemon
>         Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled;
>     vendor preset: disabled)
>         Active: active (running) since Thu 2022-12-01 12:00:42 PKT; 4h
>     16min ago
>       Main PID: 1631 (slurmctld)
>          Tasks: 10
>         Memory: 4.0M
>         CGroup: /system.slice/slurmctld.service
>                 ├─1631 /usr/sbin/slurmctld -D -s
>                 └─1818 slurmctld: slurmscriptd
> 
>     Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: auth/munge:
>     _print_cred: DECODED: Thu Dec 01 16:17:19 2022
>     Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for
>     out of sync clocks
>     Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: error: Munge
>     decode failed: Rewound credential
>     Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: auth/munge:
>     _print_cred: ENCODED: Fri Dec 02 16:16:55 2022
>     Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: auth/munge:
>     _print_cred: DECODED: Thu Dec 01 16:17:20 2022
>     Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: error: Check for
>     out of sync clocks
>     Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: error: Munge
>     decode failed: Rewound credential
>     Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: auth/munge:
>     _print_cred: ENCODED: Fri Dec 02 16:16:56 2022
>     Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: auth/munge:
>     _print_cred: DECODED: Thu Dec 01 16:17:21 2022
>     Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: error: Check for
>     out of sync clocks
> 
>     When I run the following command on the compute nodes, I get the
>     following output:
> 
>     [gpu101 at 101 ~]$ *munge -n | unmunge*
> 
>     STATUS:           Success (0)
>     ENCODE_HOST:      ??? (0.0.0.101)
>     ENCODE_TIME:      2022-12-02 16:33:38 +0500 (1669980818)
>     DECODE_TIME:      2022-12-02 16:33:38 +0500 (1669980818)
>     TTL:              300
>     CIPHER:           aes128 (4)
>     MAC:              sha1 (3)
>     ZIP:              none (0)
>     UID:              gpu101 (1000)
>     GID:              gpu101 (1000)
>     LENGTH:           0
> 
>     Is this error occurring because the ENCODE_HOST name shows question
>     marks and the IP is also not picked up correctly by munge? How can I
>     correct this? All the nodes keep going non-responding when I run a
>     job, even though I have all the clocks synced across the cluster.
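> 
>     (One way to check this further, suggested in the munge installation
>     guide, is to decode a credential on another host over ssh, e.g.
> 
>         munge -n | ssh 192.168.60.101 unmunge
> 
>     where the address is just an example. If the key, the clocks, and
>     hostname resolution all agree between the two hosts, this should
>     report STATUS: Success with a sensible ENCODE_HOST; "???" usually
>     means the encoding host's address could not be resolved to a name,
>     which an /etc/hosts entry mapping each node's IP to its hostname
>     can fix.)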
> 
>     I am new to Slurm. Kindly guide me in this matter.


