[slurm-users] [External] ERROR: slurmctld: auth/munge: _print_cred: DECODED
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Fri Dec 2 06:52:44 UTC 2022
Hi Nousheen,
It seems that you have configured the nodes incorrectly in slurm.conf. I
notice this:
RealMemory=1
This means 1 Megabyte of RAM; we only had that in IBM PCs back in the
1980s :-)
See how to configure nodes in
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#compute-node-configuration
You must run "slurmd -C" on each node to determine its actual hardware.
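For example (a minimal sketch; the memory values below are made up for
illustration, using your node 101):

  $ slurmd -C
  NodeName=101 CPUs=12 Boards=1 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=31900
  UpTime=8-01:23:45

You would then carry those values into the node's definition in
slurm.conf, perhaps rounding RealMemory down a little to leave room for
the OS:

  NodeName=101 NodeAddr=192.168.60.101 CPUs=12 Sockets=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=31000

Remember to distribute the same slurm.conf to all nodes and run
"scontrol reconfigure" afterwards.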
I hope this helps.
/Ole
On 12/1/22 21:08, Nousheen wrote:
> Dear Robbert,
>
> Thank you so much for your response. I was so focused on the time sync
> that I missed the date on one of the nodes, which was one day behind, as
> you said. I have corrected it, and now I get the following status output.
>
> *(base) [nousheen at nousheen slurm]$ systemctl status slurmctld.service -l*
> ● slurmctld.service - Slurm controller daemon
> Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor
> preset: disabled)
> Active: active (running) since Thu 2022-12-01 21:37:34 PKT; 20min ago
> Main PID: 19475 (slurmctld)
> Tasks: 10
> Memory: 4.5M
> CGroup: /system.slice/slurmctld.service
> ├─19475 /usr/sbin/slurmctld -D -s
> └─19538 slurmctld: slurmscriptd
>
> Dec 01 21:47:08 nousheen slurmctld[19475]: slurmctld: sched/backfill:
> _start_job: Started JobId=106 in debug on 101
> Dec 01 21:47:09 nousheen slurmctld[19475]: slurmctld: _job_complete:
> JobId=106 WEXITSTATUS 1
> Dec 01 21:47:09 nousheen slurmctld[19475]: slurmctld: _job_complete:
> JobId=106 done
> Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate
> JobId=107 NodeList=101 #CPUs=8 Partition=debug
> Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate
> JobId=108 NodeList=105 #CPUs=8 Partition=debug
> Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate
> JobId=109 NodeList=nousheen #CPUs=8 Partition=debug
> Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: _job_complete:
> JobId=107 WEXITSTATUS 1
> Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: _job_complete:
> JobId=107 done
> Dec 01 21:47:12 nousheen slurmctld[19475]: slurmctld: _job_complete:
> JobId=108 WEXITSTATUS 1
> Dec 01 21:47:12 nousheen slurmctld[19475]: slurmctld: _job_complete:
> JobId=108 done
>
> I have four nodes in total, one of which is the server node. After
> submitting a job, the job only runs on the server node itself, while all
> the other nodes are IDLE, DOWN, or not responding. The details are given
> below:
>
> *(base) [nousheen at nousheen slurm]$ scontrol show nodes*
> NodeName=101 Arch=x86_64 CoresPerSocket=6
> CPUAlloc=0 CPUTot=12 CPULoad=0.01
> AvailableFeatures=(null)
> ActiveFeatures=(null)
> Gres=(null)
> NodeAddr=192.168.60.101 NodeHostName=101 Version=21.08.4
> OS=Linux 3.10.0-1160.59.1.el7.x86_64 #1 SMP Wed Feb 23 16:47:03 UTC 2022
> RealMemory=1 AllocMem=0 FreeMem=641 Sockets=1 Boards=1
> State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> Partitions=debug
> BootTime=2022-11-24T11:18:28 SlurmdStartTime=2022-12-01T21:34:57
> LastBusyTime=2022-12-02T00:58:31
> CfgTRES=cpu=12,mem=1M,billing=12
> AllocTRES=
> CapWatts=n/a
> CurrentWatts=0 AveWatts=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> NodeName=104 CoresPerSocket=6
> CPUAlloc=0 CPUTot=12 CPULoad=N/A
> AvailableFeatures=(null)
> ActiveFeatures=(null)
> Gres=(null)
> NodeAddr=192.168.60.114 NodeHostName=104
> RealMemory=1 AllocMem=0 FreeMem=N/A Sockets=1 Boards=1
> State=DOWN+NOT_RESPONDING ThreadsPerCore=2 TmpDisk=0 Weight=1
> Owner=N/A MCS_label=N/A
> Partitions=debug
> BootTime=None SlurmdStartTime=None
> LastBusyTime=2022-12-01T21:37:35
> CfgTRES=cpu=12,mem=1M,billing=12
> AllocTRES=
> CapWatts=n/a
> CurrentWatts=0 AveWatts=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> Reason=Not responding [slurm at 2022-12-01T16:22:28]
>
> NodeName=105 Arch=x86_64 CoresPerSocket=6
> CPUAlloc=0 CPUTot=12 CPULoad=1.08
> AvailableFeatures=(null)
> ActiveFeatures=(null)
> Gres=(null)
> NodeAddr=192.168.60.115 NodeHostName=105 Version=21.08.4
> OS=Linux 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022
> RealMemory=1 AllocMem=0 FreeMem=20723 Sockets=1 Boards=1
> State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> Partitions=debug
> BootTime=2022-11-24T11:15:37 SlurmdStartTime=2022-12-01T16:15:30
> LastBusyTime=2022-12-01T21:47:11
> CfgTRES=cpu=12,mem=1M,billing=12
> AllocTRES=
> CapWatts=n/a
> CurrentWatts=0 AveWatts=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> NodeName=nousheen Arch=x86_64 CoresPerSocket=6
> CPUAlloc=8 CPUTot=12 CPULoad=6.73
> AvailableFeatures=(null)
> ActiveFeatures=(null)
> Gres=(null)
> NodeAddr=192.168.60.149 NodeHostName=nousheen Version=21.08.5
> OS=Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021
> RealMemory=1 AllocMem=0 FreeMem=22736 Sockets=1 Boards=1
> State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> Partitions=debug
> BootTime=2022-12-01T12:00:08 SlurmdStartTime=2022-12-01T12:00:42
> LastBusyTime=2022-12-01T21:37:39
> CfgTRES=cpu=12,mem=1M,billing=12
> AllocTRES=cpu=8
> CapWatts=n/a
> CurrentWatts=0 AveWatts=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> Whereas this command shows only the one node on which the job is running:
>
> *(base) [nousheen at nousheen slurm]$ squeue -j*
> JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
> 109 debug SRBD-4 nousheen R 3:17:48 1 nousheen
>
> Can you please guide me as to why my compute nodes are down and not working?
>
> Thank you for your time.
>
>
> Best Regards,
> Nousheen Parvaiz
>
> On Thu, Dec 1, 2022 at 8:55 PM Michael Robbert <mrobbert at mines.edu
> <mailto:mrobbert at mines.edu>> wrote:
>
> I believe that the error you need to pay attention to for this issue
> is this line:
>
> Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for
> out of sync clocks
>
> It looks like your compute nodes' clocks are a full day ahead of your
> controller node's: Dec. 2 instead of Dec. 1. The clocks need to be in
> sync for munge to work.
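>
> For reference, a minimal sketch of how to check and fix this (assuming
> chrony, the NTP client CentOS 7 ships by default; run on every node):
>
>   # Show this node's current date and time:
>   date
>
>   # Check whether the node is synchronized to an NTP source:
>   chronyc tracking
>
>   # If the clock is off, enable the service and step the clock into sync:
>   systemctl enable --now chronyd
>   chronyc makestep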
>
> *Mike Robbert*
> *Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced
> Research Computing*
> Information and Technology Solutions (ITS)
> 303-273-3786 | mrobbert at mines.edu <mailto:mrobbert at mines.edu>
> *Our values:* Trust | Integrity | Respect | Responsibility
>
>
> *From: *slurm-users <slurm-users-bounces at lists.schedmd.com
> <mailto:slurm-users-bounces at lists.schedmd.com>> on behalf of Nousheen
> <nousheenparvaiz at gmail.com <mailto:nousheenparvaiz at gmail.com>>
> *Date: *Thursday, December 1, 2022 at 06:19
> *To: *Slurm User Community List <slurm-users at lists.schedmd.com
> <mailto:slurm-users at lists.schedmd.com>>
> *Subject: *[External] [slurm-users] ERROR: slurmctld: auth/munge:
> _print_cred: DECODED
>
>
> Hello Everyone,
>
> I am using Slurm version 21.08.5 and CentOS 7.
>
> I successfully start slurmd on all the compute nodes, but when I start
> slurmctld on the server node, it gives the following error:
>
>
> *(base) [nousheen at nousheen ~]$ systemctl status slurmctld.service -l*
> ● slurmctld.service - Slurm controller daemon
> Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled;
> vendor preset: disabled)
> Active: active (running) since Thu 2022-12-01 12:00:42 PKT; 4h
> 16min ago
> Main PID: 1631 (slurmctld)
> Tasks: 10
> Memory: 4.0M
> CGroup: /system.slice/slurmctld.service
> ├─1631 /usr/sbin/slurmctld -D -s
> └─1818 slurmctld: slurmscriptd
>
> Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: auth/munge:
> _print_cred: DECODED: Thu Dec 01 16:17:19 2022
> Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for
> out of sync clocks
> Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: error: Munge
> decode failed: Rewound credential
> Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: auth/munge:
> _print_cred: ENCODED: Fri Dec 02 16:16:55 2022
> Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: auth/munge:
> _print_cred: DECODED: Thu Dec 01 16:17:20 2022
> Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: error: Check for
> out of sync clocks
> Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: error: Munge
> decode failed: Rewound credential
> Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: auth/munge:
> _print_cred: ENCODED: Fri Dec 02 16:16:56 2022
> Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: auth/munge:
> _print_cred: DECODED: Thu Dec 01 16:17:21 2022
> Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: error: Check for
> out of sync clocks
>
> When I run the following command on the compute nodes, I get this
> output:
>
> [gpu101 at 101 ~]$ *munge -n | unmunge*
>
> STATUS: Success (0)
> ENCODE_HOST: ??? (0.0.0.101)
> ENCODE_TIME: 2022-12-02 16:33:38 +0500 (1669980818)
> DECODE_TIME: 2022-12-02 16:33:38 +0500 (1669980818)
> TTL: 300
> CIPHER: aes128 (4)
> MAC: sha1 (3)
> ZIP: none (0)
> UID: gpu101 (1000)
> GID: gpu101 (1000)
> LENGTH: 0
>
> Is this error occurring because the ENCODE_HOST name shows question
> marks and the IP is also not picked up correctly by munge? How can I
> correct this? All the nodes keep not responding when I run a job.
> However, I have all the clocks synced across the cluster.
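>
> For context, one common cause of "???" in ENCODE_HOST is that the node's
> hostname does not resolve to its real IP address. A hypothetical check
> (the hostname and address below are just this node's own) on each node:
>
>   hostname                    # e.g. prints "101"
>   getent hosts $(hostname)    # should return the node's real address
>
> If the lookup fails or returns a wrong address, adding a line such as
> "192.168.60.101 101" to /etc/hosts on every node is one common fix.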
>
> I am new to Slurm. Kindly guide me in this matter.