[slurm-users] [External] ERROR: slurmctld: auth/munge: _print_cred: DECODED
Nousheen
nousheenparvaiz at gmail.com
Thu Dec 1 20:08:54 UTC 2022
Dear Robbert,
Thank you so much for your response. I was so focused on syncing the time that I
missed the date on one of the nodes, which was one day behind, as you said. I
have corrected it, and now I get the following status output.
*(base) [nousheen at nousheen slurm]$ systemctl status slurmctld.service -l*
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-12-01 21:37:34 PKT; 20min ago
 Main PID: 19475 (slurmctld)
    Tasks: 10
   Memory: 4.5M
   CGroup: /system.slice/slurmctld.service
           ├─19475 /usr/sbin/slurmctld -D -s
           └─19538 slurmctld: slurmscriptd

Dec 01 21:47:08 nousheen slurmctld[19475]: slurmctld: sched/backfill: _start_job: Started JobId=106 in debug on 101
Dec 01 21:47:09 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=106 WEXITSTATUS 1
Dec 01 21:47:09 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=106 done
Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate JobId=107 NodeList=101 #CPUs=8 Partition=debug
Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate JobId=108 NodeList=105 #CPUs=8 Partition=debug
Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate JobId=109 NodeList=nousheen #CPUs=8 Partition=debug
Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=107 WEXITSTATUS 1
Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=107 done
Dec 01 21:47:12 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=108 WEXITSTATUS 1
Dec 01 21:47:12 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=108 done
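
For reference, the one-day offset can be read directly from the ENCODED and DECODED timestamps in the munge log lines quoted further down in this thread. The following is a minimal Python sketch, not part of Slurm or munge: the function name clock_skew_seconds is made up for illustration, and the 300-second limit is the TTL value reported by unmunge in the earlier message.

```python
from datetime import datetime

# Timestamp layout used by the auth/munge _print_cred log lines.
FMT = "%a %b %d %H:%M:%S %Y"

# TTL (in seconds) as reported by `unmunge` in the quoted message.
MUNGE_TTL_SECONDS = 300


def clock_skew_seconds(encoded: str, decoded: str) -> float:
    """Return encode time minus decode time, in seconds (hypothetical helper)."""
    delta = datetime.strptime(encoded, FMT) - datetime.strptime(decoded, FMT)
    return delta.total_seconds()


# One ENCODED/DECODED pair copied from the quoted slurmctld log.
skew = clock_skew_seconds("Fri Dec 02 16:16:55 2022", "Thu Dec 01 16:17:20 2022")
exceeds_ttl = abs(skew) > MUNGE_TTL_SECONDS
print(skew, exceeds_ttl)
```

With these values the credential was encoded 86375 seconds (just under 24 hours) ahead of the controller's clock, far beyond the TTL, which is consistent with the "Rewound credential" and "Check for out of sync clocks" errors.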
I have four nodes in total, one of which is the server node. After I submit a
job, it runs only on my server node, while the other nodes are IDLE or
DOWN and not responding. The details are given below:
*(base) [nousheen at nousheen slurm]$ scontrol show nodes*
NodeName=101 Arch=x86_64 CoresPerSocket=6
CPUAlloc=0 CPUTot=12 CPULoad=0.01
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=192.168.60.101 NodeHostName=101 Version=21.08.4
OS=Linux 3.10.0-1160.59.1.el7.x86_64 #1 SMP Wed Feb 23 16:47:03 UTC 2022
RealMemory=1 AllocMem=0 FreeMem=641 Sockets=1 Boards=1
State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=debug
BootTime=2022-11-24T11:18:28 SlurmdStartTime=2022-12-01T21:34:57
LastBusyTime=2022-12-02T00:58:31
CfgTRES=cpu=12,mem=1M,billing=12
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
NodeName=104 CoresPerSocket=6
CPUAlloc=0 CPUTot=12 CPULoad=N/A
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=192.168.60.114 NodeHostName=104
RealMemory=1 AllocMem=0 FreeMem=N/A Sockets=1 Boards=1
State=DOWN+NOT_RESPONDING ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=debug
BootTime=None SlurmdStartTime=None
LastBusyTime=2022-12-01T21:37:35
CfgTRES=cpu=12,mem=1M,billing=12
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Not responding [slurm at 2022-12-01T16:22:28]
NodeName=105 Arch=x86_64 CoresPerSocket=6
CPUAlloc=0 CPUTot=12 CPULoad=1.08
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=192.168.60.115 NodeHostName=105 Version=21.08.4
OS=Linux 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022
RealMemory=1 AllocMem=0 FreeMem=20723 Sockets=1 Boards=1
State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=debug
BootTime=2022-11-24T11:15:37 SlurmdStartTime=2022-12-01T16:15:30
LastBusyTime=2022-12-01T21:47:11
CfgTRES=cpu=12,mem=1M,billing=12
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
NodeName=nousheen Arch=x86_64 CoresPerSocket=6
CPUAlloc=8 CPUTot=12 CPULoad=6.73
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=192.168.60.149 NodeHostName=nousheen Version=21.08.5
OS=Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021
RealMemory=1 AllocMem=0 FreeMem=22736 Sockets=1 Boards=1
State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=debug
BootTime=2022-12-01T12:00:08 SlurmdStartTime=2022-12-01T12:00:42
LastBusyTime=2022-12-01T21:37:39
CfgTRES=cpu=12,mem=1M,billing=12
AllocTRES=cpu=8
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
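
To see at a glance which nodes the controller considers unavailable, the output above can be reduced to one state per node. This is a minimal Python sketch that only parses the text layout shown above; the helper name node_states is hypothetical, and the sample text is abbreviated from that output.

```python
def node_states(scontrol_output: str) -> dict:
    """Map each NodeName to its State value (hypothetical helper).

    Assumes the KEY=VALUE token layout of `scontrol show nodes` as pasted above.
    """
    states = {}
    current = None
    for token in scontrol_output.split():
        key, sep, value = token.partition("=")
        if not sep:
            continue  # free text such as the words inside a Reason field
        if key == "NodeName":
            current = value
        elif key == "State" and current is not None:
            states[current] = value
    return states


# Abbreviated copy of the node records shown above.
sample = """
NodeName=101 Arch=x86_64 CoresPerSocket=6
State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
NodeName=104 CoresPerSocket=6
State=DOWN+NOT_RESPONDING ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A
Reason=Not responding [slurm at 2022-12-01T16:22:28]
NodeName=105 Arch=x86_64 CoresPerSocket=6
State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
NodeName=nousheen Arch=x86_64 CoresPerSocket=6
State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
"""

states = node_states(sample)
down = [name for name, state in states.items() if "DOWN" in state]
print(states)
print(down)
```

On the sample, only node 104 is reported as DOWN; nodes 101 and 105 are IDLE and the server node is MIXED.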
Whereas this command shows the job running on only one node:
*(base) [nousheen at nousheen slurm]$ squeue -j*
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  109     debug   SRBD-4 nousheen  R    3:17:48      1 nousheen
Can you please guide me as to why my compute nodes are down and not working?
Thank you for your time.
Best Regards,
Nousheen Parvaiz
On Thu, Dec 1, 2022 at 8:55 PM Michael Robbert <mrobbert at mines.edu> wrote:
> I believe that the error you need to pay attention to for this issue is
> this line:
>
> Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for out of sync clocks
>
> It looks like your compute node's clock is a full day ahead of your
> controller node's: Dec. 2 instead of Dec. 1. The clocks need to be in sync
> for munge to work.
>
>
>
> *Mike Robbert*
>
> *Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced Research
> Computing*
>
> Information and Technology Solutions (ITS)
>
> 303-273-3786 | mrobbert at mines.edu
>
> *From: *slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
> Nousheen <nousheenparvaiz at gmail.com>
> *Date: *Thursday, December 1, 2022 at 06:19
> *To: *Slurm User Community List <slurm-users at lists.schedmd.com>
> *Subject: *[External] [slurm-users] ERROR: slurmctld: auth/munge:
> _print_cred: DECODED
>
> Hello Everyone,
>
> I am using Slurm version 21.08.5 on CentOS 7.
>
> I can start slurmd successfully on all compute nodes, but when I start
> slurmctld on the server node it reports the following errors:
>
> *(base) [nousheen at nousheen ~]$ systemctl status slurmctld.service -l*
> ● slurmctld.service - Slurm controller daemon
>    Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
>    Active: active (running) since Thu 2022-12-01 12:00:42 PKT; 4h 16min ago
>  Main PID: 1631 (slurmctld)
>     Tasks: 10
>    Memory: 4.0M
>    CGroup: /system.slice/slurmctld.service
>            ├─1631 /usr/sbin/slurmctld -D -s
>            └─1818 slurmctld: slurmscriptd
>
> Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: auth/munge: _print_cred: DECODED: Thu Dec 01 16:17:19 2022
> Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for out of sync clocks
> Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: error: Munge decode failed: Rewound credential
> Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: auth/munge: _print_cred: ENCODED: Fri Dec 02 16:16:55 2022
> Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: auth/munge: _print_cred: DECODED: Thu Dec 01 16:17:20 2022
> Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: error: Check for out of sync clocks
> Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: error: Munge decode failed: Rewound credential
> Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: auth/munge: _print_cred: ENCODED: Fri Dec 02 16:16:56 2022
> Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: auth/munge: _print_cred: DECODED: Thu Dec 01 16:17:21 2022
> Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: error: Check for out of sync clocks
>
> When I run the following command on the compute nodes, I get this output:
>
> [gpu101 at 101 ~]$ *munge -n | unmunge*
>
> STATUS: Success (0)
> ENCODE_HOST: ??? (0.0.0.101)
> ENCODE_TIME: 2022-12-02 16:33:38 +0500 (1669980818)
> DECODE_TIME: 2022-12-02 16:33:38 +0500 (1669980818)
> TTL: 300
> CIPHER: aes128 (4)
> MAC: sha1 (3)
> ZIP: none (0)
> UID: gpu101 (1000)
> GID: gpu101 (1000)
> LENGTH: 0
>
> Is this error caused by the ENCODE_HOST name showing question marks and the
> IP address not being picked up correctly by munge? How can I correct this?
> All the nodes keep showing as not responding when I run a job, even though I
> have all the clocks synced across the cluster.
>
> I am new to Slurm. Kindly guide me in this matter.
>
> Best Regards,
>
> Nousheen Parvaiz
> Ph.D. Scholar