[slurm-users] [External] ERROR: slurmctld: auth/munge: _print_cred: DECODED

Nousheen nousheenparvaiz at gmail.com
Fri Dec 2 16:18:44 UTC 2022


Dear Ole,

Thank you so much for your response. I have now adjusted the RealMemory in
slurm.conf, which was previously left at its default value. Your insight was
really helpful. Now, when I submit jobs, they run on three nodes, but one
node (104) is not responding. The output of a few commands is given below.
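
For reference, the node definitions now look roughly like this (an
illustrative sketch only, not a verbatim copy of my slurm.conf; the CPU and
memory figures are the ones reported in the scontrol output below):

# illustrative sketch, not my exact slurm.conf
NodeName=101      NodeAddr=192.168.60.118 CPUs=12 Sockets=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=31919
NodeName=104      NodeAddr=192.168.60.104 CPUs=12 Sockets=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=31889
NodeName=105      NodeAddr=192.168.60.105 CPUs=12 Sockets=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=32051
NodeName=nousheen NodeAddr=192.168.60.194 CPUs=12 Sockets=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=31889
PartitionName=debug Nodes=101,104,105,nousheen Default=YES MaxTime=INFINITE State=UP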


*[root at nousheen ~]# squeue -j*
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               120     debug   SRBD-1 nousheen  R       0:54      1 101
               121     debug   SRBD-2 nousheen  R       0:54      1 105
               122     debug   SRBD-3 nousheen  R       0:54      1 nousheen
               123     debug   SRBD-4 nousheen  R       0:54      2 105,nousheen


*[root at nousheen ~]# scontrol show nodes*
NodeName=101 Arch=x86_64 CoresPerSocket=6
   CPUAlloc=8 CPUTot=12 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.60.118 NodeHostName=101 Version=21.08.4
   OS=Linux 3.10.0-1160.59.1.el7.x86_64 #1 SMP Wed Feb 23 16:47:03 UTC 2022
   RealMemory=31919 AllocMem=0 FreeMem=293 Sockets=1 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2022-11-24T11:18:28 SlurmdStartTime=2022-12-02T19:56:01
   LastBusyTime=2022-12-02T19:58:14
   CfgTRES=cpu=12,mem=31919M,billing=12
   AllocTRES=cpu=8
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=104 Arch=x86_64 CoresPerSocket=6
   CPUAlloc=0 CPUTot=12 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.60.104 NodeHostName=104 Version=21.08.4
   OS=Linux 3.10.0-1160.71.1.el7.x86_64 #1 SMP Tue Jun 28 15:37:28 UTC 2022
   RealMemory=31889 AllocMem=0 FreeMem=30433 Sockets=1 Boards=1
   State=IDLE+NOT_RESPONDING ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2022-11-24T11:15:43 SlurmdStartTime=2022-12-02T19:57:29
   LastBusyTime=2022-12-02T19:58:14
   CfgTRES=cpu=12,mem=31889M,billing=12
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=105 Arch=x86_64 CoresPerSocket=6
   CPUAlloc=12 CPUTot=12 CPULoad=1.03
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.60.105 NodeHostName=105 Version=21.08.4
   OS=Linux 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022
   RealMemory=32051 AllocMem=0 FreeMem=14874 Sockets=1 Boards=1
   State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2022-11-24T11:15:37 SlurmdStartTime=2022-12-02T19:56:57
   LastBusyTime=2022-12-02T19:58:14
   CfgTRES=cpu=12,mem=32051M,billing=12
   AllocTRES=cpu=12
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=nousheen Arch=x86_64 CoresPerSocket=6
   CPUAlloc=12 CPUTot=12 CPULoad=0.32
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.60.194 NodeHostName=nousheen Version=21.08.5
   OS=Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021
   RealMemory=31889 AllocMem=0 FreeMem=16666 Sockets=1 Boards=1
   State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2022-12-01T12:00:18 SlurmdStartTime=2022-12-02T19:56:36
   LastBusyTime=2022-12-02T19:58:15
   CfgTRES=cpu=12,mem=31889M,billing=12
   AllocTRES=cpu=12
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


*[root at 104 ~]# scontrol show slurmd*
Active Steps             = NONE
Actual CPUs              = 12
Actual Boards            = 1
Actual sockets           = 1
Actual cores             = 6
Actual threads per core  = 2
Actual real memory       = 31889 MB
Actual temp disk space   = 106648 MB
Boot time                = 2022-12-02T19:57:29
Hostname                 = 104
Last slurmctld msg time  = NONE
Slurmd PID               = 16906
Slurmd Debug             = 3
Slurmd Logfile           = /var/log/slurmd.log
Version                  = 21.08.4
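
For what it is worth, these are the checks I plan to try next on node 104 and
on the controller (assuming the default SlurmctldPort/SlurmdPort of 6817/6818
and the slurmd log file shown above; please correct me if these are not the
right places to look):

# on 104: check whether slurmd can reach the controller (reports slurmctld status)
scontrol ping

# on 104: look for connection or authentication errors in the slurmd log shown above
tail -n 50 /var/log/slurmd.log

# on the controller: check that slurmd on 104 is reachable (default SlurmdPort 6818)
nc -zv 192.168.60.104 6818

# on 104: check whether a firewall is blocking the Slurm ports (CentOS 7 firewalld)
firewall-cmd --state
firewall-cmd --list-ports

# once 104 responds again, clear any lingering NOT_RESPONDING/DOWN state
scontrol update NodeName=104 State=RESUME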


If you could give me a hint as to what might cause one node to stop
responding, or which files or settings I should focus on, I would be very
grateful. Thank you for your time.

Best regards,

Nousheen


On Fri, Dec 2, 2022 at 11:56 AM Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk>
wrote:

> Hi Nousheen,
>
> It seems that you have configured the nodes incorrectly in slurm.conf.  I
> notice this:
>
>    RealMemory=1
>
> This means 1 Megabyte of RAM; we only had that with IBM PCs back in
> the 1980s :-)
>
> See how to configure nodes in
>
> https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#compute-node-configuration
>
> You must run "slurmd -C" on each node to determine its actual hardware.
>
> I hope this helps.
>
> /Ole
>
> On 12/1/22 21:08, Nousheen wrote:
> > Dear Robbert,
> >
> > Thank you so much for your response. I was so focused on syncing the time
> > that I missed the date on one of the nodes, which was one day behind, as
> > you said. I have corrected it and now I get the following output in the
> > status.
> >
> > *(base) [nousheen at nousheen slurm]$ systemctl status slurmctld.service -l*
> > ● slurmctld.service - Slurm controller daemon
> >     Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
> >     Active: active (running) since Thu 2022-12-01 21:37:34 PKT; 20min ago
> >   Main PID: 19475 (slurmctld)
> >      Tasks: 10
> >     Memory: 4.5M
> >     CGroup: /system.slice/slurmctld.service
> >             ├─19475 /usr/sbin/slurmctld -D -s
> >             └─19538 slurmctld: slurmscriptd
> >
> > Dec 01 21:47:08 nousheen slurmctld[19475]: slurmctld: sched/backfill:
> > _start_job: Started JobId=106 in debug on 101
> > Dec 01 21:47:09 nousheen slurmctld[19475]: slurmctld: _job_complete:
> > JobId=106 WEXITSTATUS 1
> > Dec 01 21:47:09 nousheen slurmctld[19475]: slurmctld: _job_complete:
> > JobId=106 done
> > Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate
> > JobId=107 NodeList=101 #CPUs=8 Partition=debug
> > Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate
> > JobId=108 NodeList=105 #CPUs=8 Partition=debug
> > Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate
> > JobId=109 NodeList=nousheen #CPUs=8 Partition=debug
> > Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: _job_complete:
> > JobId=107 WEXITSTATUS 1
> > Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: _job_complete:
> > JobId=107 done
> > Dec 01 21:47:12 nousheen slurmctld[19475]: slurmctld: _job_complete:
> > JobId=108 WEXITSTATUS 1
> > Dec 01 21:47:12 nousheen slurmctld[19475]: slurmctld: _job_complete:
> > JobId=108 done
> >
> > I have four nodes in total, one of which is the server node. After
> > submitting a job, it only runs on my server compute node while all the
> > other nodes are IDLE, DOWN, or not responding. The details are given below:
> >
> > *(base) [nousheen at nousheen slurm]$ scontrol show nodes*
> > NodeName=101 Arch=x86_64 CoresPerSocket=6
> >     CPUAlloc=0 CPUTot=12 CPULoad=0.01
> >     AvailableFeatures=(null)
> >     ActiveFeatures=(null)
> >     Gres=(null)
> >     NodeAddr=192.168.60.101 NodeHostName=101 Version=21.08.4
> >     OS=Linux 3.10.0-1160.59.1.el7.x86_64 #1 SMP Wed Feb 23 16:47:03 UTC 2022
> >     RealMemory=1 AllocMem=0 FreeMem=641 Sockets=1 Boards=1
> >     State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> >     Partitions=debug
> >     BootTime=2022-11-24T11:18:28 SlurmdStartTime=2022-12-01T21:34:57
> >     LastBusyTime=2022-12-02T00:58:31
> >     CfgTRES=cpu=12,mem=1M,billing=12
> >     AllocTRES=
> >     CapWatts=n/a
> >     CurrentWatts=0 AveWatts=0
> >     ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> >
> > NodeName=104 CoresPerSocket=6
> >     CPUAlloc=0 CPUTot=12 CPULoad=N/A
> >     AvailableFeatures=(null)
> >     ActiveFeatures=(null)
> >     Gres=(null)
> >     NodeAddr=192.168.60.114 NodeHostName=104
> >     RealMemory=1 AllocMem=0 FreeMem=N/A Sockets=1 Boards=1
> >     State=DOWN+NOT_RESPONDING ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> >     Partitions=debug
> >     BootTime=None SlurmdStartTime=None
> >     LastBusyTime=2022-12-01T21:37:35
> >     CfgTRES=cpu=12,mem=1M,billing=12
> >     AllocTRES=
> >     CapWatts=n/a
> >     CurrentWatts=0 AveWatts=0
> >     ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> >     Reason=Not responding [slurm at 2022-12-01T16:22:28]
> >
> > NodeName=105 Arch=x86_64 CoresPerSocket=6
> >     CPUAlloc=0 CPUTot=12 CPULoad=1.08
> >     AvailableFeatures=(null)
> >     ActiveFeatures=(null)
> >     Gres=(null)
> >     NodeAddr=192.168.60.115 NodeHostName=105 Version=21.08.4
> >     OS=Linux 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022
> >     RealMemory=1 AllocMem=0 FreeMem=20723 Sockets=1 Boards=1
> >     State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> >     Partitions=debug
> >     BootTime=2022-11-24T11:15:37 SlurmdStartTime=2022-12-01T16:15:30
> >     LastBusyTime=2022-12-01T21:47:11
> >     CfgTRES=cpu=12,mem=1M,billing=12
> >     AllocTRES=
> >     CapWatts=n/a
> >     CurrentWatts=0 AveWatts=0
> >     ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> >
> > NodeName=nousheen Arch=x86_64 CoresPerSocket=6
> >     CPUAlloc=8 CPUTot=12 CPULoad=6.73
> >     AvailableFeatures=(null)
> >     ActiveFeatures=(null)
> >     Gres=(null)
> >     NodeAddr=192.168.60.149 NodeHostName=nousheen Version=21.08.5
> >     OS=Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021
> >     RealMemory=1 AllocMem=0 FreeMem=22736 Sockets=1 Boards=1
> >     State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> >     Partitions=debug
> >     BootTime=2022-12-01T12:00:08 SlurmdStartTime=2022-12-01T12:00:42
> >     LastBusyTime=2022-12-01T21:37:39
> >     CfgTRES=cpu=12,mem=1M,billing=12
> >     AllocTRES=cpu=8
> >     CapWatts=n/a
> >     CurrentWatts=0 AveWatts=0
> >     ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> >
> > Whereas this command shows only the one node on which a job is running:
> >
> > *(base) [nousheen at nousheen slurm]$ squeue -j*
> >               JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
> >                 109     debug   SRBD-4 nousheen  R    3:17:48      1 nousheen
> >
> > Can you please guide me as to why my compute nodes are down and not working?
> >
> > Thank you for your time.
> >
> >
> > Best Regards,
> > Nousheen Parvaiz
> >
> >
> >
> > On Thu, Dec 1, 2022 at 8:55 PM Michael Robbert <mrobbert at mines.edu> wrote:
> >
> >     I believe that the error you need to pay attention to for this issue
> >     is this line:
> >
> >     Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for
> >     out of sync clocks
> >
> >     It looks like your compute node's clock is a full day ahead of your
> >     controller node: Dec. 2 instead of Dec. 1. The clocks need to be in
> >     sync for munge to work.
> >
> >     *Mike Robbert*
> >
> >     *Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced
> >     Research Computing*
> >
> >     Information and Technology Solutions (ITS)
> >
> >     303-273-3786 | mrobbert at mines.edu
> >
> >     *Our values:* Trust | Integrity | Respect | Responsibility
> >
> >     *From: *slurm-users <slurm-users-bounces at lists.schedmd.com> on
> >     behalf of Nousheen <nousheenparvaiz at gmail.com>
> >     *Date: *Thursday, December 1, 2022 at 06:19
> >     *To: *Slurm User Community List <slurm-users at lists.schedmd.com>
> >     *Subject: *[External] [slurm-users] ERROR: slurmctld: auth/munge:
> >     _print_cred: DECODED
> >
> >     Hello Everyone,
> >
> >     I am using Slurm version 21.08.5 and CentOS 7.
> >
> >     I successfully start slurmd on all compute nodes, but when I start
> >     slurmctld on the server node it gives the following error:
> >
> >     *(base) [nousheen at nousheen ~]$ systemctl status slurmctld.service
> -l*
> >     ● slurmctld.service - Slurm controller daemon
> >         Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled;
> >     vendor preset: disabled)
> >         Active: active (running) since Thu 2022-12-01 12:00:42 PKT; 4h
> >     16min ago
> >       Main PID: 1631 (slurmctld)
> >          Tasks: 10
> >         Memory: 4.0M
> >         CGroup: /system.slice/slurmctld.service
> >     ├─1631 /usr/sbin/slurmctld -D -s
> >                 └─1818 slurmctld: slurmscriptd
> >
> >     Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: auth/munge:
> >     _print_cred: DECODED: Thu Dec 01 16:17:19 2022
> >     Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for
> >     out of sync clocks
> >     Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: error: Munge
> >     decode failed: Rewound credential
> >     Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: auth/munge:
> >     _print_cred: ENCODED: Fri Dec 02 16:16:55 2022
> >     Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: auth/munge:
> >     _print_cred: DECODED: Thu Dec 01 16:17:20 2022
> >     Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: error: Check for
> >     out of sync clocks
> >     Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: error: Munge
> >     decode failed: Rewound credential
> >     Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: auth/munge:
> >     _print_cred: ENCODED: Fri Dec 02 16:16:56 2022
> >     Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: auth/munge:
> >     _print_cred: DECODED: Thu Dec 01 16:17:21 2022
> >     Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: error: Check for
> >     out of sync clocks
> >
> >     When I run the following command on the compute nodes, I get the
> >     following output:
> >
> >     [gpu101 at 101 ~]$ *munge -n | unmunge*
> >
> >     STATUS:           Success (0)
> >     ENCODE_HOST:      ??? (0.0.0.101)
> >     ENCODE_TIME:      2022-12-02 16:33:38 +0500 (1669980818)
> >     DECODE_TIME:      2022-12-02 16:33:38 +0500 (1669980818)
> >     TTL:              300
> >     CIPHER:           aes128 (4)
> >     MAC:              sha1 (3)
> >     ZIP:              none (0)
> >     UID:              gpu101 (1000)
> >     GID:              gpu101 (1000)
> >     LENGTH:           0
> >
> >     Is this error because the ENCODE_HOST name shows question marks and the
> >     IP is also not picked up correctly by munge? How can I correct this? All
> >     the nodes stop responding when I run a job. However, I have all
> >     the clocks synced across the cluster.
> >
> >     I am new to Slurm. Kindly guide me in this matter.
>
>

