[slurm-users] [External] ERROR: slurmctld: auth/munge: _print_cred: DECODED
Nousheen
nousheenparvaiz at gmail.com
Fri Dec 2 16:18:44 UTC 2022
Dear Ole,
Thank you so much for your response. I have now adjusted RealMemory in
slurm.conf, which had previously been left at its default value. Your insight
was really helpful. Now, when I submit jobs, they run on three nodes, but one
node (104) is not responding. The output of some commands is given below.
*[root@nousheen ~]# squeue -j*
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  120     debug   SRBD-1 nousheen  R       0:54      1 101
  121     debug   SRBD-2 nousheen  R       0:54      1 105
  122     debug   SRBD-3 nousheen  R       0:54      1 nousheen
  123     debug   SRBD-4 nousheen  R       0:54      2 105,nousheen
*[root@nousheen ~]# scontrol show nodes*
NodeName=101 Arch=x86_64 CoresPerSocket=6
CPUAlloc=8 CPUTot=12 CPULoad=0.01
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=192.168.60.118 NodeHostName=101 Version=21.08.4
OS=Linux 3.10.0-1160.59.1.el7.x86_64 #1 SMP Wed Feb 23 16:47:03 UTC 2022
RealMemory=31919 AllocMem=0 FreeMem=293 Sockets=1 Boards=1
State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=debug
BootTime=2022-11-24T11:18:28 SlurmdStartTime=2022-12-02T19:56:01
LastBusyTime=2022-12-02T19:58:14
CfgTRES=cpu=12,mem=31919M,billing=12
AllocTRES=cpu=8
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
NodeName=104 Arch=x86_64 CoresPerSocket=6
CPUAlloc=0 CPUTot=12 CPULoad=0.01
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=192.168.60.104 NodeHostName=104 Version=21.08.4
OS=Linux 3.10.0-1160.71.1.el7.x86_64 #1 SMP Tue Jun 28 15:37:28 UTC 2022
RealMemory=31889 AllocMem=0 FreeMem=30433 Sockets=1 Boards=1
State=IDLE+NOT_RESPONDING ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A
MCS_label=N/A
Partitions=debug
BootTime=2022-11-24T11:15:43 SlurmdStartTime=2022-12-02T19:57:29
LastBusyTime=2022-12-02T19:58:14
CfgTRES=cpu=12,mem=31889M,billing=12
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
NodeName=105 Arch=x86_64 CoresPerSocket=6
CPUAlloc=12 CPUTot=12 CPULoad=1.03
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=192.168.60.105 NodeHostName=105 Version=21.08.4
OS=Linux 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022
RealMemory=32051 AllocMem=0 FreeMem=14874 Sockets=1 Boards=1
State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A
MCS_label=N/A
Partitions=debug
BootTime=2022-11-24T11:15:37 SlurmdStartTime=2022-12-02T19:56:57
LastBusyTime=2022-12-02T19:58:14
CfgTRES=cpu=12,mem=32051M,billing=12
AllocTRES=cpu=12
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
NodeName=nousheen Arch=x86_64 CoresPerSocket=6
CPUAlloc=12 CPUTot=12 CPULoad=0.32
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=192.168.60.194 NodeHostName=nousheen Version=21.08.5
OS=Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021
RealMemory=31889 AllocMem=0 FreeMem=16666 Sockets=1 Boards=1
State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A
MCS_label=N/A
Partitions=debug
BootTime=2022-12-01T12:00:18 SlurmdStartTime=2022-12-02T19:56:36
LastBusyTime=2022-12-02T19:58:15
CfgTRES=cpu=12,mem=31889M,billing=12
AllocTRES=cpu=12
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
*[root@104 ~]# scontrol show slurmd*
Active Steps = NONE
Actual CPUs = 12
Actual Boards = 1
Actual sockets = 1
Actual cores = 6
Actual threads per core = 2
Actual real memory = 31889 MB
Actual temp disk space = 106648 MB
Boot time = 2022-12-02T19:57:29
Hostname = 104
Last slurmctld msg time = NONE
Slurmd PID = 16906
Slurmd Debug = 3
Slurmd Logfile = /var/log/slurmd.log
Version = 21.08.4
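
A minimal sketch of how the non-responding node could be picked out of the
`scontrol show nodes` output above programmatically. It is written in Python
purely for illustration and is not part of Slurm: the function names
`parse_nodes` and `not_responding` are hypothetical, and the sample text is an
abbreviated copy of the output above. It assumes that every record starts with
a `NodeName=` field and that the remaining fields are whitespace-separated
`Key=Value` pairs.

```python
# Abbreviated sample in the format printed by `scontrol show nodes`;
# the values are copied from the output earlier in this message.
SAMPLE = """\
NodeName=101 Arch=x86_64 CoresPerSocket=6
NodeAddr=192.168.60.118 NodeHostName=101 Version=21.08.4
State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
NodeName=104 Arch=x86_64 CoresPerSocket=6
NodeAddr=192.168.60.104 NodeHostName=104 Version=21.08.4
State=IDLE+NOT_RESPONDING ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A
"""


def parse_nodes(text):
    """Return {node_name: {key: value}} from `scontrol show nodes`-style text.

    Assumption: a NodeName=... token opens a new record, and every other
    Key=Value token belongs to the most recently opened record.
    """
    nodes = {}
    current = None
    for token in text.split():
        key, sep, value = token.partition("=")
        if not sep:
            continue  # free-text fragments (e.g. parts of the OS= line)
        if key == "NodeName":
            current = nodes.setdefault(value, {})
        elif current is not None:
            current[key] = value
    return nodes


def not_responding(nodes):
    """Return the sorted names of nodes whose State has the NOT_RESPONDING flag."""
    return sorted(
        name
        for name, fields in nodes.items()
        if "NOT_RESPONDING" in fields.get("State", "").split("+")
    )


if __name__ == "__main__":
    nodes = parse_nodes(SAMPLE)
    for name in not_responding(nodes):
        # Show the address slurmctld uses to reach the node.
        print(name, nodes[name].get("NodeAddr", "?"))
    # prints: 104 192.168.60.104
```

On a live cluster, the text captured from `scontrol show nodes` could be passed
to `parse_nodes` in place of `SAMPLE`.
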
If you can give me a hint as to what could be causing one node to stop
responding, or which files or problems I should focus on, I would be
very grateful. Thank you for your time.
Best regards,
Nousheen
On Fri, Dec 2, 2022 at 11:56 AM Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk>
wrote:
> Hi Nousheen,
>
> It seems that you have configured the nodes incorrectly in slurm.conf. I
> notice this:
>
> RealMemory=1
>
> This means 1 megabyte of RAM; we only had this with IBM PCs back in
> the 1980s :-)
>
> See how to configure nodes in
>
> https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#compute-node-configuration
>
> You must run "slurmd -C" on each node to determine its actual hardware.
>
> I hope this helps.
>
> /Ole
>
> On 12/1/22 21:08, Nousheen wrote:
> > Dear Robbert,
> >
> > Thank you so much for your response. I was so focused on syncing the time
> > that I missed the date on one of the nodes, which was 1 day behind, as you
> > said. I have corrected it and now I get the following output in status.
> >
> > *(base) [nousheen@nousheen slurm]$ systemctl status slurmctld.service -l*
> > ● slurmctld.service - Slurm controller daemon
> > Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
> > Active: active (running) since Thu 2022-12-01 21:37:34 PKT; 20min ago
> > Main PID: 19475 (slurmctld)
> > Tasks: 10
> > Memory: 4.5M
> > CGroup: /system.slice/slurmctld.service
> > ├─19475 /usr/sbin/slurmctld -D -s
> > └─19538 slurmctld: slurmscriptd
> >
> > Dec 01 21:47:08 nousheen slurmctld[19475]: slurmctld: sched/backfill:
> > _start_job: Started JobId=106 in debug on 101
> > Dec 01 21:47:09 nousheen slurmctld[19475]: slurmctld: _job_complete:
> > JobId=106 WEXITSTATUS 1
> > Dec 01 21:47:09 nousheen slurmctld[19475]: slurmctld: _job_complete:
> > JobId=106 done
> > Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate
> > JobId=107 NodeList=101 #CPUs=8 Partition=debug
> > Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate
> > JobId=108 NodeList=105 #CPUs=8 Partition=debug
> > Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate
> > JobId=109 NodeList=nousheen #CPUs=8 Partition=debug
> > Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: _job_complete:
> > JobId=107 WEXITSTATUS 1
> > Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: _job_complete:
> > JobId=107 done
> > Dec 01 21:47:12 nousheen slurmctld[19475]: slurmctld: _job_complete:
> > JobId=108 WEXITSTATUS 1
> > Dec 01 21:47:12 nousheen slurmctld[19475]: slurmctld: _job_complete:
> > JobId=108 done
> >
> > I have four nodes in total, one of which is the server node. After I
> > submit a job, it runs only on my server node, while all the other
> > nodes are IDLE, DOWN or not responding. The details are given below:
> >
> > *(base) [nousheen@nousheen slurm]$ scontrol show nodes*
> > NodeName=101 Arch=x86_64 CoresPerSocket=6
> > CPUAlloc=0 CPUTot=12 CPULoad=0.01
> > AvailableFeatures=(null)
> > ActiveFeatures=(null)
> > Gres=(null)
> > NodeAddr=192.168.60.101 NodeHostName=101 Version=21.08.4
> > OS=Linux 3.10.0-1160.59.1.el7.x86_64 #1 SMP Wed Feb 23 16:47:03 UTC 2022
> > RealMemory=1 AllocMem=0 FreeMem=641 Sockets=1 Boards=1
> > State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> > Partitions=debug
> > BootTime=2022-11-24T11:18:28 SlurmdStartTime=2022-12-01T21:34:57
> > LastBusyTime=2022-12-02T00:58:31
> > CfgTRES=cpu=12,mem=1M,billing=12
> > AllocTRES=
> > CapWatts=n/a
> > CurrentWatts=0 AveWatts=0
> > ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> >
> > NodeName=104 CoresPerSocket=6
> > CPUAlloc=0 CPUTot=12 CPULoad=N/A
> > AvailableFeatures=(null)
> > ActiveFeatures=(null)
> > Gres=(null)
> > NodeAddr=192.168.60.114 NodeHostName=104
> > RealMemory=1 AllocMem=0 FreeMem=N/A Sockets=1 Boards=1
> > State=DOWN+NOT_RESPONDING ThreadsPerCore=2 TmpDisk=0 Weight=1
> > Owner=N/A MCS_label=N/A
> > Partitions=debug
> > BootTime=None SlurmdStartTime=None
> > LastBusyTime=2022-12-01T21:37:35
> > CfgTRES=cpu=12,mem=1M,billing=12
> > AllocTRES=
> > CapWatts=n/a
> > CurrentWatts=0 AveWatts=0
> > ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> > Reason=Not responding [slurm at 2022-12-01T16:22:28]
> >
> > NodeName=105 Arch=x86_64 CoresPerSocket=6
> > CPUAlloc=0 CPUTot=12 CPULoad=1.08
> > AvailableFeatures=(null)
> > ActiveFeatures=(null)
> > Gres=(null)
> > NodeAddr=192.168.60.115 NodeHostName=105 Version=21.08.4
> > OS=Linux 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022
> > RealMemory=1 AllocMem=0 FreeMem=20723 Sockets=1 Boards=1
> > State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> > Partitions=debug
> > BootTime=2022-11-24T11:15:37 SlurmdStartTime=2022-12-01T16:15:30
> > LastBusyTime=2022-12-01T21:47:11
> > CfgTRES=cpu=12,mem=1M,billing=12
> > AllocTRES=
> > CapWatts=n/a
> > CurrentWatts=0 AveWatts=0
> > ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> >
> > NodeName=nousheen Arch=x86_64 CoresPerSocket=6
> > CPUAlloc=8 CPUTot=12 CPULoad=6.73
> > AvailableFeatures=(null)
> > ActiveFeatures=(null)
> > Gres=(null)
> > NodeAddr=192.168.60.149 NodeHostName=nousheen Version=21.08.5
> > OS=Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021
> > RealMemory=1 AllocMem=0 FreeMem=22736 Sockets=1 Boards=1
> > State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> > Partitions=debug
> > BootTime=2022-12-01T12:00:08 SlurmdStartTime=2022-12-01T12:00:42
> > LastBusyTime=2022-12-01T21:37:39
> > CfgTRES=cpu=12,mem=1M,billing=12
> > AllocTRES=cpu=8
> > CapWatts=n/a
> > CurrentWatts=0 AveWatts=0
> > ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> >
> > Whereas this command shows only one node on which a job is running:
> >
> > *(base) [nousheen@nousheen slurm]$ squeue -j*
> > JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
> >   109     debug   SRBD-4 nousheen  R    3:17:48      1 nousheen
> >
> > Can you please guide me as to why my compute nodes are down and not
> > working?
> >
> > Thank you for your time.
> >
> >
> > Best Regards,
> > Nousheen Parvaiz
> >
> >
> > On Thu, Dec 1, 2022 at 8:55 PM Michael Robbert <mrobbert at mines.edu
> > <mailto:mrobbert at mines.edu>> wrote:
> >
> > I believe that the error you need to pay attention to for this issue
> > is this line:
> >
> > Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for
> > out of sync clocks
> >
> > It looks like your compute nodes' clocks are a full day ahead of your
> > controller node: Dec. 2 instead of Dec. 1. The clocks need to be in
> > sync for munge to work.
> >
> > *Mike Robbert*
> > *Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced
> > Research Computing*
> > Information and Technology Solutions (ITS)
> > 303-273-3786 | mrobbert at mines.edu <mailto:mrobbert at mines.edu>
> > *Our values:* Trust | Integrity | Respect | Responsibility
> >
> > *From:* slurm-users <slurm-users-bounces at lists.schedmd.com
> > <mailto:slurm-users-bounces at lists.schedmd.com>> on behalf of Nousheen
> > <nousheenparvaiz at gmail.com <mailto:nousheenparvaiz at gmail.com>>
> > *Date:* Thursday, December 1, 2022 at 06:19
> > *To:* Slurm User Community List <slurm-users at lists.schedmd.com
> > <mailto:slurm-users at lists.schedmd.com>>
> > *Subject:* [External] [slurm-users] ERROR: slurmctld: auth/munge:
> > _print_cred: DECODED
> >
> > Hello Everyone,
> >
> > I am using Slurm version 21.08.5 on CentOS 7.
> >
> > slurmd starts successfully on all compute nodes, but when I start
> > slurmctld on the server node it gives the following error:
> >
> > *(base) [nousheen@nousheen ~]$ systemctl status slurmctld.service -l*
> > ● slurmctld.service - Slurm controller daemon
> > Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled;
> > vendor preset: disabled)
> > Active: active (running) since Thu 2022-12-01 12:00:42 PKT; 4h
> > 16min ago
> > Main PID: 1631 (slurmctld)
> > Tasks: 10
> > Memory: 4.0M
> > CGroup: /system.slice/slurmctld.service
> > ├─1631 /usr/sbin/slurmctld -D -s
> > └─1818 slurmctld: slurmscriptd
> >
> > Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: auth/munge:
> > _print_cred: DECODED: Thu Dec 01 16:17:19 2022
> > Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for
> > out of sync clocks
> > Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: error: Munge
> > decode failed: Rewound credential
> > Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: auth/munge:
> > _print_cred: ENCODED: Fri Dec 02 16:16:55 2022
> > Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: auth/munge:
> > _print_cred: DECODED: Thu Dec 01 16:17:20 2022
> > Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: error: Check for
> > out of sync clocks
> > Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: error: Munge
> > decode failed: Rewound credential
> > Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: auth/munge:
> > _print_cred: ENCODED: Fri Dec 02 16:16:56 2022
> > Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: auth/munge:
> > _print_cred: DECODED: Thu Dec 01 16:17:21 2022
> > Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: error: Check for
> > out of sync clocks
> >
> > When I run the following command on the compute nodes, I get the following
> > output:
> >
> > [gpu101@101 ~]$ *munge -n | unmunge*
> >
> > STATUS: Success (0)
> > ENCODE_HOST: ??? (0.0.0.101)
> > ENCODE_TIME: 2022-12-02 16:33:38 +0500 (1669980818)
> > DECODE_TIME: 2022-12-02 16:33:38 +0500 (1669980818)
> > TTL: 300
> > CIPHER: aes128 (4)
> > MAC: sha1 (3)
> > ZIP: none (0)
> > UID: gpu101 (1000)
> > GID: gpu101 (1000)
> > LENGTH: 0
> >
> > Is this error occurring because ENCODE_HOST shows question marks and
> > munge does not pick up the IP address correctly? How can I correct this?
> > All the nodes stop responding when I run a job, even though I have all
> > the clocks synced across the cluster.
> >
> > I am new to Slurm. Kindly guide me in this matter.
>
>