[slurm-users] [External] ERROR: slurmctld: auth/munge: _print_cred: DECODED

Nousheen nousheenparvaiz at gmail.com
Sat Dec 3 19:30:52 UTC 2022


Hello Michael,

Thank you so much for your response. I have taken some time to explore the
problem. The network seems stable: all nodes are connected to the same
network, and three of them are working while node 104 is down. Node 104 has
the same hardware specifications as the others, and the installations were
done side by side on every node. I have tried restarting Slurm several times,
but the node goes from idle* to down*. The output of some commands is given
below.


*[root at 104 ~]# systemctl status slurmd.service*
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor
preset: disabled)
   Active: active (running) since Sat 2022-12-03 22:50:00 PKT; 11min ago
 Main PID: 18754 (slurmd)
    Tasks: 1
   Memory: 672.0K
   CGroup: /system.slice/slurmd.service
           └─18754 /usr/sbin/slurmd -D -s

Dec 03 22:50:00 104 systemd[1]: Started Slurm node daemon.
Dec 03 22:50:00 104 slurmd[18754]: slurmd: slurmd version 21.08.4 started
Dec 03 22:50:00 104 slurmd[18754]: slurmd: killing old slurmd[18744]

*[root at nousheen ~]# ping 192.168.60.104*
PING 192.168.60.104 (192.168.60.104) 56(84) bytes of data.
64 bytes from 192.168.60.104: icmp_seq=1 ttl=64 time=0.284 ms
64 bytes from 192.168.60.104: icmp_seq=2 ttl=64 time=0.254 ms
64 bytes from 192.168.60.104: icmp_seq=3 ttl=64 time=0.269 ms
64 bytes from 192.168.60.104: icmp_seq=4 ttl=64 time=0.260 ms
64 bytes from 192.168.60.104: icmp_seq=5 ttl=64 time=0.275 ms
64 bytes from 192.168.60.104: icmp_seq=6 ttl=64 time=0.262 ms

*[root at nousheen ~]# vi /var/log/slurmctld.log*
[2022-12-02T20:03:14.428] error: Nodes 104 not responding
[2022-12-02T20:04:58.686] error: Nodes 104 not responding, setting DOWN
[2022-12-02T20:54:09.863] Node 104 now responding
[2022-12-02T20:54:09.863] node 104 returned to service
[2022-12-02T20:58:14.691] error: Nodes 104 not responding
[2022-12-02T21:01:38.007] error: Nodes 104 not responding, setting DOWN
[2022-12-03T13:58:40.878] update_node: node 104 state set to IDLE
[2022-12-03T14:03:14.142] error: Nodes 104 not responding
[2022-12-03T14:05:20.392] error: Nodes 104 not responding, setting DOWN
[2022-12-03T22:02:40.680] _job_complete: JobId=120 WEXITSTATUS 0
[2022-12-03T22:02:40.680] _job_complete: JobId=120 done
[2022-12-03T22:33:36.036] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=121
uid 1000
[2022-12-03T22:33:36.064] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=122
uid 1000
[2022-12-03T22:33:36.118] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=123
uid 1000
[2022-12-03T22:43:18.943] update_node: node 104 state set to IDLE
[2022-12-03T22:48:14.660] error: Nodes 104 not responding
[2022-12-03T22:48:15.257] Node 104 now responding
[2022-12-03T22:53:14.142] error: Nodes 104 not responding
[2022-12-03T22:54:59.397] error: Nodes 104 not responding, setting DOWN
[2022-12-03T23:15:30.236] _slurm_rpc_submit_batch_job: JobId=124
InitPrio=4294901748 usec=588
[2022-12-03T23:15:30.274] sched/backfill: _start_job: Started JobId=124 in
debug on 101
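
The log excerpt above shows node 104 repeatedly going from "not responding" to "setting DOWN". A minimal sketch for tallying those events is given here; the function name count_down_events is hypothetical, and the match pattern assumes the node is reported alone on the line (as in this log) rather than as part of a longer node list.

```shell
# Hypothetical helper: count how often a node was set DOWN for not
# responding. Reads slurmctld log lines on standard input.
# Assumption: the node appears alone in the message ("Nodes 104 ..."),
# as it does in the excerpt above.
count_down_events() {
    node="$1"
    # grep -c prints 0 and exits non-zero when nothing matches,
    # so "|| true" keeps the function's exit status at 0.
    grep -c "error: Nodes ${node} not responding, setting DOWN" || true
}

# Example, run on the controller:
# count_down_events 104 < /var/log/slurmctld.log
```

Applied to the lines pasted above, this counts four such events for node 104.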


*[root at 104 ~]# vi /var/log/slurmd.log*
[2022-12-03T22:48:18.255] debug:  jobacct_gather/none: init: Job accounting
gather NOT_INVOKED plugin loaded
[2022-12-03T22:48:18.255] debug:  job_container/none: init: job_container
none plugin loaded
[2022-12-03T22:48:18.256] debug:  switch Cray/Aries plugin loaded.
[2022-12-03T22:48:18.256] debug:  switch/none: init: switch NONE plugin
loaded
[2022-12-03T22:48:18.257] slurmd started on Sat, 03 Dec 2022 22:48:18 +0500
[2022-12-03T22:48:18.262] CPUs=12 Boards=1 Sockets=1 Cores=6 Threads=2
Memory=31889 TmpDisk=106648 Uptime=819160 CPUSpecList=(null)
FeaturesAvail=(null) FeaturesActive=(null)
[2022-12-03T22:48:18.262] debug:  acct_gather_energy/none: init:
AcctGatherEnergy NONE plugin loaded
[2022-12-03T22:48:18.263] debug:  acct_gather_Profile/none: init:
AcctGatherProfile NONE plugin loaded
[2022-12-03T22:48:18.263] debug:  acct_gather_interconnect/none: init:
AcctGatherInterconnect NONE plugin loaded
[2022-12-03T22:48:18.263] debug:  acct_gather_filesystem/none: init:
AcctGatherFilesystem NONE plugin loaded
[2022-12-03T22:48:18.263] debug2: No acct_gather.conf file
(/etc/slurm/acct_gather.conf)
[2022-12-03T22:48:18.265] debug:  _handle_node_reg_resp: slurmctld sent
back 8 TRES.
[2022-12-03T22:50:00.284] slurmd version 21.08.4 started
[2022-12-03T22:50:00.284] killing old slurmd[18744]

Currently, only one job is running, on node 101, which shows the mix state. I
don't understand why 101 is in the mix state when the only thing running on it
is one job. Secondly, what does the asterisk mean for 104 (down*)?

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1  down* 104
debug*       up   infinite      1    mix 101
debug*       up   infinite      2   idle 105,nousheen
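
For reference, the sinfo man page describes "mix" as a node with some of its CPUs allocated while others are idle, and a trailing "*" as a node that is presently not responding. A single job that requests fewer CPUs than the node has (for example 8 of 12) is therefore enough to put a node in the mix state. The sketch below restates those meanings; explain_state is a hypothetical helper, it covers only the four states and the one suffix that appear in this thread, and the wording of each description is a paraphrase.

```shell
# Hypothetical helper: explain a sinfo STATE value such as "down*" or "mix".
# Only the states seen in this thread are handled; other suffixes that
# sinfo can print are not covered.
explain_state() {
    state="$1"
    case "$state" in
        *\*) note="not responding to the controller"; base="${state%\*}" ;;
        *)   note=""; base="$state" ;;
    esac
    case "$base" in
        mix)   desc="some CPUs allocated, others idle" ;;
        alloc) desc="all CPUs allocated" ;;
        idle)  desc="no CPUs allocated" ;;
        down)  desc="unavailable for use" ;;
        *)     desc="see the sinfo man page" ;;
    esac
    if [ -n "$note" ]; then
        echo "${base}: ${desc}; ${note}"
    else
        echo "${base}: ${desc}"
    fi
}

# Example:
# explain_state 'down*'
# explain_state mix
```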


The output given in slurmd.log is beyond my understanding. I have searched
for it on the internet but failed to find a proper solution. Any kind of
help in understanding the problem would be highly appreciated. Thank you
for your time.
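
One pattern in the output above may be worth checking, offered as an assumption rather than a diagnosis: slurmd.log on 104 records a registration response from slurmctld, so traffic from the node to the controller appears to work, while the controller keeps reporting the node as not responding. A firewall on 104 blocking the slurmd port would produce that picture. The sketch below tests from the controller whether a TCP port on a node accepts connections. The function name check_port is hypothetical, 6818 is Slurm's default SlurmdPort and may be overridden in slurm.conf, and bash plus coreutils timeout are assumed to be installed.

```shell
# Sketch: test whether a TCP port on a host accepts connections.
# Assumptions: bash and coreutils "timeout" are available; the port to
# test is whatever SlurmdPort is set to (6818 unless slurm.conf changes it).
check_port() {
    host="$1"
    port="$2"
    if [ -z "$host" ] || [ -z "$port" ]; then
        echo "usage: check_port HOST PORT" >&2
        return 2
    fi
    # bash's /dev/tcp pseudo-device opens a TCP connection; give up after 3 s.
    if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
        echo "${host}:${port} reachable"
    else
        echo "${host}:${port} NOT reachable"
        return 1
    fi
}

# Example, run on the controller:
# check_port 192.168.60.104 6818
```

If the port is not reachable on 104 but is on the working nodes, comparing the firewall settings of 104 with those of a working node would be a reasonable next step.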

Best regards,

Nousheen



On Fri, Dec 2, 2022 at 10:42 PM Michael Robbert <mrobbert at mines.edu> wrote:

> Nousheen,
>
> When a node is not responding the first place to start is to ensure that
> the node is up and slurmd is running. It looks like you have confirmed that
> with your output from the command “scontrol show slurmd” so that is a good
> start. After verifying that slurmd is running the next step would be to
> examine the logs. Look at /var/log/slurmd.log on that node and see if it
> tells you why it can’t communicate with the slurm controller.
>
> Other things to think about since this is a new setup are to make sure the
> network is stable and that DNS is working properly for all nodes. Make sure
> that all nodes in the cluster can do correct DNS resolution of all other
> nodes in the cluster.
>
>
>
> *Mike Robbert*
>
> *Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced Research
> Computing*
>
> Information and Technology Solutions (ITS)
>
> 303-273-3786 | mrobbert at mines.edu
>
> *Our values:* Trust | Integrity | Respect | Responsibility
>
>
>
>
>
> *From: *slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
> Nousheen <nousheenparvaiz at gmail.com>
> *Date: *Friday, December 2, 2022 at 09:22
> *To: *Slurm User Community List <slurm-users at lists.schedmd.com>
> *Subject: *Re: [slurm-users] [External] ERROR: slurmctld: auth/munge:
> _print_cred: DECODED
>
> *CAUTION:* This email originated from outside of the Colorado School of
> Mines organization. Do not click on links or open attachments unless you
> recognize the sender and know the content is safe.
>
>
>
>
> Dear Ole,
>
> Thank you so much for your response. I have now adjusted the RealMemory in
> the slurm.conf which was set by default previously. Your insight was really
> helpful. Now, when I submit the job, it is running on three nodes but one
> node (104) is not responding. The details of some commands are given below.
>
>
> *[root at nousheen ~]# squeue -j*
>              JOBID PARTITION     NAME     USER ST       TIME  NODES
> NODELIST(REASON)
>                120     debug   SRBD-1 nousheen  R       0:54      1 101
>                121     debug   SRBD-2 nousheen  R       0:54      1 105
>                122     debug   SRBD-3 nousheen  R       0:54      1
> nousheen
>                123     debug   SRBD-4 nousheen  R       0:54      2
> 105,nousheen
>
>
> *[root at nousheen ~]# scontrol show nodes*
> NodeName=101 Arch=x86_64 CoresPerSocket=6
>    CPUAlloc=8 CPUTot=12 CPULoad=0.01
>    AvailableFeatures=(null)
>    ActiveFeatures=(null)
>    Gres=(null)
>    NodeAddr=192.168.60.118 NodeHostName=101 Version=21.08.4
>    OS=Linux 3.10.0-1160.59.1.el7.x86_64 #1 SMP Wed Feb 23 16:47:03 UTC
> 2022
>    RealMemory=31919 AllocMem=0 FreeMem=293 Sockets=1 Boards=1
>    State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>    Partitions=debug
>    BootTime=2022-11-24T11:18:28 SlurmdStartTime=2022-12-02T19:56:01
>    LastBusyTime=2022-12-02T19:58:14
>    CfgTRES=cpu=12,mem=31919M,billing=12
>    AllocTRES=cpu=8
>    CapWatts=n/a
>    CurrentWatts=0 AveWatts=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> NodeName=104 Arch=x86_64 CoresPerSocket=6
>    CPUAlloc=0 CPUTot=12 CPULoad=0.01
>    AvailableFeatures=(null)
>    ActiveFeatures=(null)
>    Gres=(null)
>    NodeAddr=192.168.60.104 NodeHostName=104 Version=21.08.4
>    OS=Linux 3.10.0-1160.71.1.el7.x86_64 #1 SMP Tue Jun 28 15:37:28 UTC
> 2022
>    RealMemory=31889 AllocMem=0 FreeMem=30433 Sockets=1 Boards=1
>    State=IDLE+NOT_RESPONDING ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A
> MCS_label=N/A
>    Partitions=debug
>    BootTime=2022-11-24T11:15:43 SlurmdStartTime=2022-12-02T19:57:29
>    LastBusyTime=2022-12-02T19:58:14
>    CfgTRES=cpu=12,mem=31889M,billing=12
>    AllocTRES=
>    CapWatts=n/a
>    CurrentWatts=0 AveWatts=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> NodeName=105 Arch=x86_64 CoresPerSocket=6
>    CPUAlloc=12 CPUTot=12 CPULoad=1.03
>    AvailableFeatures=(null)
>    ActiveFeatures=(null)
>    Gres=(null)
>    NodeAddr=192.168.60.105 NodeHostName=105 Version=21.08.4
>    OS=Linux 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC
> 2022
>    RealMemory=32051 AllocMem=0 FreeMem=14874 Sockets=1 Boards=1
>    State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A
> MCS_label=N/A
>    Partitions=debug
>    BootTime=2022-11-24T11:15:37 SlurmdStartTime=2022-12-02T19:56:57
>    LastBusyTime=2022-12-02T19:58:14
>    CfgTRES=cpu=12,mem=32051M,billing=12
>    AllocTRES=cpu=12
>    CapWatts=n/a
>    CurrentWatts=0 AveWatts=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> NodeName=nousheen Arch=x86_64 CoresPerSocket=6
>    CPUAlloc=12 CPUTot=12 CPULoad=0.32
>    AvailableFeatures=(null)
>    ActiveFeatures=(null)
>    Gres=(null)
>    NodeAddr=192.168.60.194 NodeHostName=nousheen Version=21.08.5
>    OS=Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021
>    RealMemory=31889 AllocMem=0 FreeMem=16666 Sockets=1 Boards=1
>    State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A
> MCS_label=N/A
>    Partitions=debug
>    BootTime=2022-12-01T12:00:18 SlurmdStartTime=2022-12-02T19:56:36
>    LastBusyTime=2022-12-02T19:58:15
>    CfgTRES=cpu=12,mem=31889M,billing=12
>    AllocTRES=cpu=12
>    CapWatts=n/a
>    CurrentWatts=0 AveWatts=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
>
> *[root at 104 ~]# scontrol show slurmd*
> Active Steps             = NONE
> Actual CPUs              = 12
> Actual Boards            = 1
> Actual sockets           = 1
> Actual cores             = 6
> Actual threads per core  = 2
> Actual real memory       = 31889 MB
> Actual temp disk space   = 106648 MB
> Boot time                = 2022-12-02T19:57:29
> Hostname                 = 104
> Last slurmctld msg time  = NONE
> Slurmd PID               = 16906
> Slurmd Debug             = 3
> Slurmd Logfile           = /var/log/slurmd.log
> Version                  = 21.08.4
>
>
> If you can give me a hint as to what could be the reason behind one node
> not responding, or which files or problems I should focus on, I would be
> highly grateful. Thank you for your time.
>
> Best regards,
>
> Nousheen
>
>
>
>
>
> On Fri, Dec 2, 2022 at 11:56 AM Ole Holm Nielsen <
> Ole.H.Nielsen at fysik.dtu.dk> wrote:
>
> Hi Nousheen,
>
> It seems that you have configured the nodes incorrectly in slurm.conf.  I
> notice this:
>
>    RealMemory=1
>
> This means 1 megabyte of RAM; we only had this with IBM PCs back in
> the 1980s :-)
>
> See how to configure nodes in
>
> https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#compute-node-configuration
>
> You must run "slurmd -C" on each node to determine its actual hardware.
>
> I hope this helps.
>
> /Ole
>
> On 12/1/22 21:08, Nousheen wrote:
> > Dear Robbert,
> >
> > Thank you so much for your response. I was so focused on time sync that
> > I missed the date on one of the nodes, which was 1 day behind as you
> > said. I have corrected it, and now I get the following output in status.
> >
> > *(base) [nousheen at nousheen slurm]$ systemctl status slurmctld.service -l*
> > ● slurmctld.service - Slurm controller daemon
> >     Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
> >     Active: active (running) since Thu 2022-12-01 21:37:34 PKT; 20min ago
> >   Main PID: 19475 (slurmctld)
> >      Tasks: 10
> >     Memory: 4.5M
> >     CGroup: /system.slice/slurmctld.service
> >             ├─19475 /usr/sbin/slurmctld -D -s
> >             └─19538 slurmctld: slurmscriptd
> >
> > Dec 01 21:47:08 nousheen slurmctld[19475]: slurmctld: sched/backfill:
> > _start_job: Started JobId=106 in debug on 101
> > Dec 01 21:47:09 nousheen slurmctld[19475]: slurmctld: _job_complete:
> > JobId=106 WEXITSTATUS 1
> > Dec 01 21:47:09 nousheen slurmctld[19475]: slurmctld: _job_complete:
> > JobId=106 done
> > Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate
> > JobId=107 NodeList=101 #CPUs=8 Partition=debug
> > Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate
> > JobId=108 NodeList=105 #CPUs=8 Partition=debug
> > Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate
> > JobId=109 NodeList=nousheen #CPUs=8 Partition=debug
> > Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: _job_complete:
> > JobId=107 WEXITSTATUS 1
> > Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: _job_complete:
> > JobId=107 done
> > Dec 01 21:47:12 nousheen slurmctld[19475]: slurmctld: _job_complete:
> > JobId=108 WEXITSTATUS 1
> > Dec 01 21:47:12 nousheen slurmctld[19475]: slurmctld: _job_complete:
> > JobId=108 done
> >
> > I have a total of four nodes, one of which is the server node. After
> > submitting a job, the job runs only on my server node, which also acts as
> > a compute node, while all the other nodes are IDLE, DOWN or not
> > responding. The details are given below:
> >
> > *(base) [nousheen at nousheen slurm]$ scontrol show nodes*
> > NodeName=101 Arch=x86_64 CoresPerSocket=6
> >     CPUAlloc=0 CPUTot=12 CPULoad=0.01
> >     AvailableFeatures=(null)
> >     ActiveFeatures=(null)
> >     Gres=(null)
> >     NodeAddr=192.168.60.101 NodeHostName=101 Version=21.08.4
> >     OS=Linux 3.10.0-1160.59.1.el7.x86_64 #1 SMP Wed Feb 23 16:47:03 UTC 2022
> >     RealMemory=1 AllocMem=0 FreeMem=641 Sockets=1 Boards=1
> >     State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> >     Partitions=debug
> >     BootTime=2022-11-24T11:18:28 SlurmdStartTime=2022-12-01T21:34:57
> >     LastBusyTime=2022-12-02T00:58:31
> >     CfgTRES=cpu=12,mem=1M,billing=12
> >     AllocTRES=
> >     CapWatts=n/a
> >     CurrentWatts=0 AveWatts=0
> >     ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> >
> > NodeName=104 CoresPerSocket=6
> >     CPUAlloc=0 CPUTot=12 CPULoad=N/A
> >     AvailableFeatures=(null)
> >     ActiveFeatures=(null)
> >     Gres=(null)
> >     NodeAddr=192.168.60.114 NodeHostName=104
> >     RealMemory=1 AllocMem=0 FreeMem=N/A Sockets=1 Boards=1
> >     State=DOWN+NOT_RESPONDING ThreadsPerCore=2 TmpDisk=0 Weight=1
> > Owner=N/A MCS_label=N/A
> >     Partitions=debug
> >     BootTime=None SlurmdStartTime=None
> >     LastBusyTime=2022-12-01T21:37:35
> >     CfgTRES=cpu=12,mem=1M,billing=12
> >     AllocTRES=
> >     CapWatts=n/a
> >     CurrentWatts=0 AveWatts=0
> >     ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> >     Reason=Not responding [slurm at 2022-12-01T16:22:28]
> >
> > NodeName=105 Arch=x86_64 CoresPerSocket=6
> >     CPUAlloc=0 CPUTot=12 CPULoad=1.08
> >     AvailableFeatures=(null)
> >     ActiveFeatures=(null)
> >     Gres=(null)
> >     NodeAddr=192.168.60.115 NodeHostName=105 Version=21.08.4
> >     OS=Linux 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022
> >     RealMemory=1 AllocMem=0 FreeMem=20723 Sockets=1 Boards=1
> >     State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> >     Partitions=debug
> >     BootTime=2022-11-24T11:15:37 SlurmdStartTime=2022-12-01T16:15:30
> >     LastBusyTime=2022-12-01T21:47:11
> >     CfgTRES=cpu=12,mem=1M,billing=12
> >     AllocTRES=
> >     CapWatts=n/a
> >     CurrentWatts=0 AveWatts=0
> >     ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> >
> > NodeName=nousheen Arch=x86_64 CoresPerSocket=6
> >     CPUAlloc=8 CPUTot=12 CPULoad=6.73
> >     AvailableFeatures=(null)
> >     ActiveFeatures=(null)
> >     Gres=(null)
> >     NodeAddr=192.168.60.149 NodeHostName=nousheen Version=21.08.5
> >     OS=Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021
> >     RealMemory=1 AllocMem=0 FreeMem=22736 Sockets=1 Boards=1
> >     State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> >     Partitions=debug
> >     BootTime=2022-12-01T12:00:08 SlurmdStartTime=2022-12-01T12:00:42
> >     LastBusyTime=2022-12-01T21:37:39
> >     CfgTRES=cpu=12,mem=1M,billing=12
> >     AllocTRES=cpu=8
> >     CapWatts=n/a
> >     CurrentWatts=0 AveWatts=0
> >     ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> >
> > Whereas this command shows only one node with a running job:
> >
> > *(base) [nousheen at nousheen slurm]$ squeue -j*
> >               JOBID PARTITION     NAME     USER ST       TIME  NODES
> > NODELIST(REASON)
> >                 109     debug   SRBD-4 nousheen  R    3:17:48      1
> > nousheen
> >
> > Can you please guide me as to why my compute nodes are down and not
> > working?
> >
> > Thank you for your time.
> >
> >
> > Best Regards,
> > Nousheen Parvaiz
> >
> >
> >
> > On Thu, Dec 1, 2022 at 8:55 PM Michael Robbert <mrobbert at mines.edu> wrote:
> >
> >     I believe that the error you need to pay attention to for this issue
> >     is this line:
> >
> >     Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for
> >     out of sync clocks
> >
> >     It looks like your compute node's clock is a full day ahead of your
> >     controller node's: Dec. 2 instead of Dec. 1. The clocks need to be in
> >     sync for munge to work.
> >
> >     *Mike Robbert*
> >
> >     *Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced
> >     Research Computing*
> >
> >     Information and Technology Solutions (ITS)
> >
> >     303-273-3786 | mrobbert at mines.edu
> >
> >     *Our values:* Trust | Integrity | Respect | Responsibility
> >
> >     *From: *slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
> >     Nousheen <nousheenparvaiz at gmail.com>
> >     *Date: *Thursday, December 1, 2022 at 06:19
> >     *To: *Slurm User Community List <slurm-users at lists.schedmd.com>
> >     *Subject: *[External] [slurm-users] ERROR: slurmctld: auth/munge:
> >     _print_cred: DECODED
> >
> >     Hello Everyone,
> >
> >     I am using Slurm version 21.08.5 on CentOS 7.
> >
> >     I can successfully start slurmd on all compute nodes, but when I start
> >     slurmctld on the server node it gives the following error:
> >
> >     *(base) [nousheen at nousheen ~]$ systemctl status slurmctld.service -l*
> >     ● slurmctld.service - Slurm controller daemon
> >         Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled;
> >     vendor preset: disabled)
> >         Active: active (running) since Thu 2022-12-01 12:00:42 PKT; 4h
> >     16min ago
> >       Main PID: 1631 (slurmctld)
> >          Tasks: 10
> >         Memory: 4.0M
> >         CGroup: /system.slice/slurmctld.service
> >                 ├─1631 /usr/sbin/slurmctld -D -s
> >                 └─1818 slurmctld: slurmscriptd
> >
> >     Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: auth/munge:
> >     _print_cred: DECODED: Thu Dec 01 16:17:19 2022
> >     Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for
> >     out of sync clocks
> >     Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: error: Munge
> >     decode failed: Rewound credential
> >     Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: auth/munge:
> >     _print_cred: ENCODED: Fri Dec 02 16:16:55 2022
> >     Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: auth/munge:
> >     _print_cred: DECODED: Thu Dec 01 16:17:20 2022
> >     Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: error: Check for
> >     out of sync clocks
> >     Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: error: Munge
> >     decode failed: Rewound credential
> >     Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: auth/munge:
> >     _print_cred: ENCODED: Fri Dec 02 16:16:56 2022
> >     Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: auth/munge:
> >     _print_cred: DECODED: Thu Dec 01 16:17:21 2022
> >     Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: error: Check for
> >     out of sync clocks
> >
> >     When I run the following command on compute nodes I get the following
> >     output:
> >
> >     [gpu101 at 101 ~]$ *munge -n | unmunge*
> >
> >     STATUS:           Success (0)
> >     ENCODE_HOST:      ??? (0.0.0.101)
> >     ENCODE_TIME:      2022-12-02 16:33:38 +0500 (1669980818)
> >     DECODE_TIME:      2022-12-02 16:33:38 +0500 (1669980818)
> >     TTL:              300
> >     CIPHER:           aes128 (4)
> >     MAC:              sha1 (3)
> >     ZIP:              none (0)
> >     UID:              gpu101 (1000)
> >     GID:              gpu101 (1000)
> >     LENGTH:           0
> >
> >     Is this error because the ENCODE_HOST name shows question marks and
> >     munge does not pick up the IP correctly? How can I correct this? All
> >     the nodes stop responding when I run a job, even though I have all
> >     the clocks synced across the cluster.
> >
> >     I am new to Slurm. Kindly guide me in this matter.
>
>