[slurm-users] DBD_SEND_MULT_MSG - invalid uid error

Craig Stark cestark at ad.uci.edu
Mon Jan 8 22:46:36 UTC 2024


This ticket with SchedMD implies it's a munged issue:

https://bugs.schedmd.com/show_bug.cgi?id=1293

Is the munge daemon running on all systems? If it is, are all servers running a network time daemon such as chronyd or ntpd, and is the time in sync on all hosts?

Thanks Mick,

munge appears to be running on all systems (per systemctl status munge). I do get a warning about the munge file changing on disk, but I'm pretty sure that's from warewulf syncing files every minute. Running sha256sum on the munge.key file on the compute nodes and on the host node gives the same digest, so I think I can put that aside.
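For anyone who wants to script the same key comparison, here is a minimal Python sketch of the idea. The helper names (sha256_of, mismatched_hosts) are made up for illustration, and collecting the digests from each node (for example over ssh) is assumed to happen elsewhere.

```python
import hashlib


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_hosts(digests, reference_host):
    """Given {host: digest}, list the hosts whose digest differs from the reference host."""
    ref = digests[reference_host]
    return sorted(host for host, d in digests.items() if d != ref)
```

An empty list from mismatched_hosts means every node holds the same key bytes as the reference host. That only shows the files match; it does not show that munged has loaded the key, so an end-to-end check such as "munge -n | ssh <node> unmunge" is still worth running.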

The management node runs chrony and the compute nodes sync to the management node.
[root@kirby uber]# chronyc tracking
Reference ID    : 4A06A849 (t2.time.gq1.yahoo.com)
Stratum         : 3
Ref time (UTC)  : Mon Jan 08 22:26:44 2024
System time     : 0.000032525 seconds slow of NTP time
Last offset     : -0.000021390 seconds
RMS offset      : 0.000055729 seconds
Frequency       : 38.797 ppm slow
Residual freq   : +0.001 ppm
Skew            : 0.018 ppm
Root delay      : 0.033342984 seconds
Root dispersion : 0.000524800 seconds
Update interval : 256.8 seconds
Leap status     : Normal

vs
[root@sonic01 ~]# chronyc tracking
Reference ID    : C0A80102 (warewulf)
Stratum         : 4
Ref time (UTC)  : Mon Jan 08 22:31:02 2024
System time     : 0.000000120 seconds slow of NTP time
Last offset     : -0.000000092 seconds
RMS offset      : 0.000014737 seconds
Frequency       : 47.495 ppm slow
Residual freq   : +0.000 ppm
Skew            : 0.066 ppm
Root delay      : 0.033458963 seconds
Root dispersion : 0.000283949 seconds
Update interval : 64.2 seconds
Leap status     : Normal

So, the compute node is talking to the host, and the host is talking to generic NTP sources. "date" shows the same time on the compute nodes.
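To compare these numbers across many nodes rather than by eye, the "System time" line of the chronyc tracking output can be parsed. This is a rough Python sketch. The function name and the sign convention (negative means slow of NTP time) are my own choices, not anything chrony defines.

```python
import re


def system_time_offset(tracking_text):
    """Return the 'System time' offset in seconds from `chronyc tracking` output.

    Sign convention (an assumption of this sketch): negative = slow, positive = fast.
    """
    match = re.search(
        r"^System time\s*:\s*([0-9.]+) seconds (slow|fast) of NTP time",
        tracking_text,
        re.MULTILINE,
    )
    if match is None:
        raise ValueError("no 'System time' line found")
    value = float(match.group(1))
    return -value if match.group(2) == "slow" else value
```

Fed the two outputs above, this gives about -0.0000325 s for kirby and about -0.00000012 s for sonic01. Each value is relative to that host's own reference (the yahoo time server for kirby, the warewulf host for sonic01), so they are not a direct node-to-node difference, but both are far below one second.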