We've updated Slurm to 23.11.6 and replaced MUNGE with SACK (auth/slurm).
Performance and stability have both been good, but we're occasionally
seeing the following in slurmctld.log:
[2024-05-07T03:50:16.638] error: decode_jwt: token expired at 1715053769
[2024-05-07T03:50:16.638] error: cred_p_unpack: decode_jwt() failed
[2024-05-07T03:50:16.638] error: Malformed RPC of type REQUEST_BATCH_JOB_LAUNCH(4005) received
[2024-05-07T03:50:16.641] error: slurm_receive_msg_and_forward: [[headnode.internal]:58286] failed: Header lengths are longer than data received
[2024-05-07T03:50:16.648] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
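
As a side note, the expiry epoch in the first message decodes to just
under a minute before that log entry (assuming the slurmctld clock is
on UTC; adjust for your timezone otherwise):

$ date -u -d @1715053769
Tue May  7 03:49:29 UTC 2024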
It seems to affect a subset of nodes: jobs on them get killed and no new
ones are allocated.
Full functionality can be restored simply by restarting slurmctld first
and then slurmd.
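
For reference, the recovery sequence is roughly the following (assuming
the stock systemd units shipped with the Slurm packages):

# on the headnode
systemctl restart slurmctld

# then on each affected compute node
systemctl restart slurmd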
Is the token actually expected to expire? I didn't see this possibility
mentioned in the docs.
The problem occurs on an R&D cloud cluster based on EL9,
with a pretty "flat" setup.
headnode: configless slurmctld, slurmdbd, mariadb, nfsd
elastic compute nodes: autofs, slurmd
/etc/slurm/slurm.conf (relevant excerpt):
AuthType=auth/slurm
AuthInfo=use_client_ids
CredType=cred/slurm
/etc/slurm/slurmdbd.conf (relevant excerpt):
AuthType=auth/slurm
AuthInfo=use_client_ids
Has anyone else encountered the same error?
Thanks,
Fabio