All good ideas, Mick.
- I've restarted slurmd on all nodes, with no effect
- Ran this on all nodes:
#!/bin/bash
uname -n
id slurm
id 59999
scontrol show config | grep SlurmUser
On every node, all of these show slurm as UID 59999.
- firewalld already has the internal network interface in use assigned to the trusted zone
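
The per-node script above only covers the SlurmUser that slurm.conf reports through scontrol. slurmdbd reads its own SlurmUser from slurmdbd.conf, and the two are expected to name the same user, so a comparison of the two files on the database host may be worth adding. The following is only a sketch under assumptions: the helper names conf_value and same_slurm_user are made up for illustration, and the file paths in the usage line are guesses at a typical install, not paths confirmed for this cluster.

```shell
#!/bin/bash
# Hypothetical helpers (not part of Slurm): compare the SlurmUser setting
# in two Slurm-style "Key=Value" config files.

# Print the first value of KEY found in FILE.
# Key matching is case-insensitive; full-line and trailing "#" comments are ignored.
conf_value() {
    local key="$1" file="$2"
    awk -F= -v k="$key" '
        /^[ \t]*#/ { next }                      # skip comment lines
        {
            name = $1
            gsub(/[ \t]/, "", name)              # strip blanks around the key
            if (tolower(name) == tolower(k)) {
                val = $2
                sub(/[ \t]*#.*$/, "", val)       # drop a trailing comment
                gsub(/^[ \t]+|[ \t]+$/, "", val) # trim the value
                print val
                exit
            }
        }
    ' "$file"
}

# Print the SlurmUser found in each file; succeed only if both files
# set it and the two values are identical.
same_slurm_user() {
    local a b
    a=$(conf_value SlurmUser "$1")
    b=$(conf_value SlurmUser "$2")
    echo "$1: SlurmUser=${a:-<unset>}"
    echo "$2: SlurmUser=${b:-<unset>}"
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

Usage, with assumed paths: same_slurm_user /etc/slurm/slurm.conf /etc/slurm/slurmdbd.conf && echo match || echo MISMATCH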
Setting slurmctld to debug level gives a bit more info, but I'm not sure what to make of it TBH. In particular, I don't know what "_handle_mult_rc_ret: PERSIST_RC is 2002 from DBD_SEND_MULT_MSG(1474)" is trying to tell me.
Jan 10 10:48:26 kirby slurmctld[461138]: slurmctld: error: DBD_SEND_MULT_JOB_START message from invalid uid
Jan 10 10:48:31 kirby slurmctld[461138]: slurmctld: error: DBD_SEND_MULT_JOB_START message from invalid uid
Jan 10 10:48:35 kirby slurmctld[461138]: slurmctld: debug: accounting_storage/slurmdbd: _handle_mult_rc_ret: PERSIST_RC is 2002 from DBD_SEND_MULT_MSG(1474): DBD_SEND_MULT_MSG message from invalid uid
Jan 10 10:48:36 kirby slurmctld[461138]: slurmctld: error: DBD_SEND_MULT_JOB_START message from invalid uid
Jan 10 10:48:41 kirby slurmctld[461138]: slurmctld: error: DBD_SEND_MULT_JOB_START message from invalid uid
Jan 10 10:48:46 kirby slurmctld[461138]: slurmctld: error: DBD_SEND_MULT_JOB_START message from invalid uid
Jan 10 10:48:46 kirby slurmctld[461138]: slurmctld: debug: sched/backfill: _attempt_backfill: beginning
Jan 10 10:48:46 kirby slurmctld[461138]: slurmctld: debug: sched/backfill: _attempt_backfill: no jobs to backfill
Jan 10 10:48:51 kirby slurmctld[461138]: slurmctld: error: DBD_SEND_MULT_JOB_START message from invalid uid
Jan 10 10:48:53 kirby slurmctld[461138]: slurmctld: debug: accounting_storage/slurmdbd: _handle_mult_rc_ret: PERSIST_RC is 2002 from DBD_SEND_MULT_MSG(1474): DBD_SEND_MULT_MSG message from invalid uid
Jan 10 10:48:56 kirby slurmctld[461138]: slurmctld: error: DBD_SEND_MULT_JOB_START message from invalid uid
Jan 10 10:49:01 kirby slurmctld[461138]: slurmctld: error: DBD_SEND_MULT_JOB_START message from invalid uid
Jan 10 10:49:06 kirby slurmctld[461138]: slurmctld: error: DBD_SEND_MULT_JOB_START message from invalid uid
Jan 10 10:49:11 kirby slurmctld[461138]: slurmctld: debug: accounting_storage/slurmdbd: _handle_mult_rc_ret: PERSIST_RC is 2002 from DBD_SEND_MULT_MSG(1474): DBD_SEND_MULT_MSG message from invalid uid
Jan 10 10:49:11 kirby slurmctld[461138]: slurmctld: error: DBD_SEND_MULT_JOB_START message from invalid uid
Jan 10 10:49:16 kirby slurmctld[461138]: slurmctld: error: DBD_SEND_MULT_JOB_START message from invalid uid
Jan 10 10:49:17 kirby slurmctld[461138]: slurmctld: debug: sched: Running job scheduler for full queue.
Jan 10 10:49:21 kirby slurmctld[461138]: slurmctld: error: DBD_SEND_MULT_JOB_START message from invalid uid
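
To see at a glance which message types are being rejected and how often, the log above can be tallied. This is a minimal sketch: count_invalid_uid is a made-up helper name, and it assumes the log text arrives on standard input in the same format as the lines pasted above.

```shell
#!/bin/bash
# Hypothetical helper: read slurmctld log lines on stdin and count the
# "message from invalid uid" rejections per DBD message type.
count_invalid_uid() {
    grep -o 'DBD_[A-Z_]* message from invalid uid' |   # keep only the rejected message
        awk '{ n[$1]++ } END { for (m in n) print n[m], m }' |
        sort -rn                                       # most frequent first
}
```

Usage, assuming the daemon logs to the journal under a unit named slurmctld: journalctl -u slurmctld --since "10:48" | count_invalid_uid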
Another possible clue: while "sacctmgr list Account" and "sacctmgr list user" show the expected accounts and users, "sreport user top start=12/1/23" and "sreport cluster utilization start=12/1/23" both report empty tables.
Craig Stark, Ph.D.
Professor, Department of Neurobiology and Behavior
Director, Facility for Imaging and Brain Research (FIBRE)
Director, Campus Center for Neuroimaging (CCNI)
School of Biological Sciences, University of California, Irvine
cestark(a)uci.edu