[slurm-users] sacctmgr show runawayjobs fails with slurmdbd crash
Julien Rey
julien.rey at univ-paris-diderot.fr
Thu Dec 21 15:20:02 UTC 2023
Hello,
I'm sure this issue has been answered before, but I'm trying to clean up
runaway jobs with:
sacctmgr -vvvv show runawayjobs
I get a very (very) long list of records, and after a while the command
crashes with the following error messages:
sacctmgr: error: _conn_readable: persistent connection for fd 3 experienced error[104]: Connection reset by peer
sacctmgr: debug2: _slurm_connect: failed to connect to 127.0.1.1:6819: Connection refused
sacctmgr: debug2: Error connecting slurm stream socket at 127.0.1.1:6819: Connection refused
sacctmgr: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:master1:6819: Connection refused
sacctmgr: error: Getting response to message type: MsgType=1488
sacctmgr: error: Failed to fix runaway job: Unspecified error
The slurmdbd daemon also crashes (maybe I should increase the debug log level):
Dec 21 15:53:18 master1 systemd[1]: slurmdbd.service: main process exited, code=exited, status=1/FAILURE
Dec 21 15:53:18 master1 systemd[1]: Unit slurmdbd.service entered failed state.
Dec 21 15:53:18 master1 systemd[1]: slurmdbd.service failed.
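For reference, raising the log level would presumably look like the fragment below. This is only a sketch, assuming the standard DebugLevel option in slurmdbd.conf; the debug2 value is an illustrative choice and not taken from my current setup:

```
# slurmdbd.conf (fragment); the level chosen here is an assumption for illustration
DebugLevel=debug2
```

After restarting slurmdbd with this setting, the daemon's own log should hopefully contain more detail about what happens just before it exits.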
I'm running Slurm 21.08.8-2.
Not sure if this is related, but I tried increasing
innodb_buffer_pool_size to 32G in the MySQL configuration, without success.
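The change amounts to the fragment below, shown as a sketch: the [mysqld] section is where this InnoDB setting normally lives, but which configuration file holds it depends on the distribution and is deliberately left out here:

```
# MySQL/MariaDB server configuration (fragment); file location not shown
[mysqld]
innodb_buffer_pool_size=32G
```

A value set in the configuration file is read when the database server starts, so it only applies after the server has been restarted.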
Any help would be greatly appreciated.
--
Julien Rey
Plate-forme RPBS
Unité BFA - CMPLI
Université de Paris
tel: 01 57 27 83 95