[slurm-users] sacctmgr show runawayjobs fails with slurmdbd crash

Julien Rey julien.rey at univ-paris-diderot.fr
Thu Dec 21 15:20:02 UTC 2023


Hello,

I'm sure this issue has been answered before, but I'm trying to clean up 
runaway jobs with:


sacctmgr -vvvv show runawayjobs


I get a very (very) long list of records and after a while the command 
crashes with the following error message:


sacctmgr: error: _conn_readable: persistent connection for fd 3 experienced error[104]: Connection reset by peer
sacctmgr: debug2: _slurm_connect: failed to connect to 127.0.1.1:6819: Connection refused
sacctmgr: debug2: Error connecting slurm stream socket at 127.0.1.1:6819: Connection refused
sacctmgr: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:master1:6819: Connection refused
sacctmgr: error: Getting response to message type: MsgType=1488
sacctmgr: error: Failed to fix runaway job: Unspecified error


The slurmdbd daemon also crashes (maybe I should increase the debug log level):


Dec 21 15:53:18 master1 systemd[1]: slurmdbd.service: main process exited, code=exited, status=1/FAILURE
Dec 21 15:53:18 master1 systemd[1]: Unit slurmdbd.service entered failed state.
Dec 21 15:53:18 master1 systemd[1]: slurmdbd.service failed.
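
The systemd lines above only show the exit status, not why slurmdbd exited. 
One way to capture more detail might be to run the daemon in the foreground 
with extra verbosity while reproducing the crash. This is only a sketch: the 
function name debug_slurmdbd is made up for illustration, and it assumes 
slurmdbd is managed by systemd as the log suggests.

```shell
# Hypothetical helper (name is illustrative, not part of Slurm).
# Assumes a systemd-managed slurmdbd and sufficient privileges to stop it.
debug_slurmdbd() {
    # Stop the service first so port 6819 is free for the foreground instance.
    systemctl stop slurmdbd
    # -D keeps slurmdbd in the foreground; each -v raises the log verbosity.
    slurmdbd -D -vvvv
}
```

With debug_slurmdbd running in one terminal, rerunning the sacctmgr command 
in another should print whatever slurmdbd logs just before it exits.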


I'm running slurm 21.08.8-2.


Not sure if this is related, but I tried increasing 
innodb_buffer_pool_size to 32G in the MySQL configuration, without success.
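
To rule out the setting simply not being picked up, the value the server is 
actually using can be queried. A minimal sketch, assuming the mysql client 
can log in with local credentials (for example via ~/.my.cnf); the function 
name show_innodb_pool_size is only illustrative:

```shell
# Hypothetical helper (name is illustrative).
# Assumes the mysql client is installed and can authenticate without prompting.
show_innodb_pool_size() {
    # Prints the value currently in effect on the server, in bytes.
    mysql -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size';"
}
```

If the value printed is not the expected number of bytes for 32G, the server 
is probably reading a different configuration file or was not restarted.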


Any help would be greatly appreciated.


-- 
Julien Rey

Plate-forme RPBS
Unité BFA - CMPLI
Université de Paris
tel: 01 57 27 83 95



