<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Hello,</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
We are running Slurm v19.05.5 and I am experimenting with the <b>scontrol reboot</b>
command. The compute nodes do reboot, but they are not returned to service; instead they remain down after the reboot.</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
navy55 1 debug* down 80 2:20:2 192000 0 2000 (null) Reboot ASAP : reboot</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
This is a diskful node, so it doesn't take long to reboot. For the sake of argument I have set ResumeTimeout to 1000 seconds, which is well over what is needed.</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<span>[root@navy51 slurm]# grep -i resume slurm.conf<br>
</span>
<div>ResumeTimeout=1000<br>
</div>
<div>[root@navy51 slurm]# grep -i return slurm.conf<br>
</div>
<div>ReturnToService=0<br>
</div>
<div>[root@navy51 slurm]# grep -i nhc slurm.conf<br>
</div>
<div># LBNL Node Health Check (NHC)<br>
</div>
<span>#HealthCheckProgram=/usr/sbin/nhc</span><br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<span><br>
</span></div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
For this experiment I have disabled the health checker, and I don't think setting ReturnToService=1 helps. Could anyone please advise? We are about to update the node firmware, and it
would be useful to ensure that the nodes are returned to service after they reboot.</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Best regards,</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
David</div>
</body>
</html>