[slurm-users] SLURM 22.05 and NHC in prolog/epilog

Bas van der Vlies bas.vandervlies at surf.nl
Fri Aug 5 10:13:59 UTC 2022


We are testing slurm 22.05 and noticed a behaviour change in the 
prolog/epilog scripts. We use NHC in the prolog/epilog to check whether a 
node is healthy. With previous versions (21.08.X and earlier) we had no 
problems.
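
For reference, the wiring is roughly like the sketch below. The paths and 
the PrologFlags value are illustrative, not necessarily our exact settings; 
the nhc command line is the one visible in the process listing further down.

```
# slurm.conf (excerpt) -- paths and flags are illustrative
Prolog=/etc/slurm/prolog.sh
Epilog=/etc/slurm/epilog.sh
# With Alloc the prolog runs at job allocation time, so srun waits for the
# nodes to become ready before launching the step.
PrologFlags=Alloc
```

The prolog script itself is just a thin wrapper around NHC:

```
#!/bin/bash
# /etc/slurm/prolog.sh -- illustrative wrapper; the real script may differ.
# A non-zero exit status here makes slurmd drain the node.
exec /usr/sbin/nhc -f FORCE_SETSID=0
```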

Now when we run srun:
  *  srun -t 1 hostname
```
srun: job 3975 queued and waiting for resources
srun: job 3975 has been allocated resources
srun: error: Nodes r16n19 are still not ready
srun: error: Something is wrong with the boot of the nodes.

```


On the node itself (r16n19) the nhc shells have been reparented to PID 1, 
while slurmd (PID 20274) is left with a defunct [nhc] child:

```
11:57 r16n19:/tmp
root# ps -eaf | grep nhc
root       22228   22185  0 Aug03 pts/3    00:00:00 tail -f nhc.log
root       50250   20274  0 11:57 ?        00:00:00 [nhc] <defunct>
root       50259       1  0 11:57 ?        00:00:00 /bin/bash /usr/sbin/nhc -f FORCE_SETSID=0
root       50268       1  0 11:57 ?        00:00:00 /bin/bash /usr/sbin/nhc -f FORCE_SETSID=0
root       50331   48699  0 11:57 pts/5    00:00:00 grep --color=auto nhc

11:57 r16n19:/tmp
root# ps -eaf | grep 20274
root       20274       1  0 Aug03 ?        00:00:01 /opt/slurm/sw/current/sbin/slurmd -D
root       50250   20274  0 11:57 ?        00:00:00 [nhc] <defunct>
root       50339   48699  0 11:57 pts/5    00:00:00 grep --color=auto 20274
```
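
While this is happening the node state can be inspected with something like 
the commands below (the slurmd log path is an example and depends on the 
installation):

```
# Node state and reason as seen by the controller
scontrol show node r16n19 | grep -Ei 'state|reason'

# What slurmd logged about the prolog (log path is an example)
grep -i prolog /var/log/slurm/slurmd.log | tail
```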


Have other sites also seen this problem? Did I miss an option?

Regards


--
Bas van der Vlies
| High Performance Computing & Visualization | SURF| Science Park 140 | 
1098 XG  Amsterdam
| T +31 (0) 20 800 1300  | bas.vandervlies at surf.nl | www.surf.nl |


