[slurm-users] Nodes are down after 2-3 minutes.

Eric F. Alemany ealemany at stanford.edu
Wed May 9 09:02:36 MDT 2018


Good Morning (at least for those on the West coast of the US)

My nodes are no longer “down”:

eric at radoncmaster:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      4   idle radonc[01-04]


I think the NTP configuration did the trick. As Chris wrote:

"So one possibility there is that the clocks are out of step between the nodes.
Usually that's configured via NTP to have a common reference source."
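
For anyone who wants to check for this before changing anything, here is a rough sketch of comparing clocks between the head node and the compute nodes. It assumes passwordless ssh to each node and a `date` that supports `+%s`; the function names `skew_seconds` and `check_nodes` are made up for this example. The ssh round trip adds a little to the reported value, so treat a second or two as noise.

```shell
#!/bin/sh
# Print the absolute difference between two epoch timestamps, in seconds.
skew_seconds() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - diff ))
    fi
    echo "$diff"
}

# Compare the local clock with the clock on each node given as an argument.
# Assumption: passwordless ssh to every node works from this machine.
check_nodes() {
    for node in "$@"; do
        remote=$(ssh "$node" date +%s) || { echo "$node: unreachable"; continue; }
        local_now=$(date +%s)
        echo "$node: $(skew_seconds "$local_now" "$remote")s"
    done
}

# Example call (node names taken from the sinfo output above):
#   check_nodes radonc01 radonc02 radonc03 radonc04
```

A large difference on any node would point at the clocks rather than at Slurm itself.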

After configuring NTP, I had to reboot the nodes, then restarted and enabled ntp.
I restarted slurmd on all execute nodes and slurmctld on the head node/master.
Then I ran:
scontrol update nodename=radonc[01-04] state=UNDRAIN
scontrol update nodename=radonc[01-04] state=IDLE
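
The steps above could be collected into a small script along these lines. This is only a sketch under some assumptions that are not confirmed in this thread: systemd units named `ntp`, `slurmd` and `slurmctld`, passwordless ssh with sudo rights on the nodes, and the helper names `run` and `recover_nodes`, which are invented here. It leaves out the reboot and prints the commands instead of running them unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Assumptions: unit names ntp/slurmd/slurmctld, passwordless ssh, sudo rights.
NODES="radonc01 radonc02 radonc03 radonc04"
NODELIST="radonc[01-04]"

# Run a command, or only print it when DRY_RUN is 1 (the default here).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

recover_nodes() {
    # On every execute node: enable and restart ntp, then restart slurmd.
    for node in $NODES; do
        run ssh "$node" sudo systemctl enable ntp
        run ssh "$node" sudo systemctl restart ntp
        run ssh "$node" sudo systemctl restart slurmd
    done
    # On the head node: restart the controller, then clear the node states.
    run sudo systemctl restart slurmctld
    run scontrol update "nodename=$NODELIST" state=UNDRAIN
    run scontrol update "nodename=$NODELIST" state=IDLE
}
```

Calling `recover_nodes` as-is only lists what would be run; `DRY_RUN=0 recover_nodes` would actually execute the commands.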

All seems good for now.


Thank you everyone for your help. I learned a lot through everyone’s comments, tips and advice.

I look forward to my post-docs running their jobs. I am certain that I will have more questions by then.

Again, I greatly appreciate everyone’s help.

Cheers,
Eric


_____________________________________________________________________________________________________

Eric F.  Alemany
System Administrator for Research

Division of Radiation & Cancer  Biology
Department of Radiation Oncology

Stanford University School of Medicine
Stanford, California 94305

Tel: 1-650-498-7969  No Texting
Fax: 1-650-723-7382



On May 7, 2018, at 5:30 PM, Chris Samuel <chris at csamuel.org> wrote:

On Tuesday, 8 May 2018 9:40:53 AM AEST Eric F. Alemany wrote:

I followed the link as well as the instruction on “Securing the
installation” and “Testing the installation”

Great.

The only thing that i am not able to do is:  Check if a credential can be
remotely decoded

So one possibility there is that the clocks are out of step between the nodes.
Usually that's configured via NTP to have a common reference source.

That's pretty standard as if you're running an HPC system with a distributed
filesystem like GPFS or Lustre then you need the clocks in lockstep for it to
function properly.

Good luck!
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC


