[slurm-users] Nodes are down after 2-3 minutes.
Eric F. Alemany
ealemany at stanford.edu
Wed May 9 09:02:36 MDT 2018
Good Morning (at least for those on the West coast of the US)
My nodes are no longer “down”
eric@radoncmaster:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      4   idle radonc[01-04]
I think the NTP configuration did the trick.
> So one possibility there is that the clocks are out of step between the nodes.
> Usually that's configured via NTP to have a common reference source.
After configuring NTP, I had to reboot the nodes, then enabled and restarted ntp.
I restarted slurmd on all execute nodes and slurmctld on the head node/master.
Then I ran:
scontrol update nodename=radonc[01-04] state=UNDRAIN
scontrol update nodename=radonc[01-04] state=IDLE
All seems good for now.
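
For anyone finding this thread later, the steps above can be sketched as a small script. This is only a sketch under assumptions that are not stated in the thread: that the services are managed with systemctl, that the NTP unit is named "ntp" (some distributions use ntpd or chronyd), and the helper names run, restart_compute_services, restart_controller and resume_nodes are made up for illustration. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch of the recovery sequence; prints commands unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# To be run on each execute node.
# Assumption: systemd units named "ntp" and "slurmd".
restart_compute_services() {
    run systemctl enable ntp
    run systemctl restart ntp
    run systemctl restart slurmd
}

# To be run on the head node. Assumption: systemd unit named "slurmctld".
restart_controller() {
    run systemctl restart slurmctld
}

# Return the nodes to service once the daemons are back up.
resume_nodes() {
    run scontrol update "nodename=$1" state=UNDRAIN
    run scontrol update "nodename=$1" state=IDLE
}

resume_nodes "radonc[01-04]"
```

Running it with DRY_RUN=0 on the head node would execute the two scontrol commands instead of printing them.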
Thank you everyone for your help. I learned a lot through everyone’s comments, tips and advice.
I look forward to my post-docs running their jobs. I am certain that I will have more questions by then.
Again, I greatly appreciate everyone’s help.
Cheers,
Eric
_____________________________________________________________________________________________________
Eric F. Alemany
System Administrator for Research
Division of Radiation & Cancer Biology
Department of Radiation Oncology
Stanford University School of Medicine
Stanford, California 94305
Tel: 1-650-498-7969 (No Texting)
Fax: 1-650-723-7382
On May 7, 2018, at 5:30 PM, Chris Samuel <chris at csamuel.org> wrote:
On Tuesday, 8 May 2018 9:40:53 AM AEST Eric F. Alemany wrote:
I followed the link as well as the instruction on “Securing the
installation” and “Testing the installation”
Great.
The only thing that i am not able to do is: Check if a credential can be
remotely decoded
So one possibility there is that the clocks are out of step between the nodes.
Usually that's configured via NTP to have a common reference source.
That's pretty standard as if you're running an HPC system with a distributed
filesystem like GPFS or Lustre then you need the clocks in lockstep for it to
function properly.
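
A rough way to check for clock skew from the head node could look like the sketch below. It assumes passwordless ssh to each node; the function names abs_skew and check_nodes are hypothetical. The ssh round trip adds to the measured difference, so treat the result as an approximation only.

```shell
#!/bin/sh
# Absolute difference between two epoch timestamps, in seconds.
abs_skew() {
    d=$(( $1 - $2 ))
    if [ "$d" -lt 0 ]; then
        d=$(( 0 - $d ))
    fi
    echo "$d"
}

# Compare each node's clock with the local clock.
# Assumption: passwordless ssh to every node name passed in.
check_nodes() {
    for node in "$@"; do
        if remote=$(ssh "$node" date +%s); then
            echo "$node: $(abs_skew "$(date +%s)" "$remote")s skew"
        else
            echo "$node: unreachable"
        fi
    done
}
```

For example, `check_nodes radonc01 radonc02 radonc03 radonc04` would print one line per node. A difference of more than a few seconds suggests NTP is not yet in sync on that node.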
Good luck!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC