[slurm-users] GPU / cgroup challenges
Christopher Samuel
chris at csamuel.org
Tue May 1 18:29:16 MDT 2018
On 02/05/18 10:15, R. Paul Wiegand wrote:
> Yes, I am sure they are all the same. Typically, I just scontrol
> reconfig; however, I have also tried restarting all daemons.
Understood. Any diagnostics in the slurmd logs when trying to start
a GPU job on the node?
> We are moving to 7.4 in a few weeks during our downtime. We had a
> QDR -> OFED version constraint -> Lustre client version constraint
> issue that delayed our upgrade.
I feel your pain. BTW, RHEL 7.5 is out now, so you'll need that if
you need current security fixes.
> Should I just wait and test after the upgrade?
Well, 17.11.6 will be out by then, and it will include a fix for a
deadlock that some sites hit occasionally, so that will be worth throwing
into the mix too. Do read the RELEASE_NOTES carefully though,
especially if you're using slurmdbd!
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC