[slurm-users] GPU / cgroup challenges

Christopher Samuel chris at csamuel.org
Tue May 1 18:29:16 MDT 2018


On 02/05/18 10:15, R. Paul Wiegand wrote:

> Yes, I am sure they are all the same.  Typically, I just scontrol 
> reconfig; however, I have also tried restarting all daemons.

Understood. Any diagnostics in the slurmd logs when trying to start
a GPU job on the node?

> We are moving to 7.4 in a few weeks during our downtime.  We had a
> QDR -> OFED version constraint -> Lustre client version constraint
> issue that delayed our upgrade.

I feel your pain.  BTW, RHEL 7.5 is out now, so you'll need that if
you want current security fixes.

> Should I just wait and test after the upgrade?

Well, 17.11.6 will be out by then, and it will include a fix for a
deadlock that some sites hit occasionally, so that will be worth
throwing into the mix too.  Do read the RELEASE_NOTES carefully though,
especially if you're using slurmdbd!

All the best,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC


