[slurm-users] "fatal: can't stat gres.conf"

Alex Chekholko alex at calicolabs.com
Thu Jul 26 14:21:56 MDT 2018


Hello all,

My error was indeed just the stray comma in my gres.conf.  I was confused
because the same file was on my running nodes, but those nodes only kept
working because slurmd had started before the erroneous comma was added to
the config.

So the error message was in fact literally correct: slurmd could not find a
device called "/dev/nvidia[0-1],CPUs="0-9"".
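
For illustration, here is a minimal sketch of the kind of gres.conf line
that produces this error. It is reconstructed from the device name in the
error message, not copied from the actual file, and Name=gpu is an
assumption.

```conf
# Broken (reconstructed, left commented out): the comma glues CPUs= onto the
# File= value, so slurmd tries to stat a device literally named
# /dev/nvidia[0-1],CPUs="0-9"
# Name=gpu File=/dev/nvidia[0-1],CPUs="0-9"

# Fixed: options on a gres.conf line are separated by whitespace
Name=gpu File=/dev/nvidia[0-1] CPUs=0-9
```

A node that is already running will not notice such a mistake until slurmd
is restarted and rereads the file.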

I have a separate question.  None of my GPUs are in "persistence mode", but
the users have not encountered any problems.  Reading through the docs, it
looks like the setting may have some minor effect on startup times.  Most of
our GPU jobs are long (many hours, sometimes days).  Do people tend to use
"persistence mode" for their GPU compute nodes?

Regards,
Alex

On Mon, Jul 23, 2018 at 7:35 PM Ryan Novosielski <novosirj at rutgers.edu>
wrote:

> > On Jul 23, 2018, at 10:31 PM, Ian Mortimer <i.mortimer at uq.edu.au> wrote:
> >
> > On Tue, 2018-07-24 at 02:19 +0000, Ryan Novosielski wrote:
> >
> >> Best off running nvidia-persistenced. Handles all of this stuff as a
> >> side effect, and also enables persistence mode, provided you don’t
> >> configure it otherwise.
> >
> > Yes.  But you have to ensure it starts before slurmd.
>
> While true, I don’t find I need to take any special precaution on my
> machines. Probably prudent to set a systemd dependency though.
>
> --
> ____
> || \\UTGERS,     |---------------------------*O*---------------------------
> ||_// the State  |         Ryan Novosielski - novosirj at rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
>      `'
>
>
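
The systemd dependency suggested in the quoted reply above could look
roughly like the drop-in below. This is only a sketch under assumptions: the
unit names slurmd.service and nvidia-persistenced.service are the commonly
packaged ones but may differ on your distribution, and the drop-in file name
is hypothetical.

```conf
# /etc/systemd/system/slurmd.service.d/nvidia.conf  (hypothetical file name)
[Unit]
# Order slurmd after the persistence daemon (unit name is an assumption)
After=nvidia-persistenced.service
# Pull the daemon in when slurmd starts, without making it a hard requirement
Wants=nvidia-persistenced.service
```

After adding a drop-in like this, run "systemctl daemon-reload" so systemd
picks it up. Wants= is used here rather than Requires= so that slurmd still
starts on nodes where the daemon is missing.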