[slurm-users] How should I configure a node with Autodetect=nvml?
Dean Schulze
dean.w.schulze at gmail.com
Mon Feb 10 20:11:30 UTC 2020
In the gres.conf on one of my nodes I have just the line
Autodetect=nvml
as in the last example in https://slurm.schedmd.com/gres.conf.html.
In the slurm.conf on all nodes I have this line for the node that uses
Autodetect=nvml:
NodeName=slurmnode1 CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=47671 Gres=gpu:gp100:4
since that node can have up to 4 GPUs dynamically assigned. Without the
Gres=gpu:gp100:4 I can't run any job that requires a GPU, even if I
dynamically assign GPUs on that node. Apparently Autodetect=nvml alone isn't
enough to let the controller know that there are GPUs available on that
node.
With this configuration I get this message every second in my slurmctld.log
file:
error: _slurm_rpc_node_registration node=slurmnode1: Invalid argument
I've restarted both slurmd and slurmctld and still get the error. That
node also stays in the drain state no matter what I do with it. Apparently
Slurm doesn't like this configuration.
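For anyone trying to reproduce this, the checks below are a rough sketch of how the state detected on the node can be compared with what the controller has recorded. They assume the commands are run on slurmnode1 itself, and that the installed Slurm is recent enough to support slurmd -G (I believe that option arrived with 19.05). The resume step is left commented out, since it presumably only sticks once the registration error is gone.

```shell
#!/bin/sh
# Node name taken from the slurm.conf line above.
NODE=slurmnode1

if command -v slurmd >/dev/null 2>&1; then
    # Hardware as slurmd sees it, printed in slurm.conf NodeName= format.
    slurmd -C
    # GRES as slurmd resolves it from slurm.conf merged with gres.conf.
    slurmd -G
else
    echo "slurmd not found in PATH on this host"
fi

if command -v scontrol >/dev/null 2>&1; then
    # Controller's view of the node, including Gres= and the drain Reason=.
    scontrol show node "$NODE"
    # Reasons recorded for all down or drained nodes.
    sinfo -R
    # Once any mismatch is fixed, the node can be returned to service:
    # scontrol update NodeName="$NODE" State=RESUME
else
    echo "scontrol not found in PATH on this host"
fi
```

If the GPU count or type printed by slurmd -G differs from Gres=gpu:gp100:4, that mismatch would be my first suspect for the failed registration.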
What is the right way to configure a node with Autodetect=nvml?