[slurm-users] Header lengths are longer than data received after changing SelectType & GresTypes to use MPS

Robert Kudyba rkudyba at fordham.edu
Wed Apr 8 14:08:03 UTC 2020


On Wed, Apr 8, 2020 at 9:34 AM <dean.w.schulze at gmail.com> wrote:

> I believe in order to compile for nvml you'll have to compile on a system
> with an Nvidia gpu installed otherwise the Nvidia driver and libraries
> won't install on that system.
>

Yes, our 3 compute nodes have one V100 each, so I can run:
ssh node001
Last login: Tue Apr  7 17:30:16 2020
# module load shared
# module load nccl2-cuda10.1-gcc/2.5.6
Loading nccl2-cuda10.1-gcc/2.5.6
  Loading requirement: gcc5/5.5.0 cuda10.1/toolkit/10.1.243
nvidia-smi
Wed Apr  8 10:00:49 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   28C    P0    25W / 250W |      0MiB / 32510MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+


> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of
> Christopher Samuel
> > How can I get this to work by loading the correct Bright module?
>
> You can't - you will need to recompile Slurm.
>
> The error says:
>
> Apr 07 16:52:33 node001 slurmd[299181]: fatal: We were configured to
> autodetect nvml functionality, but we weren't able to find that lib when
> Slurm was configured.
>
> So when Slurm was built the libraries you are telling it to use now were
> not detected and so the configure script disabled that functionality as it
> would not otherwise have been able to compile.
>

But it's clearly there as noted in my previous reply. From
https://slurm.schedmd.com/gres.html#MPS_Management

"If AutoDetect=nvml is set in gres.conf, and the NVIDIA Management Library
(NVML) is installed on the node and was found during Slurm configuration,
configuration details will automatically be filled in for any
system-detected NVIDIA GPU. This removes the need to explicitly configure
GPUs in gres.conf, though the Gres= line in slurm.conf is still required in
order to tell slurmctld how many GRES to expect."
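
For reference, a minimal sketch of the setup that passage describes might look like the fragment below. It assumes one GPU per node, as in the nvidia-smi output above, and covers only the GPU GRES (not MPS). The node name comes from the session above; the remaining NodeName parameters (CPUs, memory, and so on) are deliberately left out, so this is not a complete configuration.

```
# gres.conf on the compute node
AutoDetect=nvml

# slurm.conf (GRES-related settings only; a real NodeName line also
# carries the node's CPU and memory parameters)
GresTypes=gpu
NodeName=node001 Gres=gpu:1
```

This only works if slurmd was built with NVML support, which is the point of the fatal error quoted above.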

So is there no way to have the configuration details "automatically be
filled in for any system-detected NVIDIA GPU"?
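
As the reply quoted above points out, what matters is whether NVML was found when Slurm was built, not whether it is present on the node now. One rough way to check an installed build is to look for the NVML GPU plugin in Slurm's plugin directory. The sketch below assumes the plugin file is named gpu_nvml.so and is only built when configure detects NVML; the default directory in the script is a hypothetical placeholder, not the path on this cluster.

```shell
#!/bin/sh
# Sketch: report whether a Slurm plugin directory contains the NVML GPU plugin.
# Assumption: the plugin file is named gpu_nvml.so and is only installed when
# NVML was detected at configure time.

has_nvml_plugin() {
    # $1: plugin directory to inspect; succeeds if the plugin file exists
    [ -f "$1/gpu_nvml.so" ]
}

# Hypothetical default; override with the PluginDir value from your site.
PLUGIN_DIR="${PLUGIN_DIR:-/usr/lib64/slurm}"

if has_nvml_plugin "$PLUGIN_DIR"; then
    echo "NVML GPU plugin found in $PLUGIN_DIR"
else
    echo "NVML GPU plugin not found in $PLUGIN_DIR"
fi
```

The actual directory can be read from the PluginDir line of `scontrol show config`. If the plugin is missing, AutoDetect=nvml cannot work until Slurm is rebuilt on a system where configure can see the NVML headers and library.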

Also the page says this:
"By default, all system-detected devices are added to the node. However, if
Type and File in gres.conf match a GPU on the system, any other properties
explicitly specified (e.g. Cores or Links) can be double-checked against
it. If the system-detected GPU differs from its matching GPU configuration,
then the GPU is omitted from the node with an error. This allows gres.conf
to serve as an optional sanity check and notifies administrators of any
unexpected changes in GPU properties."

How does "system-detected devices" work here? How can I get "Type and
File in gres.conf" to match a GPU on the system?
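
To make the question concrete, an explicit entry of the kind the quoted passage describes might look like the sketch below. Both Type=v100 and File=/dev/nvidia0 are assumptions for a node with a single V100; whether Type has to equal the name NVML reports or only part of it depends on the Slurm version, so the gres.conf man page for the installed release is the place to confirm.

```
# gres.conf on the compute node
AutoDetect=nvml

# Optional explicit entry, compared against what NVML detects.
# Type and File below are assumed values, not taken from this cluster.
Name=gpu Type=v100 File=/dev/nvidia0
```

With an entry like this, a mismatch between the configured and detected GPU would cause the GPU to be omitted with an error, as the documentation quoted above says.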