[slurm-users] How to tell SLURM to ignore specific GPUs

Paul Raines raines at nmr.mgh.harvard.edu
Thu Feb 3 15:13:05 UTC 2022


On Thu, 3 Feb 2022 1:30am, Stephan Roth wrote:

>
> On 02.02.22 18:32, Michael Di Domenico wrote:
>>  On Mon, Jan 31, 2022 at 3:57 PM Stephan Roth <stephan.roth at ee.ethz.ch>
>>  wrote:
>>>  The problem is to identify the cards physically from the information we
>>>  have, like what's reported with nvidia-smi or available in
>>>  /proc/driver/nvidia/gpus/*/information
>>>  The serial number isn't shown for every type of GPU and I'm not sure the
>>>  ones shown match the stickers on the GPUs.
>>>  If anybody were to know of a practical solution for this, I'd be happy
>>>  to read it.
>>
>>  I hadn't seen this /proc driver reference before.  Checking a few of my
>>  A100s and V100s and some offhand Quadro cards, I don't see the
>>  serial number for any of them in /proc.  That's a pity, since it would be
>>  pretty handy.  Does anyone know which cards do support this?  I wonder
>>  if there's some obscure something or other that needs to be turned on
>>  to dump out the serial number in /proc instead of running nvidia-smi.
>
> Sorry, I didn't state clearly what I was referring to.
> I never saw the serial number in /proc/driver/nvidia/gpus/*/information,
> only in the output of nvidia-smi, and even there it was sometimes empty:
>
> nvidia-smi -q |grep -E '^\s+Serial Number\s+:'
>     Serial Number                   : N/A
>
> Stephan


That works fine on my boxes:

[root at rtx-04 ~]# nvidia-smi -q -i 0 | grep Serial
     Serial Number                         : 1321720....
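
For checking every card on a node at once, a minimal sketch using
nvidia-smi's CSV query mode might look like the following. The fields
index, pci.bus_id, serial and uuid are the ones listed by
nvidia-smi --help-query-gpu; the function names are made up for
illustration, and the placeholder printed for an unreadable serial
(assumed here to be N/A or [N/A]) may differ between driver versions.

```shell
# Print one CSV line per GPU: index, PCI bus ID, board serial, UUID.
list_gpu_serials() {
    nvidia-smi --query-gpu=index,pci.bus_id,serial,uuid --format=csv,noheader
}

# Read that CSV on stdin and print "index bus_id" for every GPU whose
# serial field is a placeholder. The placeholder strings are an assumption.
missing_serials() {
    awk -F', ' '$3 == "N/A" || $3 == "[N/A]" { print $1, $2 }'
}

# Example (run on a GPU node):
#   list_gpu_serials | missing_serials
```

The PCI bus ID in the output is one more handle for matching a card to a
physical slot when the serial is not reported.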



