<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <p>Alright,</p>
    <p>I tried a few more things, but I still wasn't able to get past:
      srun: error: Unable to allocate resources: Invalid generic
      resource (gres) specification.</p>
    <p>I should mention that the node I am testing GPU scheduling on
      doesn't actually have a GPU, but Rob kindly pointed out that
      you do not need one as long as gres.conf points to a file in
      /dev/. As mentioned, this is just for testing purposes: in the
      end we will run this on a node with a GPU, but it is not
      available at the moment.</p>
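    <p>For reference, here is a minimal sketch of the test setup I mean.
      The device path /dev/tty0 is only an assumed placeholder for the
      dummy file (Rob's suggestion was that a dummy file is enough for
      testing); the node name and GRES count are the ones from my first
      mail:</p>
    <pre># slurm.conf (excerpt)
GresTypes=gpu
NodeName=test Gres=gpu:1

# gres.conf on the node (excerpt)
# File= points at a placeholder device, not a real GPU
# (assumption: /dev/tty0 exists on the node)
NodeName=test Name=gpu File=/dev/tty0</pre>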
    <p><b>The error isn't changing</b></p>
    <p>If I omit "GresTypes=gpu" and "Gres=gpu:1", I still get the same
      error.</p>
    <p><b>Debug Info</b></p>
    <p>I added the gpu debug flag and logged the following:</p>
    <p>[2023-07-18T14:59:45.026] restoring original state of nodes<br>
      [2023-07-18T14:59:45.026] select/cons_tres:
      part_data_create_array: select/cons_tres: preparing for 2
      partitions<br>
      [2023-07-18T14:59:45.026] error: GresPlugins changed from (null)
      to gpu ignored<br>
      [2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to
      change GresPlugins<br>
      [2023-07-18T14:59:45.026] read_slurm_conf: backup_controller not
      specified<br>
      [2023-07-18T14:59:45.026] error: GresPlugins changed from (null)
      to gpu ignored<br>
      [2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to
      change GresPlugins<br>
      [2023-07-18T14:59:45.026] select/cons_tres: select_p_reconfigure:
      select/cons_tres: reconfigure<br>
      [2023-07-18T14:59:45.027] select/cons_tres:
      part_data_create_array: select/cons_tres: preparing for 2
      partitions<br>
      [2023-07-18T14:59:45.027] No parameter for mcs plugin, default
      values set<br>
      [2023-07-18T14:59:45.027] mcs: MCSParameters = (null). ondemand
      set.<br>
      [2023-07-18T14:59:45.028] _slurm_rpc_reconfigure_controller:
      completed usec=5898<br>
      [2023-07-18T14:59:45.952]
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2</p>
    <p>I am a bit unsure what to do next to further investigate this
      issue.</p>
    <p>Best regards,<br>
      Xaver<br>
    </p>
    <div class="moz-cite-prefix">On 17.07.23 15:57, Groner, Rob wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:BL0PR02MB4499D0148BD114169F5D89AC803BA@BL0PR02MB4499.namprd02.prod.outlook.com">
      <div class="elementToProof">
        That would certainly do it.  If you look at the slurmctld log
        when it comes up, it will say that it's marking that node as
        invalid because it has fewer (0) gres resources than you say it
        should have.  That's because slurmd on that node will come up
        and say "What gres resources??"</div>
      <div class="elementToProof">
        <br>
      </div>
      <div class="elementToProof">
        For testing purposes,  you can just create a dummy file on the
        node, then in gres.conf, point to that file as the "graphics
        file" interface.  As long as you don't try to actually use it as
        a graphics file, that should be enough for that node to think it
        has gres/gpu resources.  That's what I do in my vagrant slurm
        cluster.</div>
      <div class="elementToProof">
        <br>
      </div>
      <div class="elementToProof">
        Rob</div>
      <div class="elementToProof">
        <br>
      </div>
      <hr tabindex="-1">
      <div id="divRplyFwdMsg" dir="ltr"><b>From:</b> slurm-users
        <a class="moz-txt-link-rfc2396E" href="mailto:slurm-users-bounces@lists.schedmd.com">&lt;slurm-users-bounces@lists.schedmd.com&gt;</a> on behalf of Xaver
        Stiensmeier <a class="moz-txt-link-rfc2396E" href="mailto:xaverstiensmeier@gmx.de">&lt;xaverstiensmeier@gmx.de&gt;</a><br>
        <b>Sent:</b> Monday, July 17, 2023 9:43 AM<br>
        <b>To:</b> <a class="moz-txt-link-abbreviated" href="mailto:slurm-users@lists.schedmd.com">slurm-users@lists.schedmd.com</a>
        <a class="moz-txt-link-rfc2396E" href="mailto:slurm-users@lists.schedmd.com">&lt;slurm-users@lists.schedmd.com&gt;</a><br>
        <b>Subject:</b> Re: [slurm-users] GRES and GPUs
        <div> </div>
      </div>
      <div class="BodyFragment"><span>
          <div class="PlainText">Hi Hermann,<br>
            <br>
            Good idea, but we are already using
            `SelectType=select/cons_tres`. After<br>
            setting everything up again (in case I made an unnoticed
            mistake), I saw<br>
            that the node got marked STATE=inval.<br>
            <br>
            To be honest, I thought I could just claim that a node has a
            gpu even if<br>
            it doesn't have one - just for testing purposes. Could this
            be the issue?<br>
            <br>
            Best regards,<br>
            Xaver Stiensmeier<br>
            <br>
            On 17.07.23 14:11, Hermann Schwärzler wrote:<br>
            > Hi Xaver,<br>
            ><br>
            > what kind of SelectType are you using in your
            slurm.conf?<br>
            ><br>
            > Per <a href="https://slurm.schedmd.com/gres.html"
              moz-do-not-send="true">https://slurm.schedmd.com/gres.html</a>
            you have to consider:<br>
            > "As for the --gpu* option, these options are only
            supported by Slurm's<br>
            > select/cons_tres plugin."<br>
            ><br>
            > So you can use "--gpus ..." only when you state<br>
            > SelectType              = select/cons_tres<br>
            > in your slurm.conf.<br>
            ><br>
            > But "--gres=gpu:1" should work always.<br>
            ><br>
            > Regards<br>
            > Hermann<br>
            ><br>
            ><br>
            > On 7/17/23 13:43, Xaver Stiensmeier wrote:<br>
            >> Hey,<br>
            >><br>
            >> I am currently trying to understand how I can
            schedule a job that<br>
            >> needs a GPU.<br>
            >><br>
            >> I read about GRES <a
              href="https://slurm.schedmd.com/gres.html"
              moz-do-not-send="true">https://slurm.schedmd.com/gres.html</a>
            and tried to use:<br>
            >><br>
            >> GresTypes=gpu<br>
            >> NodeName=test Gres=gpu:1<br>
            >><br>
            >> But calling - after a 'sudo scontrol reconfigure':<br>
            >><br>
            >> srun --gpus 1 hostname<br>
            >><br>
            >> didn't work:<br>
            >><br>
            >> srun: error: Unable to allocate resources: Invalid
            generic resource<br>
            >> (gres) specification<br>
            >><br>
            >> so I read more <a
              href="https://slurm.schedmd.com/gres.conf.html"
              moz-do-not-send="true">https://slurm.schedmd.com/gres.conf.html</a>
            but that<br>
            >> didn't really help me.<br>
            >><br>
            >><br>
            >> I am rather confused. GRES claims to be generic
            resources but then it<br>
            >> comes with three defined resources (GPU, MPS, MIG)
            and using one of<br>
            >> those didn't work in my case.<br>
            >><br>
            >> Obviously, I am misunderstanding something, but I
            am unsure where to<br>
            >> look.<br>
            >><br>
            >><br>
            >> Best regards,<br>
            >> Xaver Stiensmeier<br>
            >><br>
            ><br>
            <br>
          </div>
        </span></div>
    </blockquote>
  </body>
</html>