<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <p>Okay,</p>
    <p>thanks to S. Zhang I was able to figure out why nothing changed.
      While I did restart slurmctld at the beginning of my tests, I
      didn't do so later because I thought it was unnecessary, but it
      is right there in the fourth line of the log that a restart is
      needed. Somehow I misread it and assumed that slurmctld was
      restarted automatically. <br>
    </p>
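    <p>For anyone who overlooks the same hint: below is a minimal Python
      sketch that pulls the "Restart the ... daemon" messages out of a
      slurmctld log. The helper name restart_hints and the regular
      expression are my own assumptions, based only on the log lines
      quoted further down; other Slurm versions may word the message
      differently.</p>
    <pre>
```python
import re

# Pattern modelled on the two "Restart the ... daemon" lines quoted in this thread.
# Assumption: other Slurm versions may phrase this message differently.
RESTART_HINT = re.compile(r"error: Restart the (\w+) daemon to change (\w+)")


def restart_hints(log_text):
    """Return a sorted list of (daemon, setting) pairs the log says need a restart."""
    found = set()
    for line in log_text.splitlines():
        match = RESTART_HINT.search(line)
        if match:
            found.add((match.group(1), match.group(2)))
    return sorted(found)


# Three lines copied from the log excerpt in this thread.
SAMPLE = """\
[2023-07-18T14:59:45.026] error: GresPlugins changed from (null) to gpu ignored
[2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to change GresPlugins
[2023-07-18T14:59:45.026] read_slurm_conf: backup_controller not specified
"""

if __name__ == "__main__":
    for daemon, setting in restart_hints(SAMPLE):
        print(f"restart {daemon} to apply {setting}")
```
</pre>
    <p>For the sample lines this reports that slurmctld has to be
      restarted for GresPlugins to take effect, which is exactly the
      step I skipped after 'scontrol reconfigure'.</p>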
    <p>Given the setup:</p>
    <p>slurm.conf<br>
      ...<br>
      GresTypes=gpu<br>
      NodeName=NName SocketsPerBoard=8 CoresPerSocket=1 RealMemory=8000
      GRES=gpu:1 State=UNKNOWN<br>
      ...</p>
    <p>gres.conf<br>
      NodeName=NName Name=gpu File=/dev/tty0<br>
    </p>
    <p>When restarting, I get the following error:</p>
    <p>error: Setting node NName state to INVAL with reason:gres/gpu
      count reported lower than configured (0 &lt; 1)</p>
    <p>So it is still not working, but at least I get a more helpful log
      message. Because I know that this /dev/tty trick works, I am still
      unsure where the current error lies, but I will try to investigate
      it further. I am thankful for any ideas in that regard.</p>
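    <p>A rough way to rule out a plain mismatch between the two files is
      to compare them as text. The following Python sketch does only
      that; the function names configured_gpus and gres_conf_gpu_files
      are hypothetical helpers, not part of Slurm, and the parsing
      handles only the simple "Gres=gpu:N" and "File=..." forms used
      here (no typed GRES, no bracket ranges). It does not reproduce
      what slurmd does when it reports its GRES count.</p>
    <pre>
```python
import re


def configured_gpus(slurm_conf, node):
    """Return N from a plain Gres=gpu:N field on the NodeName line of `node`, else 0."""
    for line in slurm_conf.splitlines():
        fields = line.split()
        if fields and fields[0].lower() == f"nodename={node}".lower():
            for field in fields[1:]:
                match = re.fullmatch(r"gres=gpu:(\d+)", field, re.IGNORECASE)
                if match:
                    return int(match.group(1))
    return 0


def gres_conf_gpu_files(gres_conf, node):
    """Return the File= paths of Name=gpu lines in gres.conf that name `node`."""
    paths = []
    for line in gres_conf.splitlines():
        pairs = (field.split("=", 1) for field in line.split() if "=" in field)
        entry = {key.lower(): value for key, value in pairs}
        if entry.get("nodename") == node and entry.get("name") == "gpu" and "file" in entry:
            paths.append(entry["file"])
    return paths


# The two files from this mail, with the NodeName entry on a single line.
SLURM_CONF = (
    "GresTypes=gpu\n"
    "NodeName=NName SocketsPerBoard=8 CoresPerSocket=1 RealMemory=8000 "
    "GRES=gpu:1 State=UNKNOWN\n"
)
GRES_CONF = "NodeName=NName Name=gpu File=/dev/tty0\n"

if __name__ == "__main__":
    wanted = configured_gpus(SLURM_CONF, "NName")
    files = gres_conf_gpu_files(GRES_CONF, "NName")
    print(f"slurm.conf asks for {wanted} gpu(s); gres.conf lists {len(files)} file(s): {files}")
```
</pre>
    <p>For the setup above both counts come out as 1, so the two files
      agree with each other as text. That would point towards what
      slurmd on the node itself sees (for example, whether this
      gres.conf is actually the one present there), though that is only
      a guess at this point.</p>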
    <p>Best regards,<br>
      Xaver<br>
    </p>
    <div class="moz-cite-prefix">On 19.07.23 10:23, Xaver Stiensmeier
      wrote:<br>
    </div>
    <blockquote type="cite"
      cite="mid:f8a4bbc6-b3b9-0ff3-7ab5-a7b340571dee@gmx.de">
      <p>Alright,</p>
      <p>I tried a few more things, but I still wasn't able to get past:
        srun: error: Unable to allocate resources: Invalid generic
        resource (gres) specification.</p>
      <p>I should mention that the node I am testing GPU scheduling on
        doesn't actually have a GPU, but Rob was kind enough to point
        out that you do not need one as long as gres.conf points to a
        file in /dev/. As mentioned, this is just for testing purposes:
        in the end we will run this on a node with a GPU, but none is
        available at the moment.</p>
      <p><b>The error isn't changing</b></p>
      <p>If I omit "GresTypes=gpu" and "Gres=gpu:1", I still get the
        same error.</p>
      <p><b>Debug Info</b></p>
      <p>I added the gpu debug flag and logged the following:</p>
      <p>[2023-07-18T14:59:45.026] restoring original state of nodes<br>
        [2023-07-18T14:59:45.026] select/cons_tres:
        part_data_create_array: select/cons_tres: preparing for 2
        partitions<br>
        [2023-07-18T14:59:45.026] error: GresPlugins changed from (null)
        to gpu ignored<br>
        [2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to
        change GresPlugins<br>
        [2023-07-18T14:59:45.026] read_slurm_conf: backup_controller not
        specified<br>
        [2023-07-18T14:59:45.026] error: GresPlugins changed from (null)
        to gpu ignored<br>
        [2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to
        change GresPlugins<br>
        [2023-07-18T14:59:45.026] select/cons_tres:
        select_p_reconfigure: select/cons_tres: reconfigure<br>
        [2023-07-18T14:59:45.027] select/cons_tres:
        part_data_create_array: select/cons_tres: preparing for 2
        partitions<br>
        [2023-07-18T14:59:45.027] No parameter for mcs plugin, default
        values set<br>
        [2023-07-18T14:59:45.027] mcs: MCSParameters = (null). ondemand
        set.<br>
        [2023-07-18T14:59:45.028] _slurm_rpc_reconfigure_controller:
        completed usec=5898<br>
        [2023-07-18T14:59:45.952]
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2</p>
      <p>I am a bit unsure what to do next to further investigate this
        issue.</p>
      <p>Best regards,<br>
        Xaver<br>
      </p>
      <div class="moz-cite-prefix">On 17.07.23 15:57, Groner, Rob wrote:<br>
      </div>
      <blockquote type="cite"
cite="mid:BL0PR02MB4499D0148BD114169F5D89AC803BA@BL0PR02MB4499.namprd02.prod.outlook.com">
        <div class="elementToProof"> That would certainly do it.  If you
          look at the slurmctld log when it comes up, it will say that
          it's marking that node as invalid because it has fewer (0) gres
          resources than you say it should have.  That's because slurmd
          on that node will come up and say "What gres resources??"</div>
        <div class="elementToProof"> <br>
        </div>
        <div class="elementToProof"> For testing purposes,  you can just
          create a dummy file on the node, then in gres.conf, point to
          that file as the "graphics file" interface.  As long as you
          don't try to actually use it as a graphics file, that should
          be enough for that node to think it has gres/gpu resources. 
          That's what I do in my vagrant slurm cluster.</div>
        <div class="elementToProof"> <br>
        </div>
        <div class="elementToProof"> Rob</div>
        <div class="elementToProof"> <br>
        </div>
        <hr tabindex="-1">
        <div id="divRplyFwdMsg" dir="ltr"><b>From:</b> slurm-users <a
            class="moz-txt-link-rfc2396E"
            href="mailto:slurm-users-bounces@lists.schedmd.com"
            moz-do-not-send="true">&lt;slurm-users-bounces@lists.schedmd.com&gt;</a>
          on behalf of Xaver Stiensmeier <a
            class="moz-txt-link-rfc2396E"
            href="mailto:xaverstiensmeier@gmx.de" moz-do-not-send="true">&lt;xaverstiensmeier@gmx.de&gt;</a><br>
          <b>Sent:</b> Monday, July 17, 2023 9:43 AM<br>
          <b>To:</b> <a class="moz-txt-link-abbreviated
            moz-txt-link-freetext"
            href="mailto:slurm-users@lists.schedmd.com"
            moz-do-not-send="true">slurm-users@lists.schedmd.com</a> <a
            class="moz-txt-link-rfc2396E"
            href="mailto:slurm-users@lists.schedmd.com"
            moz-do-not-send="true">&lt;slurm-users@lists.schedmd.com&gt;</a><br>
          <b>Subject:</b> Re: [slurm-users] GRES and GPUs
          <div> </div>
        </div>
        <div class="BodyFragment"><span>
            <div class="PlainText">Hi Hermann,<br>
              <br>
              Good idea, but we are already using
              `SelectType=select/cons_tres`. After<br>
              setting everything up again (in case I made an unnoticed
              mistake), I saw<br>
              that the node got marked STATE=inval.<br>
              <br>
              To be honest, I thought I could just claim that a node has a
              gpu even if<br>
              it doesn't have one - just for testing purposes. Could
              this be the issue?<br>
              <br>
              Best regards,<br>
              Xaver Stiensmeier<br>
              <br>
              On 17.07.23 14:11, Hermann Schwärzler wrote:<br>
              > Hi Xaver,<br>
              ><br>
              > what kind of SelectType are you using in your
              slurm.conf?<br>
              ><br>
              > Per <a href="https://slurm.schedmd.com/gres.html"
                moz-do-not-send="true">https://slurm.schedmd.com/gres.html</a>
              you have to consider:<br>
              > "As for the --gpu* option, these options are only
              supported by Slurm's<br>
              > select/cons_tres plugin."<br>
              ><br>
              > So you can use "--gpus ..." only when you state<br>
              > SelectType              = select/cons_tres<br>
              > in your slurm.conf.<br>
              ><br>
              > But "--gres=gpu:1" should work always.<br>
              ><br>
              > Regards<br>
              > Hermann<br>
              ><br>
              ><br>
              > On 7/17/23 13:43, Xaver Stiensmeier wrote:<br>
              >> Hey,<br>
              >><br>
              >> I am currently trying to understand how I can
              schedule a job that<br>
              >> needs a GPU.<br>
              >><br>
              >> I read about GRES <a
                href="https://slurm.schedmd.com/gres.html"
                moz-do-not-send="true">https://slurm.schedmd.com/gres.html</a>
              and tried to use:<br>
              >><br>
              >> GresTypes=gpu<br>
              >> NodeName=test Gres=gpu:1<br>
              >><br>
              >> But calling - after a 'sudo scontrol
              reconfigure':<br>
              >><br>
              >> srun --gpus 1 hostname<br>
              >><br>
              >> didn't work:<br>
              >><br>
              >> srun: error: Unable to allocate resources:
              Invalid generic resource<br>
              >> (gres) specification<br>
              >><br>
              >> so I read more <a
                href="https://slurm.schedmd.com/gres.conf.html"
                moz-do-not-send="true">https://slurm.schedmd.com/gres.conf.html</a>
              but that<br>
              >> didn't really help me.<br>
              >><br>
              >><br>
              >> I am rather confused. GRES claims to be generic
              resources but then it<br>
              >> comes with three defined resources (GPU, MPS,
              MIG) and using one of<br>
              >> those didn't work in my case.<br>
              >><br>
              >> Obviously, I am misunderstanding something, but I
              am unsure where to<br>
              >> look.<br>
              >><br>
              >><br>
              >> Best regards,<br>
              >> Xaver Stiensmeier<br>
              >><br>
              ><br>
              <br>
            </div>
          </span></div>
      </blockquote>
    </blockquote>
  </body>
</html>