<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <p>Hey everyone,</p>
    <p>I am answering my own question:<br>
      It wasn't working because I needed to <b>restart slurmd</b> on the
      node, too. So the full "test GPU management without a GPU"
      workflow is:</p>
    <p>1. Start your slurm cluster.<br>
      2. Add a GPU to a node of your choice in <b>slurm.conf</b><br>
    </p>
    <p>For example:<b><br>
      </b></p>
    <blockquote>
      <p><b>DebugFlags=GRES </b># consider this for initial setup.<br>
        <b>SelectType=select/cons_tres</b><b><br>
        </b><b>GresTypes=gpu</b><br>
        NodeName=master SocketsPerBoard=8 CoresPerSocket=1
        RealMemory=8000 <b>GRES=gpu:1</b> State=UNKNOWN</p>
    </blockquote>
    <p>3. Register it in <b>gres.conf </b>and point it at <b>some file</b><br>
    </p>
    <blockquote>
      <p>NodeName=master Name=gpu File=/dev/tty0 Count=1 # count seems
        to be optional<br>
      </p>
    </blockquote>
    <p>4. Restart slurmctld (on the master) and slurmd (on the gpu node)<b><br>
      </b></p>
    <blockquote>
      <p><b>sudo systemctl restart slurmctld</b><b><br>
        </b><b>sudo systemctl restart slurmd</b></p>
    </blockquote>
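    <p>To catch a mismatch between the two files before restarting, they
      can be compared with a small script. This is only a minimal sketch:
      the helper names (configured_gpus, declared_gpus, check_gres_config)
      are made up for illustration, and it assumes the simple layout from
      the example above, that is a plain gpu:N count in slurm.conf and one
      gres.conf line per device.</p>
<pre>
```shell
# Hypothetical helper: print the gpu count configured for node $1
# in the slurm.conf-style file $2 (expects a plain "Gres=gpu:N" entry).
configured_gpus() {
    grep -i "^NodeName=$1 " "$2" |
        sed -n 's/.*[Gg][Rr][Ee][Ss]=gpu:\([0-9][0-9]*\).*/\1/p' |
        head -n 1
}

# Hypothetical helper: count the gpu lines declared for node $1
# in the gres.conf-style file $2 (assumes one line per device).
declared_gpus() {
    grep "^NodeName=$1 " "$2" | grep -c " Name=gpu"
}

# Compare both counts for node $1; $2 is slurm.conf, $3 is gres.conf.
check_gres_config() {
    want=$(configured_gpus "$1" "$2")
    have=$(declared_gpus "$1" "$3")
    if [ "${want:-0}" = "$have" ]; then
        echo "ok: $1 has $have gpu entries in both files"
    else
        echo "mismatch: slurm.conf wants ${want:-0}, gres.conf declares $have"
        return 1
    fi
}
```
</pre>
    <p>Called as <b>check_gres_config master slurm.conf gres.conf</b>
      (with the paths to your own files), it prints "ok" when both counts
      agree and returns a non-zero exit status otherwise.</p>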
    <p>I haven't tested this solution thoroughly yet, but at least
      commands like:<b><br>
      </b></p>
    <blockquote>
      <p><b>sudo systemctl restart slurmd</b> <br>
        # master</p>
    </blockquote>
    <p>run without any issues afterwards.</p>
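    <p>To check whether the node actually registered its gres after the
      restart, the output of <b>scontrol show node</b> can be inspected.
      Again only a minimal sketch, assuming that output contains a State=
      field and a Gres= field; the helper name node_gres_ok is made up for
      illustration.</p>
<pre>
```shell
# Hypothetical helper: reads `scontrol show node NODE` output on stdin.
# Returns 0 if a gpu gres is listed and the state is not flagged invalid.
node_gres_ok() {
    out=$(cat)
    if printf '%s\n' "$out" | grep -Eq 'State=[^ ]*INVAL'; then
        echo "node is flagged invalid"
        return 1
    fi
    if printf '%s\n' "$out" | grep -q 'Gres=gpu'; then
        echo "gpu gres registered"
        return 0
    fi
    echo "no gpu gres reported"
    return 1
}

# Usage on a live cluster (node name "master" as in the example above):
#   scontrol show node master | node_gres_ok
```
</pre>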
    <p>Thank you for all your help!</p>
    <p>Best regards,<br>
      Xaver<br>
    </p>
    <div class="moz-cite-prefix">On 19.07.23 17:05, Xaver Stiensmeier
      wrote:<br>
    </div>
    <blockquote type="cite"
      cite="mid:333fa052-33b8-11e6-ca24-4c360fca09a8@gmx.de">
      <p>Hi Hermann,</p>
      <p>count doesn't make a difference, but I noticed that when I
        reconfigure slurm and do reloads afterwards, the error "gpu
        count lower than configured" no longer appears - so maybe it is
        just because a reconfigure is needed after reloading slurmctld -
        or maybe it doesn't show the error anymore, because the node is
        still invalid? However, I still get the error:</p>
      <p>    error: _slurm_rpc_node_registration node=NName: Invalid
        argument<br>
      </p>
      <p>If I understand correctly, this is telling me that there's
        something wrong with my slurm.conf. I know that all pre-existing
        parameters are correct, so I assume it must be the gpus entry,
        but I don't see where it's wrong:</p>
      <blockquote>
        <p>NodeName=NName SocketsPerBoard=8 CoresPerSocket=1
          RealMemory=8000 Gres=gpu:1 State=CLOUD # bibiserv<br>
        </p>
      </blockquote>
      <p>Thanks for all the help,<br>
        Xaver<br>
      </p>
      <div class="moz-cite-prefix">On 19.07.23 15:04, Hermann Schwärzler
        wrote:<br>
      </div>
      <blockquote type="cite"
        cite="mid:fab27a15-fae6-ab90-d934-398516f35f4f@uibk.ac.at">Hi
        Xaver, <br>
        <br>
        I think you are missing the "Count=..." part in gres.conf <br>
        <br>
        It should read <br>
        <br>
        NodeName=NName Name=gpu File=/dev/tty0 Count=1 <br>
        <br>
        in your case. <br>
        <br>
        Regards, <br>
        Hermann <br>
        <br>
        On 7/19/23 14:19, Xaver Stiensmeier wrote: <br>
        <blockquote type="cite">Okay, <br>
          <br>
          thanks to S. Zhang I was able to figure out why nothing
          changed. While I did restart slurmctld at the beginning of my
          tests, I didn't do so later, because I felt like it was
          unnecessary, but it is right there in the fourth line of the
          log that this is needed. Somehow I misread it and thought it
          automatically restarted slurmctld. <br>
          <br>
          Given the setup: <br>
          <br>
          slurm.conf <br>
          ... <br>
          GresTypes=gpu <br>
          NodeName=NName SocketsPerBoard=8 CoresPerSocket=1
          RealMemory=8000 GRES=gpu:1 State=UNKNOWN <br>
          ... <br>
          <br>
          gres.conf <br>
          NodeName=NName Name=gpu File=/dev/tty0 <br>
          <br>
          When restarting, I get the following error: <br>
          <br>
          error: Setting node NName state to INVAL with reason:gres/gpu
          count reported lower than configured (0 < 1) <br>
          <br>
          So it is still not working, but at least I get a more helpful
          log message. Because I know that this /dev/tty trick works, I
          am still unsure where the current error lies, but I will try
          to investigate it further. I am thankful for any ideas in that
          regard. <br>
          <br>
          Best regards, <br>
          Xaver <br>
          <br>
          On 19.07.23 10:23, Xaver Stiensmeier wrote: <br>
          <blockquote type="cite"> <br>
            Alright, <br>
            <br>
            I tried a few more things, but I still wasn't able to get
            past: srun: error: Unable to allocate resources: Invalid
            generic resource (gres) specification. <br>
            <br>
            I should mention that the node I am trying to test GPU with,
            doesn't really have a gpu, but Rob was kind enough to point out
            that you do not need a gpu as long as you just link to a
            file in /dev/ in the gres.conf. As mentioned: This is just
            for testing purposes - in the end we will run this on a node
            with a gpu, but it is not available at the moment. <br>
            <br>
            *The error isn't changing* <br>
            <br>
            If I omit "GresTypes=gpu" and "Gres=gpu:1", I still get the
            same error. <br>
            <br>
            *Debug Info* <br>
            <br>
            I added the gpu debug flag and logged the following: <br>
            <br>
            [2023-07-18T14:59:45.026] restoring original state of nodes
            <br>
            [2023-07-18T14:59:45.026] select/cons_tres:
            part_data_create_array: select/cons_tres: preparing for 2
            partitions <br>
            [2023-07-18T14:59:45.026] error: GresPlugins changed from
            (null) to gpu ignored <br>
            [2023-07-18T14:59:45.026] error: Restart the slurmctld
            daemon to change GresPlugins <br>
            [2023-07-18T14:59:45.026] read_slurm_conf: backup_controller
            not specified <br>
            [2023-07-18T14:59:45.026] error: GresPlugins changed from
            (null) to gpu ignored <br>
            [2023-07-18T14:59:45.026] error: Restart the slurmctld
            daemon to change GresPlugins <br>
            [2023-07-18T14:59:45.026] select/cons_tres:
            select_p_reconfigure: select/cons_tres: reconfigure <br>
            [2023-07-18T14:59:45.027] select/cons_tres:
            part_data_create_array: select/cons_tres: preparing for 2
            partitions <br>
            [2023-07-18T14:59:45.027] No parameter for mcs plugin,
            default values set <br>
            [2023-07-18T14:59:45.027] mcs: MCSParameters = (null).
            ondemand set. <br>
            [2023-07-18T14:59:45.028] _slurm_rpc_reconfigure_controller:
            completed usec=5898 <br>
            [2023-07-18T14:59:45.952]
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2<br>
            <br>
            I am a bit unsure what to do next to further investigate
            this issue. <br>
            <br>
            Best regards, <br>
            Xaver <br>
            <br>
            On 17.07.23 15:57, Groner, Rob wrote: <br>
            <blockquote type="cite">That would certainly do it.  If you
              look at the slurmctld log when it comes up, it will say
              that it's marking that node as invalid because it has fewer
              (0) gres resources than you say it should have.  That's
              because slurmd on that node will come up and say "What
              gres resources??" <br>
              <br>
              For testing purposes,  you can just create a dummy file on
              the node, then in gres.conf, point to that file as the
              "graphics file" interface.  As long as you don't try to
              actually use it as a graphics file, that should be enough
              for that node to think it has gres/gpu resources. That's
              what I do in my vagrant slurm cluster. <br>
              <br>
              Rob <br>
              <br>
------------------------------------------------------------------------
              <br>
              *From:* slurm-users <a class="moz-txt-link-rfc2396E"
                href="mailto:slurm-users-bounces@lists.schedmd.com"
                moz-do-not-send="true"><slurm-users-bounces@lists.schedmd.com></a>
              on behalf of Xaver Stiensmeier <a
                class="moz-txt-link-rfc2396E"
                href="mailto:xaverstiensmeier@gmx.de"
                moz-do-not-send="true"><xaverstiensmeier@gmx.de></a>
              <br>
              *Sent:* Monday, July 17, 2023 9:43 AM <br>
              *To:* <a class="moz-txt-link-abbreviated
                moz-txt-link-freetext"
                href="mailto:slurm-users@lists.schedmd.com"
                moz-do-not-send="true">slurm-users@lists.schedmd.com</a>
              <a class="moz-txt-link-rfc2396E"
                href="mailto:slurm-users@lists.schedmd.com"
                moz-do-not-send="true"><slurm-users@lists.schedmd.com></a>
              <br>
              *Subject:* Re: [slurm-users] GRES and GPUs <br>
              Hi Hermann, <br>
              <br>
              Good idea, but we are already using
              `SelectType=select/cons_tres`. After <br>
              setting everything up again (in case I made an unnoticed
              mistake), I saw <br>
              that the node got marked STATE=inval. <br>
              <br>
              To be honest, I thought I could just claim that a node has a
              gpu even if <br>
              it doesn't have one - just for testing purposes. Could
              this be the issue? <br>
              <br>
              Best regards, <br>
              Xaver Stiensmeier <br>
              <br>
              On 17.07.23 14:11, Hermann Schwärzler wrote: <br>
              > Hi Xaver, <br>
              > <br>
              > what kind of SelectType are you using in your
              slurm.conf? <br>
              > <br>
              > Per
              <a class="moz-txt-link-rfc2396E"
                href="https://slurm.schedmd.com/gres.html"
                moz-do-not-send="true"><https://slurm.schedmd.com/gres.html></a>
              you have to consider: <br>
              > "As for the --gpu* option, these options are only
              supported by Slurm's <br>
              > select/cons_tres plugin." <br>
              > <br>
              > So you can use "--gpus ..." only when you state <br>
              > SelectType              = select/cons_tres <br>
              > in your slurm.conf. <br>
              > <br>
              > But "--gres=gpu:1" should work always. <br>
              > <br>
              > Regards <br>
              > Hermann <br>
              > <br>
              > <br>
              > On 7/17/23 13:43, Xaver Stiensmeier wrote: <br>
              >> Hey, <br>
              >> <br>
              >> I am currently trying to understand how I can
              schedule a job that <br>
              >> needs a GPU. <br>
              >> <br>
              >> I read about GRES
              <a class="moz-txt-link-rfc2396E"
                href="https://slurm.schedmd.com/gres.html"
                moz-do-not-send="true"><https://slurm.schedmd.com/gres.html></a>
              and tried to use: <br>
              >> <br>
              >> GresTypes=gpu <br>
              >> NodeName=test Gres=gpu:1 <br>
              >> <br>
              >> But calling - after a 'sudo scontrol
              reconfigure': <br>
              >> <br>
              >> srun --gpus 1 hostname <br>
              >> <br>
              >> didn't work: <br>
              >> <br>
              >> srun: error: Unable to allocate resources:
              Invalid generic resource <br>
              >> (gres) specification <br>
              >> <br>
              >> so I read more
              <a class="moz-txt-link-rfc2396E"
                href="https://slurm.schedmd.com/gres.conf.html"
                moz-do-not-send="true"><https://slurm.schedmd.com/gres.conf.html></a>
              but that <br>
              >> didn't really help me. <br>
              >> <br>
              >> <br>
              >> I am rather confused. GRES claims to be generic
              resources but then it <br>
              >> comes with three defined resources (GPU, MPS,
              MIG) and using one of <br>
              >> those didn't work in my case. <br>
              >> <br>
              >> Obviously, I am misunderstanding something, but I
              am unsure where to <br>
              >> look. <br>
              >> <br>
              >> <br>
              >> Best regards, <br>
              >> Xaver Stiensmeier <br>
              >> <br>
              > <br>
              <br>
            </blockquote>
          </blockquote>
        </blockquote>
        <br>
      </blockquote>
    </blockquote>
  </body>
</html>