<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    Damn,<br>
    <br>
    I almost always forget that most of the submission part is done on
    the master :/<br>
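    <br>
    (That is, slurmctld builds the job credential itself, so the
    submitting user must resolve on the head node as well. A quick check,
    assuming the account in question is "turing" as below:<br>
    <font face="monospace">ssh slurm-master getent passwd turing</font>)<br>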
    <br>
    Best<br>
    Marcus<br>
    <br>
    <div class="moz-cite-prefix">On 10/8/19 11:45 AM, Eddy Swan wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CAD8OGB7jrb-V0h776qbWmcs-95uR2PvcZrBU=F20YaX6FHSB2A@mail.gmail.com">
      <div dir="ltr">
        <div dir="ltr">Hi Sean,
          <div><br>
          </div>
          <div>Thank you so much for your additional information.</div>
          <div>The issue was indeed due to the missing user on the head node.</div>
          <div>After I configured the LDAP client on slurm-master, the
            srun command now works with an LDAP account.</div>
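          <div><br>
          </div>
          <div>(For anyone hitting the same error: assuming sssd is used
            as the LDAP client, a quick way to confirm the head node can
            now resolve the account is to run, on slurm-master:</div>
          <div><font face="monospace">getent passwd &lt;ldap_user&gt;<br>
              id &lt;ldap_user&gt;</font></div>
          <div>Both should return the same uid/gid as on the compute
            nodes.)</div>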
          <div><br>
          </div>
          <div>Best regards,</div>
          <div>Eddy Swan</div>
          <div><br>
          </div>
        </div>
        <div class="gmail_quote">
          <div dir="ltr" class="gmail_attr">On Tue, Oct 8, 2019 at 4:15
            PM Sean Crosby <<a href="mailto:scrosby@unimelb.edu.au">scrosby@unimelb.edu.au</a>>
            wrote:<br>
          </div>
          <blockquote class="gmail_quote" style="margin:0px 0px 0px
            0.8ex;border-left:1px solid
            rgb(204,204,204);padding-left:1ex">
            <div>
              <div dir="ltr">
                <div>Looking at the Slurm code, it appears to be failing
                  in a call to <span>
                    getpwuid_r on the ctld<br>
                  </span></div>
                <div><span><br>
                  </span></div>
                <div><span>What does the following return (on <font face="monospace">slurm-master</font>):<br>
                  </span></div>
                <div><span><br>
                  </span></div>
                <div><span>getent passwd turing</span></div>
                <div><span>getent passwd 1000<br>
                  </span></div>
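                <div><span><br>
                  </span></div>
                <div><span>If the lookup works there, you should see a
                    normal passwd line, e.g. (home dir and shell will
                    vary):<br>
                  </span></div>
                <div><span><font face="monospace">turing:x:1000:1000::/home/turing:/bin/bash</font><br>
                  </span></div>
                <div><span>No output would mean the ctld host cannot
                    resolve the user, which is exactly where
                    slurm_cred_create gives up.<br>
                  </span></div>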
                <div><span><br>
                  </span></div>
                <div><span>Sean<br>
                  </span></div>
                <div><span><br>
                  </span></div>
                <div>
                  <div dir="ltr"><br>
                  </div>
                  <div dir="ltr">--</div>
                  Sean Crosby | <span>Senior DevOps/HPC Engineer and HPC
                    Team Lead</span><br>
                  Research Platform Services | Business Services<br>
                  CoEPP Research Computing | School of Physics<br>
                  The University of Melbourne, Victoria 3010 Australia<br>
                  <br>
                </div>
              </div>
              <br>
              <div class="gmail_quote">
                <div dir="ltr" class="gmail_attr">On Mon, 7 Oct 2019 at
                  18:36, Eddy Swan <<a
                    href="mailto:eddys@prestolabs.io" target="_blank">eddys@prestolabs.io</a>>
                  wrote:<br>
                </div>
                <blockquote class="gmail_quote" style="margin:0px 0px
                  0px 0.8ex;border-left:1px solid
                  rgb(204,204,204);padding-left:1ex">
                  <div dir="ltr">
                    <div dir="ltr">Hi Marcus,
                      <div><br>
                      </div>
                      <div>piglet-17 as submit host:</div>
                      <div><font face="monospace">$ id 1000<br>
                          uid=1000(turing) gid=1000(turing)
                          groups=1000(turing),10(wheel),991(vboxusers)</font><br>
                      </div>
                      <div><br>
                      </div>
                      <div>piglet-18:</div>
                      <div><font face="monospace">$ id 1000<br>
                          uid=1000(turing) gid=1000(turing)
                          groups=1000(turing),10(wheel),992(vboxusers)</font><br>
                      </div>
                      <div><br>
                      </div>
                      <div>UID 1000 is a local user on each node
                        (piglet-17~19).</div>
                      <div>I also tried to submit as an LDAP user, but
                        still got the same error.</div>
                      <div><br>
                      </div>
                      <div>Best regards,</div>
                      <div>Eddy Swan</div>
                    </div>
                    <br>
                    <div class="gmail_quote">
                      <div dir="ltr" class="gmail_attr">On Mon, Oct 7,
                        2019 at 2:41 PM Marcus Wagner <<a
                          href="mailto:wagner@itc.rwth-aachen.de"
                          target="_blank">wagner@itc.rwth-aachen.de</a>>
                        wrote:<br>
                      </div>
                      <blockquote class="gmail_quote" style="margin:0px
                        0px 0px 0.8ex;border-left:1px solid
                        rgb(204,204,204);padding-left:1ex">
                        <div bgcolor="#FFFFFF">Hi Eddy,<br>
                          <br>
                          what is the result of "id 1000" on the submit
                          host and on piglet-18?<br>
                          <br>
                          Best<br>
                          Marcus<br>
                          <br>
                          <div>On 10/7/19 8:07 AM, Eddy Swan wrote:<br>
                          </div>
                          <blockquote type="cite">
                            <div dir="ltr">Hi All,
                              <div><br>
                              </div>
                              <div>I am currently testing Slurm version
                                19.05.3-2 on CentOS 7 with one master
                                and three nodes.</div>
                              <div>I used the same configuration that
                                works on version 17.02.7, but for some
                                reason it does not work on 19.05.3-2.<br>
                              </div>
                              <div><br>
                              </div>
                              <div><font face="monospace">$ srun
                                  hostname<br>
                                  srun: error: Unable to create step for
                                  job 19: Error generating job
                                  credential<br>
                                  srun: Force Terminated job 19</font><br>
                              </div>
                              <div><br>
                              </div>
                              <div>If I run it as root, it works fine.</div>
                              <div><br>
                              </div>
                              <div><font face="monospace">$ sudo srun
                                  hostname<br>
                                  piglet-18</font><br>
                              </div>
                              <div><br>
                              </div>
                              <div>Configuration:</div>
                              <div><font face="monospace">$ cat
                                  /etc/slurm/slurm.conf<br>
                                  # Common<br>
                                  ControlMachine=slurm-master<br>
                                  ControlAddr=10.15.131.32<br>
                                  ClusterName=slurm-cluster<br>
                                  RebootProgram="/usr/sbin/reboot"<br>
                                  <br>
                                  MailProg=/bin/mail<br>
                                  ProctrackType=proctrack/cgroup<br>
                                  ReturnToService=2<br>
                                  StateSaveLocation=/var/spool/slurmctld<br>
                                  TaskPlugin=task/cgroup<br>
                                  <br>
                                  # LOGGING AND ACCOUNTING<br>
AccountingStorageType=accounting_storage/filetxt<br>
AccountingStorageLoc=/var/log/slurm_acct/slurm_jobacct.log<br>
JobCompLoc=/var/log/slurm_acct/slurm_jobcomp.log<br>
JobAcctGatherType=jobacct_gather/cgroup<br>
                                  <br>
                                  # RESOURCES<br>
                                  MemLimitEnforce=no<br>
                                  <br>
                                  ## Rack 1<br>
                                  NodeName=piglet-19 NodeAddr=10.15.2.19
                                  RealMemory=64000 TmpDisk=512000
                                  Sockets=2 CoresPerSocket=28
                                  ThreadsPerCore=1 CPUSpecList=0,1
                                  Weight=2<br>
                                  NodeName=piglet-18 NodeAddr=10.15.2.18
                                  RealMemory=128000 TmpDisk=512000
                                  Sockets=2 CoresPerSocket=14
                                  ThreadsPerCore=1 CPUSpecList=0,1
                                  Weight=2<br>
                                  NodeName=piglet-17 NodeAddr=10.15.2.17
                                  RealMemory=64000 TmpDisk=512000
                                  Sockets=2 CoresPerSocket=28
                                  ThreadsPerCore=1 CPUSpecList=0,1
                                  Weight=3<br>
                                  <br>
                                  # Preempt<br>
                                  PreemptMode=REQUEUE<br>
                                  PreemptType=preempt/qos<br>
                                  <br>
                                  PartitionName=batch Nodes=ALL
                                  MaxTime=2880 OverSubscribe=YES
                                  State=UP PreemptMode=REQUEUE
                                  PriorityTier=10 Default=YES<br>
                                  <br>
                                  # TIMERS<br>
                                  KillWait=30<br>
                                  MinJobAge=300<br>
                                  MessageTimeout=3<br>
                                  <br>
                                  # SCHEDULING<br>
                                  FastSchedule=1<br>
                                  SchedulerType=sched/backfill<br>
                                  SelectType=select/cons_res<br>
                                  #SelectTypeParameters=CR_Core_Memory<br>
                                  SelectTypeParameters=CR_CPU_Memory<br>
                                  DefMemPerCPU=128<br>
                                  <br>
                                  # Limit<br>
                                  MaxArraySize=201<br>
                                  <br>
                                  # slurmctld<br>
                                  SlurmctldDebug=5<br>
SlurmctldLogFile=/var/log/slurm/slurmctld.log<br>
SlurmctldPidFile=/var/slurm/slurmctld.pid<br>
                                  SlurmctldPort=6817<br>
                                  SlurmctldTimeout=60<br>
                                  SlurmUser=slurm<br>
                                  <br>
                                  # slurmd<br>
                                  SlurmdDebug=5<br>
                                  SlurmdLogFile=/var/log/slurmd.log<br>
                                  SlurmdPort=6818<br>
                                  SlurmdSpoolDir=/var/spool/slurmd<br>
                                  SlurmdTimeout=300<br>
                                  <br>
                                  # REQUEUE<br>
                                  #RequeueExitHold=1-199,201-255<br>
                                  #RequeueExit=200<br>
                                  RequeueExitHold=201-255<br>
                                  RequeueExit=200<br>
                                </font></div>
                              <div><font face="monospace"><br>
                                </font></div>
                              <div><font face="arial, sans-serif">slurmctld.log
                                  on slurm-master:</font></div>
                              <div><font face="monospace">[2019-10-07T13:38:47.724]
                                  debug:  sched: Running job scheduler<br>
                                  [2019-10-07T13:38:49.254] error:
                                  slurm_auth_get_host: Lookup failed:
                                  Unknown host<br>
                                  [2019-10-07T13:38:49.255] sched:
                                  _slurm_rpc_allocate_resources JobId=19
                                  NodeList=piglet-18 usec=959<br>
                                  [2019-10-07T13:38:49.259] debug:
                                   laying out the 1 tasks on 1 hosts
                                  piglet-18 dist 2<br>
                                  [2019-10-07T13:38:49.260] error:
                                  slurm_cred_create: getpwuid failed for
                                  uid=1000<br>
                                  [2019-10-07T13:38:49.260] error:
                                  slurm_cred_create error<br>
                                  [2019-10-07T13:38:49.262]
                                  _job_complete: JobId=19 WTERMSIG 1<br>
                                  [2019-10-07T13:38:49.265]
                                  _job_complete: JobId=19 done<br>
                                  [2019-10-07T13:38:49.270] debug:
                                   sched: Running job scheduler<br>
                                  [2019-10-07T13:38:56.823] debug:
                                   sched: Running job scheduler<br>
                                  [2019-10-07T13:39:13.504] debug:
                                   backfill: beginning<br>
                                  [2019-10-07T13:39:13.504] debug:
                                   backfill: no jobs to backfill<br>
                                  [2019-10-07T13:39:40.871] debug:
                                   Spawning ping agent for piglet-19<br>
                                  [2019-10-07T13:39:43.504] debug:
                                   backfill: beginning<br>
                                  [2019-10-07T13:39:43.504] debug:
                                   backfill: no jobs to backfill<br>
                                  [2019-10-07T13:39:46.999] error:
                                  slurm_auth_get_host: Lookup failed:
                                  Unknown host<br>
                                  [2019-10-07T13:39:47.001] sched:
                                  _slurm_rpc_allocate_resources JobId=20
                                  NodeList=piglet-18 usec=979<br>
                                  [2019-10-07T13:39:47.005] debug:
                                   laying out the 1 tasks on 1 hosts
                                  piglet-18 dist 2<br>
                                  [2019-10-07T13:39:47.144]
                                  _job_complete: JobId=20 WEXITSTATUS 0<br>
                                  [2019-10-07T13:39:47.147]
                                  _job_complete: JobId=20 done<br>
                                  [2019-10-07T13:39:47.158] debug:
                                   sched: Running job scheduler<br>
                                  [2019-10-07T13:39:48.428] error:
                                  slurm_auth_get_host: Lookup failed:
                                  Unknown host<br>
                                  [2019-10-07T13:39:48.429] sched:
                                  _slurm_rpc_allocate_resources JobId=21
                                  NodeList=piglet-18 usec=1114<br>
                                  [2019-10-07T13:39:48.434] debug:
                                   laying out the 1 tasks on 1 hosts
                                  piglet-18 dist 2<br>
                                  [2019-10-07T13:39:48.559]
                                  _job_complete: JobId=21 WEXITSTATUS 0<br>
                                  [2019-10-07T13:39:48.560]
                                  _job_complete: JobId=21 done<br>
                                </font></div>
                              <div><br>
                              </div>
                              <div>slurmd.log on piglet-18:</div>
                              <div><font face="monospace">[2019-10-07T13:38:42.746]
                                  debug:  _rpc_terminate_job, uid = 3001<br>
                                  [2019-10-07T13:38:42.747] debug:
                                   credential for job 17 revoked<br>
                                  [2019-10-07T13:38:47.721] debug:
                                   _rpc_terminate_job, uid = 3001<br>
                                  [2019-10-07T13:38:47.722] debug:
                                   credential for job 18 revoked<br>
                                  [2019-10-07T13:38:49.267] debug:
                                   _rpc_terminate_job, uid = 3001<br>
                                  [2019-10-07T13:38:49.268] debug:
                                   credential for job 19 revoked<br>
                                  [2019-10-07T13:39:47.014] launch task
                                  20.0 request from UID:0 GID:0
                                  HOST:10.15.2.19 PORT:62137<br>
                                  [2019-10-07T13:39:47.014] debug:
                                   Checking credential with 404 bytes of
                                  sig data<br>
                                  [2019-10-07T13:39:47.016] _run_prolog:
                                  run job script took usec=7<br>
                                  [2019-10-07T13:39:47.016] _run_prolog:
                                  prolog with lock for job 20 ran for 0
                                  seconds<br>
                                  [2019-10-07T13:39:47.026] debug:
                                   AcctGatherEnergy NONE plugin loaded<br>
                                  [2019-10-07T13:39:47.026] debug:
                                   AcctGatherProfile NONE plugin loaded<br>
                                  [2019-10-07T13:39:47.026] debug:
                                   AcctGatherInterconnect NONE plugin
                                  loaded<br>
                                  [2019-10-07T13:39:47.026] debug:
                                   AcctGatherFilesystem NONE plugin
                                  loaded<br>
                                  [2019-10-07T13:39:47.026] debug:
                                   switch NONE plugin loaded<br>
                                  [2019-10-07T13:39:47.028] [20.0]
                                  debug:  CPUs:28 Boards:1 Sockets:2
                                  CoresPerSocket:14 ThreadsPerCore:1<br>
                                  [2019-10-07T13:39:47.028] [20.0]
                                  debug:  Job accounting gather cgroup
                                  plugin loaded<br>
                                  [2019-10-07T13:39:47.028] [20.0]
                                  debug:  cont_id hasn't been set yet
                                  not running poll<br>
                                  [2019-10-07T13:39:47.029] [20.0]
                                  debug:  Message thread started pid =
                                  30331<br>
                                  [2019-10-07T13:39:47.030] [20.0]
                                  debug:  task/cgroup: now constraining
                                  jobs allocated cores<br>
                                  [2019-10-07T13:39:47.030] [20.0]
                                  debug:  task/cgroup: loaded<br>
                                  [2019-10-07T13:39:47.030] [20.0]
                                  debug:  Checkpoint plugin loaded:
                                  checkpoint/none<br>
                                  [2019-10-07T13:39:47.030] [20.0] Munge
                                  credential signature plugin loaded<br>
                                  [2019-10-07T13:39:47.031] [20.0]
                                  debug:  job_container none plugin
                                  loaded<br>
                                  [2019-10-07T13:39:47.031] [20.0]
                                  debug:  mpi type = none<br>
                                  [2019-10-07T13:39:47.031] [20.0]
                                  debug:  xcgroup_instantiate: cgroup
                                  '/sys/fs/cgroup/freezer/slurm' already
                                  exists<br>
                                  [2019-10-07T13:39:47.031] [20.0]
                                  debug:  spank: opening plugin stack
                                  /etc/slurm/plugstack.conf<br>
                                  [2019-10-07T13:39:47.031] [20.0]
                                  debug:  mpi type = (null)<br>
                                  [2019-10-07T13:39:47.031] [20.0]
                                  debug:  mpi/none: slurmstepd prefork<br>
                                  [2019-10-07T13:39:47.031] [20.0]
                                  debug:  xcgroup_instantiate: cgroup
                                  '/sys/fs/cgroup/cpuset/slurm' already
                                  exists<br>
                                  [2019-10-07T13:39:47.032] [20.0]
                                  debug:  task/cgroup: job abstract
                                  cores are '2'<br>
                                  [2019-10-07T13:39:47.032] [20.0]
                                  debug:  task/cgroup: step abstract
                                  cores are '2'<br>
                                  [2019-10-07T13:39:47.032] [20.0]
                                  debug:  task/cgroup: job physical
                                  cores are '4'<br>
                                  [2019-10-07T13:39:47.032] [20.0]
                                  debug:  task/cgroup: step physical
                                  cores are '4'<br>
                                  [2019-10-07T13:39:47.065] [20.0] debug
                                  level = 2<br>
                                  [2019-10-07T13:39:47.065] [20.0]
                                  starting 1 tasks<br>
                                  [2019-10-07T13:39:47.066] [20.0] task
                                  0 (30336) started 2019-10-07T13:39:47<br>
                                  [2019-10-07T13:39:47.066] [20.0]
                                  debug:
                                   jobacct_gather_cgroup_cpuacct_attach_task:
                                  jobid 20 stepid 0 taskid 0 max_task_id
                                  0<br>
                                  [2019-10-07T13:39:47.066] [20.0]
                                  debug:  xcgroup_instantiate: cgroup
                                  '/sys/fs/cgroup/cpuacct/slurm' already
                                  exists<br>
                                  [2019-10-07T13:39:47.067] [20.0]
                                  debug:
                                   jobacct_gather_cgroup_memory_attach_task:
                                  jobid 20 stepid 0 taskid 0 max_task_id
                                  0<br>
                                  [2019-10-07T13:39:47.067] [20.0]
                                  debug:  xcgroup_instantiate: cgroup
                                  '/sys/fs/cgroup/memory/slurm' already
                                  exists<br>
                                  [2019-10-07T13:39:47.068] [20.0]
                                  debug:  IO handler started pid=30331<br>
                                  [2019-10-07T13:39:47.099] [20.0]
                                  debug:  jag_common_poll_data: Task 0
                                  pid 30336 ave_freq = 1597534 mem
                                  size/max 0/0 vmem size/max
                                  210853888/210853888, disk read
                                  size/max (0/0), disk write size/max
                                  (0/0), time 0.000000(0+0) Energy
                                  tot/max 0/0 TotPower 0 MaxPower 0
                                  MinPower 0<br>
                                  [2019-10-07T13:39:47.101] [20.0]
                                  debug:  mpi type = (null)<br>
                                  [2019-10-07T13:39:47.101] [20.0]
                                  debug:  Using mpi/none<br>
                                  [2019-10-07T13:39:47.102] [20.0]
                                  debug:  CPUs:28 Boards:1 Sockets:2
                                  CoresPerSocket:14 ThreadsPerCore:1<br>
                                  [2019-10-07T13:39:47.104] [20.0]
                                  debug:  Sending launch resp rc=0<br>
                                  [2019-10-07T13:39:47.105] [20.0] task
                                  0 (30336) exited with exit code 0.<br>
                                  [2019-10-07T13:39:47.139] [20.0]
                                  debug:  step_terminate_monitor_stop
                                  signaling condition<br>
                                  [2019-10-07T13:39:47.139] [20.0]
                                  debug:  Waiting for IO<br>
                                  [2019-10-07T13:39:47.140] [20.0]
                                  debug:  Closing debug channel<br>
                                  [2019-10-07T13:39:47.140] [20.0]
                                  debug:  IO handler exited, rc=0<br>
                                  [2019-10-07T13:39:47.148] [20.0]
                                  debug:  Message thread exited<br>
                                  [2019-10-07T13:39:47.149] [20.0] done
                                  with job</font><br>
                              </div>
                              <div><br>
                              </div>
                              <div>I am not sure what I am missing. I hope
                                someone can point out what I am doing
                                wrong here.</div>
                              <div>Thank you.</div>
                              <div><br>
                              </div>
                              <div>Best regards,</div>
                              <div>Eddy Swan</div>
                              <div><br>
                              </div>
                            </div>
                          </blockquote>
                          <br>
                          <pre cols="72">-- 
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
<a href="mailto:wagner@itc.rwth-aachen.de" target="_blank">wagner@itc.rwth-aachen.de</a>
<a href="http://www.itc.rwth-aachen.de" target="_blank">www.itc.rwth-aachen.de</a>
</pre>
                        </div>
                      </blockquote>
                    </div>
                  </div>
                </blockquote>
              </div>
            </div>
          </blockquote>
        </div>
      </div>
    </blockquote>
    <br>
    <pre class="moz-signature" cols="72">-- 
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
<a class="moz-txt-link-abbreviated" href="mailto:wagner@itc.rwth-aachen.de">wagner@itc.rwth-aachen.de</a>
<a class="moz-txt-link-abbreviated" href="http://www.itc.rwth-aachen.de">www.itc.rwth-aachen.de</a>
</pre>
  </body>
</html>