<div dir="ltr"><div dir="ltr">Hi Marcus,<div><br></div><div>pilget-17 as submit host:</div><div><font face="monospace">$ id 1000<br>uid=1000(turing) gid=1000(turing) groups=1000(turing),10(wheel),991(vboxusers)</font><br></div><div><br></div><div>piglet-18:</div><div><font face="monospace">$ id 1000<br>uid=1000(turing) gid=1000(turing) groups=1000(turing),10(wheel),992(vboxusers)</font><br></div><div><br></div><div>id 1000 is a local user for each node (piglet-17~19).</div><div>I also tried to submit as ldap user, but still got the same error.</div><div><br></div><div>Best regards,</div><div>Eddy Swan</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Oct 7, 2019 at 2:41 PM Marcus Wagner <<a href="mailto:wagner@itc.rwth-aachen.de">wagner@itc.rwth-aachen.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
  
    
  
  <div bgcolor="#FFFFFF">
    Hi Eddy,<br>
    <br>
    What is the result of "id 1000" on the submit host and on piglet-18?<br>
    <br>
    Best<br>
    Marcus<br>
    <br>
    <div>On 10/7/19 8:07 AM, Eddy Swan wrote:<br>
    </div>
    <blockquote type="cite">
      
      <div dir="ltr">Hi All,
        <div><br>
        </div>
        <div>I am currently testing Slurm version 19.05.3-2 on CentOS 7
          with one master and three nodes.</div>
        <div>I am using the same configuration that works on version 17.02.7,
          but for some reason it does not work on 19.05.3-2.<br>
        </div>
        <div><br>
        </div>
        <div><font face="monospace">$ srun hostname<br>
            srun: error: Unable to create step for job 19: Error
            generating job credential<br>
            srun: Force Terminated job 19</font><br>
        </div>
        <div><br>
        </div>
        <div>If I run it as root, it works fine.</div>
        <div><br>
        </div>
        <div><font face="monospace">$ sudo srun hostname<br>
            piglet-18</font><br>
        </div>
        <div><br>
        </div>
        <div>Configuration:</div>
        <div><font face="monospace">$ cat /etc/slurm/slurm.conf<br>
            # Common<br>
            ControlMachine=slurm-master<br>
            ControlAddr=10.15.131.32<br>
            ClusterName=slurm-cluster<br>
            RebootProgram="/usr/sbin/reboot"<br>
            <br>
            MailProg=/bin/mail<br>
            ProctrackType=proctrack/cgroup<br>
            ReturnToService=2<br>
            StateSaveLocation=/var/spool/slurmctld<br>
            TaskPlugin=task/cgroup<br>
            <br>
            # LOGGING AND ACCOUNTING<br>
            AccountingStorageType=accounting_storage/filetxt<br>
            AccountingStorageLoc=/var/log/slurm_acct/slurm_jobacct.log<br>
            JobCompLoc=/var/log/slurm_acct/slurm_jobcomp.log<br>
            JobAcctGatherType=jobacct_gather/cgroup<br>
            <br>
            # RESOURCES<br>
            MemLimitEnforce=no<br>
            <br>
            ## Rack 1<br>
            NodeName=piglet-19 NodeAddr=10.15.2.19 RealMemory=64000
            TmpDisk=512000 Sockets=2 CoresPerSocket=28 ThreadsPerCore=1
            CPUSpecList=0,1 Weight=2<br>
            NodeName=piglet-18 NodeAddr=10.15.2.18 RealMemory=128000
            TmpDisk=512000 Sockets=2 CoresPerSocket=14 ThreadsPerCore=1
            CPUSpecList=0,1 Weight=2<br>
            NodeName=piglet-17 NodeAddr=10.15.2.17 RealMemory=64000
            TmpDisk=512000 Sockets=2 CoresPerSocket=28 ThreadsPerCore=1
            CPUSpecList=0,1 Weight=3<br>
            <br>
            # Preempt<br>
            PreemptMode=REQUEUE<br>
            PreemptType=preempt/qos<br>
            <br>
            PartitionName=batch Nodes=ALL MaxTime=2880 OverSubscribe=YES
            State=UP PreemptMode=REQUEUE PriorityTier=10 Default=YES<br>
            <br>
            # TIMERS<br>
            KillWait=30<br>
            MinJobAge=300<br>
            MessageTimeout=3<br>
            <br>
            # SCHEDULING<br>
            FastSchedule=1<br>
            SchedulerType=sched/backfill<br>
            SelectType=select/cons_res<br>
            #SelectTypeParameters=CR_Core_Memory<br>
            SelectTypeParameters=CR_CPU_Memory<br>
            DefMemPerCPU=128<br>
            <br>
            # Limit<br>
            MaxArraySize=201<br>
            <br>
            # slurmctld<br>
            SlurmctldDebug=5<br>
            SlurmctldLogFile=/var/log/slurm/slurmctld.log<br>
            SlurmctldPidFile=/var/slurm/slurmctld.pid<br>
            SlurmctldPort=6817<br>
            SlurmctldTimeout=60<br>
            SlurmUser=slurm<br>
            <br>
            # slurmd<br>
            SlurmdDebug=5<br>
            SlurmdLogFile=/var/log/slurmd.log<br>
            SlurmdPort=6818<br>
            SlurmdSpoolDir=/var/spool/slurmd<br>
            SlurmdTimeout=300<br>
            <br>
            # REQUEUE<br>
            #RequeueExitHold=1-199,201-255<br>
            #RequeueExit=200<br>
            RequeueExitHold=201-255<br>
            RequeueExit=200<br>
          </font></div>
        <div><font face="monospace"><br>
          </font></div>
        <div><font face="arial, sans-serif">Slurmctld.log </font></div>
        <div><font face="monospace">[2019-10-07T13:38:47.724] debug:
             sched: Running job scheduler<br>
            [2019-10-07T13:38:49.254] error: slurm_auth_get_host: Lookup
            failed: Unknown host<br>
            [2019-10-07T13:38:49.255] sched:
            _slurm_rpc_allocate_resources JobId=19 NodeList=piglet-18
            usec=959<br>
            [2019-10-07T13:38:49.259] debug:  laying out the 1 tasks on
            1 hosts piglet-18 dist 2<br>
            [2019-10-07T13:38:49.260] error: slurm_cred_create: getpwuid
            failed for uid=1000<br>
            [2019-10-07T13:38:49.260] error: slurm_cred_create error<br>
            [2019-10-07T13:38:49.262] _job_complete: JobId=19 WTERMSIG 1<br>
            [2019-10-07T13:38:49.265] _job_complete: JobId=19 done<br>
            [2019-10-07T13:38:49.270] debug:  sched: Running job
            scheduler<br>
            [2019-10-07T13:38:56.823] debug:  sched: Running job
            scheduler<br>
            [2019-10-07T13:39:13.504] debug:  backfill: beginning<br>
            [2019-10-07T13:39:13.504] debug:  backfill: no jobs to
            backfill<br>
            [2019-10-07T13:39:40.871] debug:  Spawning ping agent for
            piglet-19<br>
            [2019-10-07T13:39:43.504] debug:  backfill: beginning<br>
            [2019-10-07T13:39:43.504] debug:  backfill: no jobs to
            backfill<br>
            [2019-10-07T13:39:46.999] error: slurm_auth_get_host: Lookup
            failed: Unknown host<br>
            [2019-10-07T13:39:47.001] sched:
            _slurm_rpc_allocate_resources JobId=20 NodeList=piglet-18
            usec=979<br>
            [2019-10-07T13:39:47.005] debug:  laying out the 1 tasks on
            1 hosts piglet-18 dist 2<br>
            [2019-10-07T13:39:47.144] _job_complete: JobId=20
            WEXITSTATUS 0<br>
            [2019-10-07T13:39:47.147] _job_complete: JobId=20 done<br>
            [2019-10-07T13:39:47.158] debug:  sched: Running job
            scheduler<br>
            [2019-10-07T13:39:48.428] error: slurm_auth_get_host: Lookup
            failed: Unknown host<br>
            [2019-10-07T13:39:48.429] sched:
            _slurm_rpc_allocate_resources JobId=21 NodeList=piglet-18
            usec=1114<br>
            [2019-10-07T13:39:48.434] debug:  laying out the 1 tasks on
            1 hosts piglet-18 dist 2<br>
            [2019-10-07T13:39:48.559] _job_complete: JobId=21
            WEXITSTATUS 0<br>
            [2019-10-07T13:39:48.560] _job_complete: JobId=21 done<br>
          </font></div>
        <div><br>
        </div>
        <div>slurmd.log on piglet-18</div>
        <div><font face="monospace">[2019-10-07T13:38:42.746] debug:
             _rpc_terminate_job, uid = 3001<br>
            [2019-10-07T13:38:42.747] debug:  credential for job 17
            revoked<br>
            [2019-10-07T13:38:47.721] debug:  _rpc_terminate_job, uid =
            3001<br>
            [2019-10-07T13:38:47.722] debug:  credential for job 18
            revoked<br>
            [2019-10-07T13:38:49.267] debug:  _rpc_terminate_job, uid =
            3001<br>
            [2019-10-07T13:38:49.268] debug:  credential for job 19
            revoked<br>
            [2019-10-07T13:39:47.014] launch task 20.0 request from
            UID:0 GID:0 HOST:10.15.2.19 PORT:62137<br>
            [2019-10-07T13:39:47.014] debug:  Checking credential with
            404 bytes of sig data<br>
            [2019-10-07T13:39:47.016] _run_prolog: run job script took
            usec=7<br>
            [2019-10-07T13:39:47.016] _run_prolog: prolog with lock for
            job 20 ran for 0 seconds<br>
            [2019-10-07T13:39:47.026] debug:  AcctGatherEnergy NONE
            plugin loaded<br>
            [2019-10-07T13:39:47.026] debug:  AcctGatherProfile NONE
            plugin loaded<br>
            [2019-10-07T13:39:47.026] debug:  AcctGatherInterconnect
            NONE plugin loaded<br>
            [2019-10-07T13:39:47.026] debug:  AcctGatherFilesystem NONE
            plugin loaded<br>
            [2019-10-07T13:39:47.026] debug:  switch NONE plugin loaded<br>
            [2019-10-07T13:39:47.028] [20.0] debug:  CPUs:28 Boards:1
            Sockets:2 CoresPerSocket:14 ThreadsPerCore:1<br>
            [2019-10-07T13:39:47.028] [20.0] debug:  Job accounting
            gather cgroup plugin loaded<br>
            [2019-10-07T13:39:47.028] [20.0] debug:  cont_id hasn't been
            set yet not running poll<br>
            [2019-10-07T13:39:47.029] [20.0] debug:  Message thread
            started pid = 30331<br>
            [2019-10-07T13:39:47.030] [20.0] debug:  task/cgroup: now
            constraining jobs allocated cores<br>
            [2019-10-07T13:39:47.030] [20.0] debug:  task/cgroup: loaded<br>
            [2019-10-07T13:39:47.030] [20.0] debug:  Checkpoint plugin
            loaded: checkpoint/none<br>
            [2019-10-07T13:39:47.030] [20.0] Munge credential signature
            plugin loaded<br>
            [2019-10-07T13:39:47.031] [20.0] debug:  job_container none
            plugin loaded<br>
            [2019-10-07T13:39:47.031] [20.0] debug:  mpi type = none<br>
            [2019-10-07T13:39:47.031] [20.0] debug:
             xcgroup_instantiate: cgroup '/sys/fs/cgroup/freezer/slurm'
            already exists<br>
            [2019-10-07T13:39:47.031] [20.0] debug:  spank: opening
            plugin stack /etc/slurm/plugstack.conf<br>
            [2019-10-07T13:39:47.031] [20.0] debug:  mpi type = (null)<br>
            [2019-10-07T13:39:47.031] [20.0] debug:  mpi/none:
            slurmstepd prefork<br>
            [2019-10-07T13:39:47.031] [20.0] debug:
             xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuset/slurm'
            already exists<br>
            [2019-10-07T13:39:47.032] [20.0] debug:  task/cgroup: job
            abstract cores are '2'<br>
            [2019-10-07T13:39:47.032] [20.0] debug:  task/cgroup: step
            abstract cores are '2'<br>
            [2019-10-07T13:39:47.032] [20.0] debug:  task/cgroup: job
            physical cores are '4'<br>
            [2019-10-07T13:39:47.032] [20.0] debug:  task/cgroup: step
            physical cores are '4'<br>
            [2019-10-07T13:39:47.065] [20.0] debug level = 2<br>
            [2019-10-07T13:39:47.065] [20.0] starting 1 tasks<br>
            [2019-10-07T13:39:47.066] [20.0] task 0 (30336) started
            2019-10-07T13:39:47<br>
            [2019-10-07T13:39:47.066] [20.0] debug:
             jobacct_gather_cgroup_cpuacct_attach_task: jobid 20 stepid
            0 taskid 0 max_task_id 0<br>
            [2019-10-07T13:39:47.066] [20.0] debug:
             xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuacct/slurm'
            already exists<br>
            [2019-10-07T13:39:47.067] [20.0] debug:
             jobacct_gather_cgroup_memory_attach_task: jobid 20 stepid 0
            taskid 0 max_task_id 0<br>
            [2019-10-07T13:39:47.067] [20.0] debug:
             xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm'
            already exists<br>
            [2019-10-07T13:39:47.068] [20.0] debug:  IO handler started
            pid=30331<br>
            [2019-10-07T13:39:47.099] [20.0] debug:
             jag_common_poll_data: Task 0 pid 30336 ave_freq = 1597534
            mem size/max 0/0 vmem size/max 210853888/210853888, disk
            read size/max (0/0), disk write size/max (0/0), time
            0.000000(0+0) Energy tot/max 0/0 TotPower 0 MaxPower 0
            MinPower 0<br>
            [2019-10-07T13:39:47.101] [20.0] debug:  mpi type = (null)<br>
            [2019-10-07T13:39:47.101] [20.0] debug:  Using mpi/none<br>
            [2019-10-07T13:39:47.102] [20.0] debug:  CPUs:28 Boards:1
            Sockets:2 CoresPerSocket:14 ThreadsPerCore:1<br>
            [2019-10-07T13:39:47.104] [20.0] debug:  Sending launch resp
            rc=0<br>
            [2019-10-07T13:39:47.105] [20.0] task 0 (30336) exited with
            exit code 0.<br>
            [2019-10-07T13:39:47.139] [20.0] debug:
             step_terminate_monitor_stop signaling condition<br>
            [2019-10-07T13:39:47.139] [20.0] debug:  Waiting for IO<br>
            [2019-10-07T13:39:47.140] [20.0] debug:  Closing debug
            channel<br>
            [2019-10-07T13:39:47.140] [20.0] debug:  IO handler exited,
            rc=0<br>
            [2019-10-07T13:39:47.148] [20.0] debug:  Message thread
            exited<br>
            [2019-10-07T13:39:47.149] [20.0] done with job</font><br>
        </div>
        <div><br>
        </div>
        <div>I am not sure what I am missing. I hope someone can point out
          what I am doing wrong here.</div>
        <div>Thank you.</div>
        <div><br>
        </div>
        <div>Best regards,</div>
        <div>Eddy Swan</div>
        <div><br>
        </div>
      </div>
    </blockquote>
    <br>
    <pre cols="72">-- 
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
<a href="mailto:wagner@itc.rwth-aachen.de" target="_blank">wagner@itc.rwth-aachen.de</a>
<a href="http://www.itc.rwth-aachen.de" target="_blank">www.itc.rwth-aachen.de</a>
</pre>
  </div>

</blockquote></div></div>
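<div dir="ltr"><div><br></div><div>Possible follow-up check (a sketch, based on the assumption that the failing lookup happens on the controller): the "slurm_cred_create: getpwuid failed for uid=1000" line appears in slurmctld.log, so it would be slurmctld on slurm-master that cannot resolve uid 1000. If that user only exists locally on the piglet nodes, the two commands below, run on slurm-master, should fail to find the user:</div><div><br></div><div><font face="monospace">$ getent passwd 1000<br>$ id 1000</font></div><div><br></div><div>If they do fail there, making uid 1000 resolvable on slurm-master as well (as a local account or via LDAP/NSS) would be worth trying. The "slurm_auth_get_host: Lookup failed: Unknown host" lines could be checked the same way with "getent hosts piglet-17" on the controller; both look like name-service lookups rather than Slurm configuration issues.</div></div>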