<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
Damn,<br>
<br>
I almost always forget that most of the submission handling is done on
the master :/<br>
<br>
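(A minimal sanity check, assuming passwordless ssh from the submit host to
slurm-master: the uid has to resolve on the host running slurmctld, since
that is where the job credential is built.)<br>
<br>
<font face="monospace"># run from the submit host; empty output means the uid is unknown on the master<br>
ssh slurm-master getent passwd 1000</font><br>
<br>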
Best<br>
Marcus<br>
<br>
<div class="moz-cite-prefix">On 10/8/19 11:45 AM, Eddy Swan wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAD8OGB7jrb-V0h776qbWmcs-95uR2PvcZrBU=F20YaX6FHSB2A@mail.gmail.com">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<div dir="ltr">
<div dir="ltr">Hi Sean,
<div><br>
</div>
<div>Thank you so much for your additional information. </div>
<div>The issue is indeed due to a missing user on the head node.</div>
<div>After I configured the LDAP client on slurm-master, the srun
command now works with an LDAP account.</div>
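<div><br>
</div>
<div>(A rough verification sketch on slurm-master, assuming an sssd-based
LDAP client; &lt;ldap_user&gt; is a placeholder for the actual account:)</div>
<div><font face="monospace"># the LDAP user must resolve through NSS on the master<br>
getent passwd &lt;ldap_user&gt;<br>
# confirm the lookup source and that the client daemon is up<br>
grep ^passwd /etc/nsswitch.conf<br>
systemctl status sssd</font></div>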
<div><br>
</div>
<div>Best regards,</div>
<div>Eddy Swan</div>
<div><br>
</div>
</div>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Tue, Oct 8, 2019 at 4:15
PM Sean Crosby <<a href="mailto:scrosby@unimelb.edu.au"
moz-do-not-send="true">scrosby@unimelb.edu.au</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div>
<div dir="ltr">
<div>Looking at the Slurm code, it looks like it is
failing on a call to <span>
getpwuid_r on the ctld<br>
</span></div>
<div><span><br>
</span></div>
<div><span>What is the output of the following (on <font face="monospace">slurm-master</font>):<br>
</span></div>
<div><span><br>
</span></div>
<div><span><font face="monospace">getent passwd turing</font></span></div>
<div><span><font face="monospace">getent passwd 1000</font><br>
</span></div>
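<div><span><br>
</span></div>
<div>(For reference, on a host where the lookup works, both commands return
the same passwd entry, along the lines of the one below; the home directory
and shell are illustrative. Empty output on slurm-master would explain the
getpwuid_r failure.)</div>
<div><font face="monospace">turing:x:1000:1000::/home/turing:/bin/bash</font></div>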
<div><span><br>
</span></div>
<div><span>Sean<br>
</span></div>
<div><span><br>
</span></div>
<div>
<div>
<div dir="ltr">
<div dir="ltr">
<div>
<div dir="ltr">
<div>
<div dir="ltr">
<div>
<div dir="ltr">
<div>
<div dir="ltr">
<div>
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr"><br>
</div>
<div dir="ltr">--</div>
Sean Crosby | <span>Senior
DevOpsHPC Engineer and
HPC Team Lead</span><br>
Research Platform
Services | Business
Services<br>
CoEPP Research Computing
| School of Physics<br>
The University of
Melbourne, Victoria 3010
Australia<br>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Mon, 7 Oct 2019 at
18:36, Eddy Swan <<a
href="mailto:eddys@prestolabs.io" target="_blank"
moz-do-not-send="true">eddys@prestolabs.io</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px
0px 0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div dir="ltr">Hi Marcus,
<div><br>
</div>
<div>piglet-17 as submit host:</div>
<div><font face="monospace">$ id 1000<br>
uid=1000(turing) gid=1000(turing)
groups=1000(turing),10(wheel),991(vboxusers)</font><br>
</div>
<div><br>
</div>
<div>piglet-18:</div>
<div><font face="monospace">$ id 1000<br>
uid=1000(turing) gid=1000(turing)
groups=1000(turing),10(wheel),992(vboxusers)</font><br>
</div>
<div><br>
</div>
<div>uid 1000 is a local user on each node
(piglet-17~19).</div>
<div>I also tried to submit as an LDAP user, but
still got the same error.</div>
<div><br>
</div>
<div>Best regards,</div>
<div>Eddy Swan</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Mon, Oct 7,
2019 at 2:41 PM Marcus Wagner <<a
href="mailto:wagner@itc.rwth-aachen.de"
target="_blank" moz-do-not-send="true">wagner@itc.rwth-aachen.de</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px
0px 0px 0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">Hi Eddy,<br>
<br>
what is the result of "id 1000" on the
submit host and on piglet-18?<br>
<br>
Best<br>
Marcus<br>
<br>
<div>On 10/7/19 8:07 AM, Eddy Swan wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">Hi All,
<div><br>
</div>
<div>I am currently testing Slurm version
19.05.3-2 on CentOS 7 with one master
and three nodes.</div>
<div>I used the same configuration that
works on version 17.02.7, but for some
reason it does not work on 19.05.3-2.<br>
</div>
<div><br>
</div>
<div><font face="monospace">$ srun
hostname<br>
srun: error: Unable to create step for
job 19: Error generating job
credential<br>
srun: Force Terminated job 19</font><br>
</div>
<div><br>
</div>
<div>If I run it as root, it works fine.</div>
<div><br>
</div>
<div><font face="monospace">$ sudo srun
hostname<br>
piglet-18</font><br>
</div>
<div><br>
</div>
<div>Configuration:</div>
<div><font face="monospace">$ cat
/etc/slurm/slurm.conf<br>
# Common<br>
ControlMachine=slurm-master<br>
ControlAddr=10.15.131.32<br>
ClusterName=slurm-cluster<br>
RebootProgram="/usr/sbin/reboot"<br>
<br>
MailProg=/bin/mail<br>
ProctrackType=proctrack/cgroup<br>
ReturnToService=2<br>
StateSaveLocation=/var/spool/slurmctld<br>
TaskPlugin=task/cgroup<br>
<br>
# LOGGING AND ACCOUNTING<br>
AccountingStorageType=accounting_storage/filetxt<br>
AccountingStorageLoc=/var/log/slurm_acct/slurm_jobacct.log<br>
JobCompLoc=/var/log/slurm_acct/slurm_jobcomp.log<br>
JobAcctGatherType=jobacct_gather/cgroup<br>
<br>
# RESOURCES<br>
MemLimitEnforce=no<br>
<br>
## Rack 1<br>
NodeName=piglet-19 NodeAddr=10.15.2.19
RealMemory=64000 TmpDisk=512000
Sockets=2 CoresPerSocket=28
ThreadsPerCore=1 CPUSpecList=0,1
Weight=2<br>
NodeName=piglet-18 NodeAddr=10.15.2.18
RealMemory=128000 TmpDisk=512000
Sockets=2 CoresPerSocket=14
ThreadsPerCore=1 CPUSpecList=0,1
Weight=2<br>
NodeName=piglet-17 NodeAddr=10.15.2.17
RealMemory=64000 TmpDisk=512000
Sockets=2 CoresPerSocket=28
ThreadsPerCore=1 CPUSpecList=0,1
Weight=3<br>
<br>
# Preempt<br>
PreemptMode=REQUEUE<br>
PreemptType=preempt/qos<br>
<br>
PartitionName=batch Nodes=ALL
MaxTime=2880 OverSubscribe=YES
State=UP PreemptMode=REQUEUE
PriorityTier=10 Default=YES<br>
<br>
# TIMERS<br>
KillWait=30<br>
MinJobAge=300<br>
MessageTimeout=3<br>
<br>
# SCHEDULING<br>
FastSchedule=1<br>
SchedulerType=sched/backfill<br>
SelectType=select/cons_res<br>
#SelectTypeParameters=CR_Core_Memory<br>
SelectTypeParameters=CR_CPU_Memory<br>
DefMemPerCPU=128<br>
<br>
# Limit<br>
MaxArraySize=201<br>
<br>
# slurmctld<br>
SlurmctldDebug=5<br>
SlurmctldLogFile=/var/log/slurm/slurmctld.log<br>
SlurmctldPidFile=/var/slurm/slurmctld.pid<br>
SlurmctldPort=6817<br>
SlurmctldTimeout=60<br>
SlurmUser=slurm<br>
<br>
# slurmd<br>
SlurmdDebug=5<br>
SlurmdLogFile=/var/log/slurmd.log<br>
SlurmdPort=6818<br>
SlurmdSpoolDir=/var/spool/slurmd<br>
SlurmdTimeout=300<br>
<br>
# REQUEUE<br>
#RequeueExitHold=1-199,201-255<br>
#RequeueExit=200<br>
RequeueExitHold=201-255<br>
RequeueExit=200<br>
</font></div>
<div><font face="monospace"><br>
</font></div>
<div><font face="arial, sans-serif">Slurmctld.log </font></div>
<div><font face="monospace">[2019-10-07T13:38:47.724]
debug: sched: Running job scheduler<br>
[2019-10-07T13:38:49.254] error:
slurm_auth_get_host: Lookup failed:
Unknown host<br>
[2019-10-07T13:38:49.255] sched:
_slurm_rpc_allocate_resources JobId=19
NodeList=piglet-18 usec=959<br>
[2019-10-07T13:38:49.259] debug:
laying out the 1 tasks on 1 hosts
piglet-18 dist 2<br>
[2019-10-07T13:38:49.260] error:
slurm_cred_create: getpwuid failed for
uid=1000<br>
[2019-10-07T13:38:49.260] error:
slurm_cred_create error<br>
[2019-10-07T13:38:49.262]
_job_complete: JobId=19 WTERMSIG 1<br>
[2019-10-07T13:38:49.265]
_job_complete: JobId=19 done<br>
[2019-10-07T13:38:49.270] debug:
sched: Running job scheduler<br>
[2019-10-07T13:38:56.823] debug:
sched: Running job scheduler<br>
[2019-10-07T13:39:13.504] debug:
backfill: beginning<br>
[2019-10-07T13:39:13.504] debug:
backfill: no jobs to backfill<br>
[2019-10-07T13:39:40.871] debug:
Spawning ping agent for piglet-19<br>
[2019-10-07T13:39:43.504] debug:
backfill: beginning<br>
[2019-10-07T13:39:43.504] debug:
backfill: no jobs to backfill<br>
[2019-10-07T13:39:46.999] error:
slurm_auth_get_host: Lookup failed:
Unknown host<br>
[2019-10-07T13:39:47.001] sched:
_slurm_rpc_allocate_resources JobId=20
NodeList=piglet-18 usec=979<br>
[2019-10-07T13:39:47.005] debug:
laying out the 1 tasks on 1 hosts
piglet-18 dist 2<br>
[2019-10-07T13:39:47.144]
_job_complete: JobId=20 WEXITSTATUS 0<br>
[2019-10-07T13:39:47.147]
_job_complete: JobId=20 done<br>
[2019-10-07T13:39:47.158] debug:
sched: Running job scheduler<br>
[2019-10-07T13:39:48.428] error:
slurm_auth_get_host: Lookup failed:
Unknown host<br>
[2019-10-07T13:39:48.429] sched:
_slurm_rpc_allocate_resources JobId=21
NodeList=piglet-18 usec=1114<br>
[2019-10-07T13:39:48.434] debug:
laying out the 1 tasks on 1 hosts
piglet-18 dist 2<br>
[2019-10-07T13:39:48.559]
_job_complete: JobId=21 WEXITSTATUS 0<br>
[2019-10-07T13:39:48.560]
_job_complete: JobId=21 done<br>
</font></div>
<div><br>
</div>
<div>slurmd.log on piglet-18:</div>
<div><font face="monospace">[2019-10-07T13:38:42.746]
debug: _rpc_terminate_job, uid = 3001<br>
[2019-10-07T13:38:42.747] debug:
credential for job 17 revoked<br>
[2019-10-07T13:38:47.721] debug:
_rpc_terminate_job, uid = 3001<br>
[2019-10-07T13:38:47.722] debug:
credential for job 18 revoked<br>
[2019-10-07T13:38:49.267] debug:
_rpc_terminate_job, uid = 3001<br>
[2019-10-07T13:38:49.268] debug:
credential for job 19 revoked<br>
[2019-10-07T13:39:47.014] launch task
20.0 request from UID:0 GID:0
HOST:10.15.2.19 PORT:62137<br>
[2019-10-07T13:39:47.014] debug:
Checking credential with 404 bytes of
sig data<br>
[2019-10-07T13:39:47.016] _run_prolog:
run job script took usec=7<br>
[2019-10-07T13:39:47.016] _run_prolog:
prolog with lock for job 20 ran for 0
seconds<br>
[2019-10-07T13:39:47.026] debug:
AcctGatherEnergy NONE plugin loaded<br>
[2019-10-07T13:39:47.026] debug:
AcctGatherProfile NONE plugin loaded<br>
[2019-10-07T13:39:47.026] debug:
AcctGatherInterconnect NONE plugin
loaded<br>
[2019-10-07T13:39:47.026] debug:
AcctGatherFilesystem NONE plugin
loaded<br>
[2019-10-07T13:39:47.026] debug:
switch NONE plugin loaded<br>
[2019-10-07T13:39:47.028] [20.0]
debug: CPUs:28 Boards:1 Sockets:2
CoresPerSocket:14 ThreadsPerCore:1<br>
[2019-10-07T13:39:47.028] [20.0]
debug: Job accounting gather cgroup
plugin loaded<br>
[2019-10-07T13:39:47.028] [20.0]
debug: cont_id hasn't been set yet
not running poll<br>
[2019-10-07T13:39:47.029] [20.0]
debug: Message thread started pid =
30331<br>
[2019-10-07T13:39:47.030] [20.0]
debug: task/cgroup: now constraining
jobs allocated cores<br>
[2019-10-07T13:39:47.030] [20.0]
debug: task/cgroup: loaded<br>
[2019-10-07T13:39:47.030] [20.0]
debug: Checkpoint plugin loaded:
checkpoint/none<br>
[2019-10-07T13:39:47.030] [20.0] Munge
credential signature plugin loaded<br>
[2019-10-07T13:39:47.031] [20.0]
debug: job_container none plugin
loaded<br>
[2019-10-07T13:39:47.031] [20.0]
debug: mpi type = none<br>
[2019-10-07T13:39:47.031] [20.0]
debug: xcgroup_instantiate: cgroup
'/sys/fs/cgroup/freezer/slurm' already
exists<br>
[2019-10-07T13:39:47.031] [20.0]
debug: spank: opening plugin stack
/etc/slurm/plugstack.conf<br>
[2019-10-07T13:39:47.031] [20.0]
debug: mpi type = (null)<br>
[2019-10-07T13:39:47.031] [20.0]
debug: mpi/none: slurmstepd prefork<br>
[2019-10-07T13:39:47.031] [20.0]
debug: xcgroup_instantiate: cgroup
'/sys/fs/cgroup/cpuset/slurm' already
exists<br>
[2019-10-07T13:39:47.032] [20.0]
debug: task/cgroup: job abstract
cores are '2'<br>
[2019-10-07T13:39:47.032] [20.0]
debug: task/cgroup: step abstract
cores are '2'<br>
[2019-10-07T13:39:47.032] [20.0]
debug: task/cgroup: job physical
cores are '4'<br>
[2019-10-07T13:39:47.032] [20.0]
debug: task/cgroup: step physical
cores are '4'<br>
[2019-10-07T13:39:47.065] [20.0] debug
level = 2<br>
[2019-10-07T13:39:47.065] [20.0]
starting 1 tasks<br>
[2019-10-07T13:39:47.066] [20.0] task
0 (30336) started 2019-10-07T13:39:47<br>
[2019-10-07T13:39:47.066] [20.0]
debug:
jobacct_gather_cgroup_cpuacct_attach_task:
jobid 20 stepid 0 taskid 0 max_task_id
0<br>
[2019-10-07T13:39:47.066] [20.0]
debug: xcgroup_instantiate: cgroup
'/sys/fs/cgroup/cpuacct/slurm' already
exists<br>
[2019-10-07T13:39:47.067] [20.0]
debug:
jobacct_gather_cgroup_memory_attach_task:
jobid 20 stepid 0 taskid 0 max_task_id
0<br>
[2019-10-07T13:39:47.067] [20.0]
debug: xcgroup_instantiate: cgroup
'/sys/fs/cgroup/memory/slurm' already
exists<br>
[2019-10-07T13:39:47.068] [20.0]
debug: IO handler started pid=30331<br>
[2019-10-07T13:39:47.099] [20.0]
debug: jag_common_poll_data: Task 0
pid 30336 ave_freq = 1597534 mem
size/max 0/0 vmem size/max
210853888/210853888, disk read
size/max (0/0), disk write size/max
(0/0), time 0.000000(0+0) Energy
tot/max 0/0 TotPower 0 MaxPower 0
MinPower 0<br>
[2019-10-07T13:39:47.101] [20.0]
debug: mpi type = (null)<br>
[2019-10-07T13:39:47.101] [20.0]
debug: Using mpi/none<br>
[2019-10-07T13:39:47.102] [20.0]
debug: CPUs:28 Boards:1 Sockets:2
CoresPerSocket:14 ThreadsPerCore:1<br>
[2019-10-07T13:39:47.104] [20.0]
debug: Sending launch resp rc=0<br>
[2019-10-07T13:39:47.105] [20.0] task
0 (30336) exited with exit code 0.<br>
[2019-10-07T13:39:47.139] [20.0]
debug: step_terminate_monitor_stop
signaling condition<br>
[2019-10-07T13:39:47.139] [20.0]
debug: Waiting for IO<br>
[2019-10-07T13:39:47.140] [20.0]
debug: Closing debug channel<br>
[2019-10-07T13:39:47.140] [20.0]
debug: IO handler exited, rc=0<br>
[2019-10-07T13:39:47.148] [20.0]
debug: Message thread exited<br>
[2019-10-07T13:39:47.149] [20.0] done
with job</font><br>
</div>
<div><br>
</div>
<div>I am not sure what I am missing. I hope
someone can point out what I am doing
wrong here.</div>
<div>Thank you.</div>
<div><br>
</div>
<div>Best regards,</div>
<div>Eddy Swan</div>
<div><br>
</div>
</div>
</blockquote>
<br>
<pre cols="72">--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
<a href="mailto:wagner@itc.rwth-aachen.de" target="_blank" moz-do-not-send="true">wagner@itc.rwth-aachen.de</a>
<a href="http://www.itc.rwth-aachen.de" target="_blank" moz-do-not-send="true">www.itc.rwth-aachen.de</a>
</pre>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
<a class="moz-txt-link-abbreviated" href="mailto:wagner@itc.rwth-aachen.de">wagner@itc.rwth-aachen.de</a>
<a class="moz-txt-link-abbreviated" href="http://www.itc.rwth-aachen.de">www.itc.rwth-aachen.de</a>
</pre>
</body>
</html>