[slurm-users] srun: Error generating job credential
Marcus Wagner
wagner at itc.rwth-aachen.de
Wed Oct 9 06:49:14 UTC 2019
Damn,
I almost always forget that most of the submission handling is done on the
master :/
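(Concretely: when slurmctld generates the job credential it has to resolve the
submitting UID on the controller, which is the "getpwuid failed for uid=1000"
error in the log below. A quick sanity check, using the uid from this thread as
an example, is just the plain NSS lookup on slurm-master:

$ getent passwd 1000
$ id 1000

If those lookups fail on the master while they succeed on the submit and
compute nodes, credential generation fails exactly like this.)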
Best
Marcus
On 10/8/19 11:45 AM, Eddy Swan wrote:
> Hi Sean,
>
> Thank you so much for your additional information.
> The issue was indeed due to the user missing on the head node.
> After I configured the LDAP client on slurm-master, srun now works
> with an LDAP account.
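>
> For anyone hitting the same issue: once an LDAP client is configured on the
> head node, the quick check is that the account resolves through NSS there,
> e.g.
>
> $ getent passwd <ldap_user>
> $ id <ldap_user>
>
> (<ldap_user> is just a placeholder for whichever directory account is used
> to submit; both commands should now return the LDAP entry on slurm-master.)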
>
> Best regards,
> Eddy Swan
>
> On Tue, Oct 8, 2019 at 4:15 PM Sean Crosby <scrosby at unimelb.edu.au> wrote:
>
> Looking at the Slurm code, it appears to be failing in a call to
> getpwuid_r on the ctld.
>
> What is the output of the following (on slurm-master):
>
> getent passwd turing
> getent passwd 1000
>
> Sean
>
>
> --
> Sean Crosby | Senior DevOps/HPC Engineer and HPC Team Lead
> Research Platform Services | Business Services
> CoEPP Research Computing | School of Physics
> The University of Melbourne, Victoria 3010 Australia
>
>
> On Mon, 7 Oct 2019 at 18:36, Eddy Swan <eddys at prestolabs.io> wrote:
>
> Hi Marcus,
>
> piglet-17 as submit host:
> $ id 1000
> uid=1000(turing) gid=1000(turing)
> groups=1000(turing),10(wheel),991(vboxusers)
>
> piglet-18:
> $ id 1000
> uid=1000(turing) gid=1000(turing)
> groups=1000(turing),10(wheel),992(vboxusers)
>
> uid 1000 is a local user on each node (piglet-17~19).
> I also tried to submit as an LDAP user, but still got the same error.
>
> Best regards,
> Eddy Swan
>
> On Mon, Oct 7, 2019 at 2:41 PM Marcus Wagner <wagner at itc.rwth-aachen.de> wrote:
>
> Hi Eddy,
>
> what is the result of "id 1000" on the submithost and on
> piglet-18?
>
> Best
> Marcus
>
> On 10/7/19 8:07 AM, Eddy Swan wrote:
>> Hi All,
>>
>> I am currently testing Slurm version 19.05.3-2 on CentOS 7 with a
>> one-master, three-node configuration.
>> I used the same configuration that works on version 17.02.7, but for
>> some reason it does not work on 19.05.3-2.
>>
>> $ srun hostname
>> srun: error: Unable to create step for job 19: Error
>> generating job credential
>> srun: Force Terminated job 19
>>
>> If I run it as root, it works fine.
>>
>> $ sudo srun hostname
>> piglet-18
>>
>> Configuration:
>> $ cat /etc/slurm/slurm.conf
>> # Common
>> ControlMachine=slurm-master
>> ControlAddr=10.15.131.32
>> ClusterName=slurm-cluster
>> RebootProgram="/usr/sbin/reboot"
>>
>> MailProg=/bin/mail
>> ProctrackType=proctrack/cgroup
>> ReturnToService=2
>> StateSaveLocation=/var/spool/slurmctld
>> TaskPlugin=task/cgroup
>>
>> # LOGGING AND ACCOUNTING
>> AccountingStorageType=accounting_storage/filetxt
>> AccountingStorageLoc=/var/log/slurm_acct/slurm_jobacct.log
>> JobCompLoc=/var/log/slurm_acct/slurm_jobcomp.log
>> JobAcctGatherType=jobacct_gather/cgroup
>>
>> # RESOURCES
>> MemLimitEnforce=no
>>
>> ## Rack 1
>> NodeName=piglet-19 NodeAddr=10.15.2.19 RealMemory=64000 TmpDisk=512000 Sockets=2 CoresPerSocket=28 ThreadsPerCore=1 CPUSpecList=0,1 Weight=2
>> NodeName=piglet-18 NodeAddr=10.15.2.18 RealMemory=128000 TmpDisk=512000 Sockets=2 CoresPerSocket=14 ThreadsPerCore=1 CPUSpecList=0,1 Weight=2
>> NodeName=piglet-17 NodeAddr=10.15.2.17 RealMemory=64000 TmpDisk=512000 Sockets=2 CoresPerSocket=28 ThreadsPerCore=1 CPUSpecList=0,1 Weight=3
>>
>> # Preempt
>> PreemptMode=REQUEUE
>> PreemptType=preempt/qos
>>
>> PartitionName=batch Nodes=ALL MaxTime=2880 OverSubscribe=YES State=UP PreemptMode=REQUEUE PriorityTier=10 Default=YES
>>
>> # TIMERS
>> KillWait=30
>> MinJobAge=300
>> MessageTimeout=3
>>
>> # SCHEDULING
>> FastSchedule=1
>> SchedulerType=sched/backfill
>> SelectType=select/cons_res
>> #SelectTypeParameters=CR_Core_Memory
>> SelectTypeParameters=CR_CPU_Memory
>> DefMemPerCPU=128
>>
>> # Limit
>> MaxArraySize=201
>>
>> # slurmctld
>> SlurmctldDebug=5
>> SlurmctldLogFile=/var/log/slurm/slurmctld.log
>> SlurmctldPidFile=/var/slurm/slurmctld.pid
>> SlurmctldPort=6817
>> SlurmctldTimeout=60
>> SlurmUser=slurm
>>
>> # slurmd
>> SlurmdDebug=5
>> SlurmdLogFile=/var/log/slurmd.log
>> SlurmdPort=6818
>> SlurmdSpoolDir=/var/spool/slurmd
>> SlurmdTimeout=300
>>
>> # REQUEUE
>> #RequeueExitHold=1-199,201-255
>> #RequeueExit=200
>> RequeueExitHold=201-255
>> RequeueExit=200
>>
>> slurmctld.log on slurm-master
>> [2019-10-07T13:38:47.724] debug: sched: Running job
>> scheduler
>> [2019-10-07T13:38:49.254] error: slurm_auth_get_host:
>> Lookup failed: Unknown host
>> [2019-10-07T13:38:49.255] sched:
>> _slurm_rpc_allocate_resources JobId=19 NodeList=piglet-18
>> usec=959
>> [2019-10-07T13:38:49.259] debug: laying out the 1 tasks
>> on 1 hosts piglet-18 dist 2
>> [2019-10-07T13:38:49.260] error: slurm_cred_create:
>> getpwuid failed for uid=1000
>> [2019-10-07T13:38:49.260] error: slurm_cred_create error
>> [2019-10-07T13:38:49.262] _job_complete: JobId=19 WTERMSIG 1
>> [2019-10-07T13:38:49.265] _job_complete: JobId=19 done
>> [2019-10-07T13:38:49.270] debug: sched: Running job
>> scheduler
>> [2019-10-07T13:38:56.823] debug: sched: Running job
>> scheduler
>> [2019-10-07T13:39:13.504] debug: backfill: beginning
>> [2019-10-07T13:39:13.504] debug: backfill: no jobs to
>> backfill
>> [2019-10-07T13:39:40.871] debug: Spawning ping agent for
>> piglet-19
>> [2019-10-07T13:39:43.504] debug: backfill: beginning
>> [2019-10-07T13:39:43.504] debug: backfill: no jobs to
>> backfill
>> [2019-10-07T13:39:46.999] error: slurm_auth_get_host:
>> Lookup failed: Unknown host
>> [2019-10-07T13:39:47.001] sched:
>> _slurm_rpc_allocate_resources JobId=20 NodeList=piglet-18
>> usec=979
>> [2019-10-07T13:39:47.005] debug: laying out the 1 tasks
>> on 1 hosts piglet-18 dist 2
>> [2019-10-07T13:39:47.144] _job_complete: JobId=20
>> WEXITSTATUS 0
>> [2019-10-07T13:39:47.147] _job_complete: JobId=20 done
>> [2019-10-07T13:39:47.158] debug: sched: Running job
>> scheduler
>> [2019-10-07T13:39:48.428] error: slurm_auth_get_host:
>> Lookup failed: Unknown host
>> [2019-10-07T13:39:48.429] sched:
>> _slurm_rpc_allocate_resources JobId=21 NodeList=piglet-18
>> usec=1114
>> [2019-10-07T13:39:48.434] debug: laying out the 1 tasks
>> on 1 hosts piglet-18 dist 2
>> [2019-10-07T13:39:48.559] _job_complete: JobId=21
>> WEXITSTATUS 0
>> [2019-10-07T13:39:48.560] _job_complete: JobId=21 done
>>
>> slurmd.log on piglet-18
>> [2019-10-07T13:38:42.746] debug: _rpc_terminate_job, uid
>> = 3001
>> [2019-10-07T13:38:42.747] debug: credential for job 17
>> revoked
>> [2019-10-07T13:38:47.721] debug: _rpc_terminate_job, uid
>> = 3001
>> [2019-10-07T13:38:47.722] debug: credential for job 18
>> revoked
>> [2019-10-07T13:38:49.267] debug: _rpc_terminate_job, uid
>> = 3001
>> [2019-10-07T13:38:49.268] debug: credential for job 19
>> revoked
>> [2019-10-07T13:39:47.014] launch task 20.0 request from
>> UID:0 GID:0 HOST:10.15.2.19 PORT:62137
>> [2019-10-07T13:39:47.014] debug: Checking credential
>> with 404 bytes of sig data
>> [2019-10-07T13:39:47.016] _run_prolog: run job script
>> took usec=7
>> [2019-10-07T13:39:47.016] _run_prolog: prolog with lock
>> for job 20 ran for 0 seconds
>> [2019-10-07T13:39:47.026] debug: AcctGatherEnergy NONE
>> plugin loaded
>> [2019-10-07T13:39:47.026] debug: AcctGatherProfile NONE
>> plugin loaded
>> [2019-10-07T13:39:47.026] debug: AcctGatherInterconnect
>> NONE plugin loaded
>> [2019-10-07T13:39:47.026] debug: AcctGatherFilesystem
>> NONE plugin loaded
>> [2019-10-07T13:39:47.026] debug: switch NONE plugin loaded
>> [2019-10-07T13:39:47.028] [20.0] debug: CPUs:28 Boards:1
>> Sockets:2 CoresPerSocket:14 ThreadsPerCore:1
>> [2019-10-07T13:39:47.028] [20.0] debug: Job accounting
>> gather cgroup plugin loaded
>> [2019-10-07T13:39:47.028] [20.0] debug: cont_id hasn't
>> been set yet not running poll
>> [2019-10-07T13:39:47.029] [20.0] debug: Message thread
>> started pid = 30331
>> [2019-10-07T13:39:47.030] [20.0] debug: task/cgroup: now
>> constraining jobs allocated cores
>> [2019-10-07T13:39:47.030] [20.0] debug: task/cgroup: loaded
>> [2019-10-07T13:39:47.030] [20.0] debug: Checkpoint
>> plugin loaded: checkpoint/none
>> [2019-10-07T13:39:47.030] [20.0] Munge credential
>> signature plugin loaded
>> [2019-10-07T13:39:47.031] [20.0] debug: job_container
>> none plugin loaded
>> [2019-10-07T13:39:47.031] [20.0] debug: mpi type = none
>> [2019-10-07T13:39:47.031] [20.0] debug:
>> xcgroup_instantiate: cgroup
>> '/sys/fs/cgroup/freezer/slurm' already exists
>> [2019-10-07T13:39:47.031] [20.0] debug: spank: opening
>> plugin stack /etc/slurm/plugstack.conf
>> [2019-10-07T13:39:47.031] [20.0] debug: mpi type = (null)
>> [2019-10-07T13:39:47.031] [20.0] debug: mpi/none:
>> slurmstepd prefork
>> [2019-10-07T13:39:47.031] [20.0] debug:
>> xcgroup_instantiate: cgroup
>> '/sys/fs/cgroup/cpuset/slurm' already exists
>> [2019-10-07T13:39:47.032] [20.0] debug: task/cgroup: job
>> abstract cores are '2'
>> [2019-10-07T13:39:47.032] [20.0] debug: task/cgroup:
>> step abstract cores are '2'
>> [2019-10-07T13:39:47.032] [20.0] debug: task/cgroup: job
>> physical cores are '4'
>> [2019-10-07T13:39:47.032] [20.0] debug: task/cgroup:
>> step physical cores are '4'
>> [2019-10-07T13:39:47.065] [20.0] debug level = 2
>> [2019-10-07T13:39:47.065] [20.0] starting 1 tasks
>> [2019-10-07T13:39:47.066] [20.0] task 0 (30336) started
>> 2019-10-07T13:39:47
>> [2019-10-07T13:39:47.066] [20.0] debug:
>> jobacct_gather_cgroup_cpuacct_attach_task: jobid 20
>> stepid 0 taskid 0 max_task_id 0
>> [2019-10-07T13:39:47.066] [20.0] debug:
>> xcgroup_instantiate: cgroup
>> '/sys/fs/cgroup/cpuacct/slurm' already exists
>> [2019-10-07T13:39:47.067] [20.0] debug:
>> jobacct_gather_cgroup_memory_attach_task: jobid 20
>> stepid 0 taskid 0 max_task_id 0
>> [2019-10-07T13:39:47.067] [20.0] debug:
>> xcgroup_instantiate: cgroup
>> '/sys/fs/cgroup/memory/slurm' already exists
>> [2019-10-07T13:39:47.068] [20.0] debug: IO handler
>> started pid=30331
>> [2019-10-07T13:39:47.099] [20.0] debug:
>> jag_common_poll_data: Task 0 pid 30336 ave_freq =
>> 1597534 mem size/max 0/0 vmem size/max
>> 210853888/210853888, disk read size/max (0/0), disk write
>> size/max (0/0), time 0.000000(0+0) Energy tot/max 0/0
>> TotPower 0 MaxPower 0 MinPower 0
>> [2019-10-07T13:39:47.101] [20.0] debug: mpi type = (null)
>> [2019-10-07T13:39:47.101] [20.0] debug: Using mpi/none
>> [2019-10-07T13:39:47.102] [20.0] debug: CPUs:28 Boards:1
>> Sockets:2 CoresPerSocket:14 ThreadsPerCore:1
>> [2019-10-07T13:39:47.104] [20.0] debug: Sending launch
>> resp rc=0
>> [2019-10-07T13:39:47.105] [20.0] task 0 (30336) exited
>> with exit code 0.
>> [2019-10-07T13:39:47.139] [20.0] debug:
>> step_terminate_monitor_stop signaling condition
>> [2019-10-07T13:39:47.139] [20.0] debug: Waiting for IO
>> [2019-10-07T13:39:47.140] [20.0] debug: Closing debug
>> channel
>> [2019-10-07T13:39:47.140] [20.0] debug: IO handler
>> exited, rc=0
>> [2019-10-07T13:39:47.148] [20.0] debug: Message thread
>> exited
>> [2019-10-07T13:39:47.149] [20.0] done with job
>>
>> I am not sure what I am missing. I hope someone can point out what
>> I am doing wrong here.
>> Thank you.
>>
>> Best regards,
>> Eddy Swan
>>
>
> --
> Marcus Wagner, Dipl.-Inf.
>
> IT Center
> Abteilung: Systeme und Betrieb
> RWTH Aachen University
> Seffenter Weg 23
> 52074 Aachen
> Tel: +49 241 80-24383
> Fax: +49 241 80-624383
> wagner at itc.rwth-aachen.de
> www.itc.rwth-aachen.de
>
--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de