[slurm-users] srun: Error generating job credential
Marcus Wagner
wagner at itc.rwth-aachen.de
Wed Oct 9 06:49:14 UTC 2019
Damn,
I almost always forget that most of the submission handling is done on the
master :/
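(Concretely: when slurmctld generates the job credential it has to resolve the
submitting UID on the controller, which is the "getpwuid failed for uid=1000"
error in the log below. A quick sanity check, using the uid from this thread as
an example, is just the plain NSS lookup on slurm-master:

$ getent passwd 1000
$ id 1000

If those lookups fail on the master while they succeed on the submit and
compute nodes, credential generation fails exactly like this.)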
Best
Marcus
On 10/8/19 11:45 AM, Eddy Swan wrote:
> Hi Sean,
>
> Thank you so much for your additional information.
> The issue was indeed due to the user missing on the head node.
> After I configured the LDAP client on slurm-master, srun now works
> with an LDAP account.
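>
> For anyone hitting the same issue: once an LDAP client is configured on the
> head node, the quick check is that the account resolves through NSS there,
> e.g.
>
> $ getent passwd <ldap_user>
> $ id <ldap_user>
>
> (<ldap_user> is just a placeholder for whichever directory account is used
> to submit; both commands should now return the LDAP entry on slurm-master.)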
>
> Best regards,
> Eddy Swan
>
> On Tue, Oct 8, 2019 at 4:15 PM Sean Crosby <scrosby at unimelb.edu.au> wrote:
>
> Looking at the Slurm code, it appears to be failing in a call to
> getpwuid_r on the ctld.
>
> What is the output of the following (on slurm-master):
>
> getent passwd turing
> getent passwd 1000
>
> Sean
>
>
> --
> Sean Crosby | Senior DevOps/HPC Engineer and HPC Team Lead
> Research Platform Services | Business Services
> CoEPP Research Computing | School of Physics
> The University of Melbourne, Victoria 3010 Australia
>
>
> On Mon, 7 Oct 2019 at 18:36, Eddy Swan <eddys at prestolabs.io> wrote:
>
> Hi Marcus,
>
> piglet-17 as submit host:
> $ id 1000
> uid=1000(turing) gid=1000(turing)
> groups=1000(turing),10(wheel),991(vboxusers)
>
> piglet-18:
> $ id 1000
> uid=1000(turing) gid=1000(turing)
> groups=1000(turing),10(wheel),992(vboxusers)
>
> uid 1000 is a local user on each node (piglet-17~19).
> I also tried to submit as an LDAP user, but still got the same error.
>
> Best regards,
> Eddy Swan
>
> On Mon, Oct 7, 2019 at 2:41 PM Marcus Wagner <wagner at itc.rwth-aachen.de> wrote:
>
> Hi Eddy,
>
> what is the result of "id 1000" on the submithost and on
> piglet-18?
>
> Best
> Marcus
>
> On 10/7/19 8:07 AM, Eddy Swan wrote:
>> Hi All,
>>
>> I am currently testing Slurm version 19.05.3-2 on CentOS 7 with a
>> one-master, three-node configuration.
>> I used the same configuration that works on version 17.02.7, but for
>> some reason it does not work on 19.05.3-2.
>>
>> $ srun hostname
>> srun: error: Unable to create step for job 19: Error
>> generating job credential
>> srun: Force Terminated job 19
>>
>> If I run it as root, it works fine.
>>
>> $ sudo srun hostname
>> piglet-18
>>
>> Configuration:
>> $ cat /etc/slurm/slurm.conf
>> # Common
>> ControlMachine=slurm-master
>> ControlAddr=10.15.131.32
>> ClusterName=slurm-cluster
>> RebootProgram="/usr/sbin/reboot"
>>
>> MailProg=/bin/mail
>> ProctrackType=proctrack/cgroup
>> ReturnToService=2
>> StateSaveLocation=/var/spool/slurmctld
>> TaskPlugin=task/cgroup
>>
>> # LOGGING AND ACCOUNTING
>> AccountingStorageType=accounting_storage/filetxt
>> AccountingStorageLoc=/var/log/slurm_acct/slurm_jobacct.log
>> JobCompLoc=/var/log/slurm_acct/slurm_jobcomp.log
>> JobAcctGatherType=jobacct_gather/cgroup
>>
>> # RESOURCES
>> MemLimitEnforce=no
>>
>> ## Rack 1
>> NodeName=piglet-19 NodeAddr=10.15.2.19 RealMemory=64000 TmpDisk=512000 Sockets=2 CoresPerSocket=28 ThreadsPerCore=1 CPUSpecList=0,1 Weight=2
>> NodeName=piglet-18 NodeAddr=10.15.2.18 RealMemory=128000 TmpDisk=512000 Sockets=2 CoresPerSocket=14 ThreadsPerCore=1 CPUSpecList=0,1 Weight=2
>> NodeName=piglet-17 NodeAddr=10.15.2.17 RealMemory=64000 TmpDisk=512000 Sockets=2 CoresPerSocket=28 ThreadsPerCore=1 CPUSpecList=0,1 Weight=3
>>
>> # Preempt
>> PreemptMode=REQUEUE
>> PreemptType=preempt/qos
>>
>> PartitionName=batch Nodes=ALL MaxTime=2880 OverSubscribe=YES State=UP PreemptMode=REQUEUE PriorityTier=10 Default=YES
>>
>> # TIMERS
>> KillWait=30
>> MinJobAge=300
>> MessageTimeout=3
>>
>> # SCHEDULING
>> FastSchedule=1
>> SchedulerType=sched/backfill
>> SelectType=select/cons_res
>> #SelectTypeParameters=CR_Core_Memory
>> SelectTypeParameters=CR_CPU_Memory
>> DefMemPerCPU=128
>>
>> # Limit
>> MaxArraySize=201
>>
>> # slurmctld
>> SlurmctldDebug=5
>> SlurmctldLogFile=/var/log/slurm/slurmctld.log
>> SlurmctldPidFile=/var/slurm/slurmctld.pid
>> SlurmctldPort=6817
>> SlurmctldTimeout=60
>> SlurmUser=slurm
>>
>> # slurmd
>> SlurmdDebug=5
>> SlurmdLogFile=/var/log/slurmd.log
>> SlurmdPort=6818
>> SlurmdSpoolDir=/var/spool/slurmd
>> SlurmdTimeout=300
>>
>> # REQUEUE
>> #RequeueExitHold=1-199,201-255
>> #RequeueExit=200
>> RequeueExitHold=201-255
>> RequeueExit=200
>>
>> slurmctld.log on slurm-master
>> [2019-10-07T13:38:47.724] debug: sched: Running job
>> scheduler
>> [2019-10-07T13:38:49.254] error: slurm_auth_get_host:
>> Lookup failed: Unknown host
>> [2019-10-07T13:38:49.255] sched:
>> _slurm_rpc_allocate_resources JobId=19 NodeList=piglet-18
>> usec=959
>> [2019-10-07T13:38:49.259] debug: laying out the 1 tasks
>> on 1 hosts piglet-18 dist 2
>> [2019-10-07T13:38:49.260] error: slurm_cred_create:
>> getpwuid failed for uid=1000
>> [2019-10-07T13:38:49.260] error: slurm_cred_create error
>> [2019-10-07T13:38:49.262] _job_complete: JobId=19 WTERMSIG 1
>> [2019-10-07T13:38:49.265] _job_complete: JobId=19 done
>> [2019-10-07T13:38:49.270] debug: sched: Running job
>> scheduler
>> [2019-10-07T13:38:56.823] debug: sched: Running job
>> scheduler
>> [2019-10-07T13:39:13.504] debug: backfill: beginning
>> [2019-10-07T13:39:13.504] debug: backfill: no jobs to
>> backfill
>> [2019-10-07T13:39:40.871] debug: Spawning ping agent for
>> piglet-19
>> [2019-10-07T13:39:43.504] debug: backfill: beginning
>> [2019-10-07T13:39:43.504] debug: backfill: no jobs to
>> backfill
>> [2019-10-07T13:39:46.999] error: slurm_auth_get_host:
>> Lookup failed: Unknown host
>> [2019-10-07T13:39:47.001] sched:
>> _slurm_rpc_allocate_resources JobId=20 NodeList=piglet-18
>> usec=979
>> [2019-10-07T13:39:47.005] debug: laying out the 1 tasks
>> on 1 hosts piglet-18 dist 2
>> [2019-10-07T13:39:47.144] _job_complete: JobId=20
>> WEXITSTATUS 0
>> [2019-10-07T13:39:47.147] _job_complete: JobId=20 done
>> [2019-10-07T13:39:47.158] debug: sched: Running job
>> scheduler
>> [2019-10-07T13:39:48.428] error: slurm_auth_get_host:
>> Lookup failed: Unknown host
>> [2019-10-07T13:39:48.429] sched:
>> _slurm_rpc_allocate_resources JobId=21 NodeList=piglet-18
>> usec=1114
>> [2019-10-07T13:39:48.434] debug: laying out the 1 tasks
>> on 1 hosts piglet-18 dist 2
>> [2019-10-07T13:39:48.559] _job_complete: JobId=21
>> WEXITSTATUS 0
>> [2019-10-07T13:39:48.560] _job_complete: JobId=21 done
>>
>> slurmd.log on piglet-18
>> [2019-10-07T13:38:42.746] debug: _rpc_terminate_job, uid
>> = 3001
>> [2019-10-07T13:38:42.747] debug: credential for job 17
>> revoked
>> [2019-10-07T13:38:47.721] debug: _rpc_terminate_job, uid
>> = 3001
>> [2019-10-07T13:38:47.722] debug: credential for job 18
>> revoked
>> [2019-10-07T13:38:49.267] debug: _rpc_terminate_job, uid
>> = 3001
>> [2019-10-07T13:38:49.268] debug: credential for job 19
>> revoked
>> [2019-10-07T13:39:47.014] launch task 20.0 request from
>> UID:0 GID:0 HOST:10.15.2.19 PORT:62137
>> [2019-10-07T13:39:47.014] debug: Checking credential
>> with 404 bytes of sig data
>> [2019-10-07T13:39:47.016] _run_prolog: run job script
>> took usec=7
>> [2019-10-07T13:39:47.016] _run_prolog: prolog with lock
>> for job 20 ran for 0 seconds
>> [2019-10-07T13:39:47.026] debug: AcctGatherEnergy NONE
>> plugin loaded
>> [2019-10-07T13:39:47.026] debug: AcctGatherProfile NONE
>> plugin loaded
>> [2019-10-07T13:39:47.026] debug: AcctGatherInterconnect
>> NONE plugin loaded
>> [2019-10-07T13:39:47.026] debug: AcctGatherFilesystem
>> NONE plugin loaded
>> [2019-10-07T13:39:47.026] debug: switch NONE plugin loaded
>> [2019-10-07T13:39:47.028] [20.0] debug: CPUs:28 Boards:1
>> Sockets:2 CoresPerSocket:14 ThreadsPerCore:1
>> [2019-10-07T13:39:47.028] [20.0] debug: Job accounting
>> gather cgroup plugin loaded
>> [2019-10-07T13:39:47.028] [20.0] debug: cont_id hasn't
>> been set yet not running poll
>> [2019-10-07T13:39:47.029] [20.0] debug: Message thread
>> started pid = 30331
>> [2019-10-07T13:39:47.030] [20.0] debug: task/cgroup: now
>> constraining jobs allocated cores
>> [2019-10-07T13:39:47.030] [20.0] debug: task/cgroup: loaded
>> [2019-10-07T13:39:47.030] [20.0] debug: Checkpoint
>> plugin loaded: checkpoint/none
>> [2019-10-07T13:39:47.030] [20.0] Munge credential
>> signature plugin loaded
>> [2019-10-07T13:39:47.031] [20.0] debug: job_container
>> none plugin loaded
>> [2019-10-07T13:39:47.031] [20.0] debug: mpi type = none
>> [2019-10-07T13:39:47.031] [20.0] debug:
>> xcgroup_instantiate: cgroup
>> '/sys/fs/cgroup/freezer/slurm' already exists
>> [2019-10-07T13:39:47.031] [20.0] debug: spank: opening
>> plugin stack /etc/slurm/plugstack.conf
>> [2019-10-07T13:39:47.031] [20.0] debug: mpi type = (null)
>> [2019-10-07T13:39:47.031] [20.0] debug: mpi/none:
>> slurmstepd prefork
>> [2019-10-07T13:39:47.031] [20.0] debug:
>> xcgroup_instantiate: cgroup
>> '/sys/fs/cgroup/cpuset/slurm' already exists
>> [2019-10-07T13:39:47.032] [20.0] debug: task/cgroup: job
>> abstract cores are '2'
>> [2019-10-07T13:39:47.032] [20.0] debug: task/cgroup:
>> step abstract cores are '2'
>> [2019-10-07T13:39:47.032] [20.0] debug: task/cgroup: job
>> physical cores are '4'
>> [2019-10-07T13:39:47.032] [20.0] debug: task/cgroup:
>> step physical cores are '4'
>> [2019-10-07T13:39:47.065] [20.0] debug level = 2
>> [2019-10-07T13:39:47.065] [20.0] starting 1 tasks
>> [2019-10-07T13:39:47.066] [20.0] task 0 (30336) started
>> 2019-10-07T13:39:47
>> [2019-10-07T13:39:47.066] [20.0] debug:
>> jobacct_gather_cgroup_cpuacct_attach_task: jobid 20
>> stepid 0 taskid 0 max_task_id 0
>> [2019-10-07T13:39:47.066] [20.0] debug:
>> xcgroup_instantiate: cgroup
>> '/sys/fs/cgroup/cpuacct/slurm' already exists
>> [2019-10-07T13:39:47.067] [20.0] debug:
>> jobacct_gather_cgroup_memory_attach_task: jobid 20
>> stepid 0 taskid 0 max_task_id 0
>> [2019-10-07T13:39:47.067] [20.0] debug:
>> xcgroup_instantiate: cgroup
>> '/sys/fs/cgroup/memory/slurm' already exists
>> [2019-10-07T13:39:47.068] [20.0] debug: IO handler
>> started pid=30331
>> [2019-10-07T13:39:47.099] [20.0] debug:
>> jag_common_poll_data: Task 0 pid 30336 ave_freq =
>> 1597534 mem size/max 0/0 vmem size/max
>> 210853888/210853888, disk read size/max (0/0), disk write
>> size/max (0/0), time 0.000000(0+0) Energy tot/max 0/0
>> TotPower 0 MaxPower 0 MinPower 0
>> [2019-10-07T13:39:47.101] [20.0] debug: mpi type = (null)
>> [2019-10-07T13:39:47.101] [20.0] debug: Using mpi/none
>> [2019-10-07T13:39:47.102] [20.0] debug: CPUs:28 Boards:1
>> Sockets:2 CoresPerSocket:14 ThreadsPerCore:1
>> [2019-10-07T13:39:47.104] [20.0] debug: Sending launch
>> resp rc=0
>> [2019-10-07T13:39:47.105] [20.0] task 0 (30336) exited
>> with exit code 0.
>> [2019-10-07T13:39:47.139] [20.0] debug:
>> step_terminate_monitor_stop signaling condition
>> [2019-10-07T13:39:47.139] [20.0] debug: Waiting for IO
>> [2019-10-07T13:39:47.140] [20.0] debug: Closing debug
>> channel
>> [2019-10-07T13:39:47.140] [20.0] debug: IO handler
>> exited, rc=0
>> [2019-10-07T13:39:47.148] [20.0] debug: Message thread
>> exited
>> [2019-10-07T13:39:47.149] [20.0] done with job
>>
>> I am not sure what I am missing. I hope someone can point out what
>> I am doing wrong here.
>> Thank you.
>>
>> Best regards,
>> Eddy Swan
>>
>
> --
> Marcus Wagner, Dipl.-Inf.
>
> IT Center
> Abteilung: Systeme und Betrieb
> RWTH Aachen University
> Seffenter Weg 23
> 52074 Aachen
> Tel: +49 241 80-24383
> Fax: +49 241 80-624383
> wagner at itc.rwth-aachen.de
> www.itc.rwth-aachen.de
>
--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de