[slurm-users] srun: Error generating job credential
Marcus Wagner
wagner at itc.rwth-aachen.de
Mon Oct 7 06:33:43 UTC 2019
Hi Eddy,
What is the result of "id 1000" on the submit host and on piglet-18?
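
A minimal sketch of that check, assuming a POSIX shell and that `getent` is available on each host. The helper name `check_uid` is hypothetical and not part of Slurm. The "getpwuid failed for uid=1000" line quoted below is logged by slurmctld, so it may be worth running the same check on the controller host as well.

```shell
#!/bin/sh
# check_uid UID
# Hypothetical helper: reports whether the local name service
# (files, LDAP, SSSD, ...) can map a numeric UID to a passwd entry.
check_uid() {
    uid="$1"
    if entry=$(getent passwd "$uid"); then
        # The first colon-separated field of a passwd entry is the login name.
        printf 'uid %s resolves to %s\n' "$uid" "${entry%%:*}"
    else
        printf 'uid %s does NOT resolve on %s\n' "$uid" "$(uname -n)"
        return 1
    fi
}

# Example (run on the submit host, on piglet-18, and on the controller):
#   check_uid 1000
```

If the UID resolves on some hosts but not on others, the user database is probably not consistent across the cluster.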
Best
Marcus
On 10/7/19 8:07 AM, Eddy Swan wrote:
> Hi All,
>
> I am currently testing Slurm version 19.05.3-2 on CentOS 7 with a
> one-master, three-node configuration.
> I used the same configuration that works on version 17.02.7, but for
> some reason it does not work on 19.05.3-2.
>
> $ srun hostname
> srun: error: Unable to create step for job 19: Error generating job
> credential
> srun: Force Terminated job 19
>
> If I run it as root, it works fine.
>
> $ sudo srun hostname
> piglet-18
>
> Configuration:
> $ cat /etc/slurm/slurm.conf
> # Common
> ControlMachine=slurm-master
> ControlAddr=10.15.131.32
> ClusterName=slurm-cluster
> RebootProgram="/usr/sbin/reboot"
>
> MailProg=/bin/mail
> ProctrackType=proctrack/cgroup
> ReturnToService=2
> StateSaveLocation=/var/spool/slurmctld
> TaskPlugin=task/cgroup
>
> # LOGGING AND ACCOUNTING
> AccountingStorageType=accounting_storage/filetxt
> AccountingStorageLoc=/var/log/slurm_acct/slurm_jobacct.log
> JobCompLoc=/var/log/slurm_acct/slurm_jobcomp.log
> JobAcctGatherType=jobacct_gather/cgroup
>
> # RESOURCES
> MemLimitEnforce=no
>
> ## Rack 1
> NodeName=piglet-19 NodeAddr=10.15.2.19 RealMemory=64000 TmpDisk=512000
> Sockets=2 CoresPerSocket=28 ThreadsPerCore=1 CPUSpecList=0,1 Weight=2
> NodeName=piglet-18 NodeAddr=10.15.2.18 RealMemory=128000
> TmpDisk=512000 Sockets=2 CoresPerSocket=14 ThreadsPerCore=1
> CPUSpecList=0,1 Weight=2
> NodeName=piglet-17 NodeAddr=10.15.2.17 RealMemory=64000 TmpDisk=512000
> Sockets=2 CoresPerSocket=28 ThreadsPerCore=1 CPUSpecList=0,1 Weight=3
>
> # Preempt
> PreemptMode=REQUEUE
> PreemptType=preempt/qos
>
> PartitionName=batch Nodes=ALL MaxTime=2880 OverSubscribe=YES State=UP
> PreemptMode=REQUEUE PriorityTier=10 Default=YES
>
> # TIMERS
> KillWait=30
> MinJobAge=300
> MessageTimeout=3
>
> # SCHEDULING
> FastSchedule=1
> SchedulerType=sched/backfill
> SelectType=select/cons_res
> #SelectTypeParameters=CR_Core_Memory
> SelectTypeParameters=CR_CPU_Memory
> DefMemPerCPU=128
>
> # Limit
> MaxArraySize=201
>
> # slurmctld
> SlurmctldDebug=5
> SlurmctldLogFile=/var/log/slurm/slurmctld.log
> SlurmctldPidFile=/var/slurm/slurmctld.pid
> SlurmctldPort=6817
> SlurmctldTimeout=60
> SlurmUser=slurm
>
> # slurmd
> SlurmdDebug=5
> SlurmdLogFile=/var/log/slurmd.log
> SlurmdPort=6818
> SlurmdSpoolDir=/var/spool/slurmd
> SlurmdTimeout=300
>
> # REQUEUE
> #RequeueExitHold=1-199,201-255
> #RequeueExit=200
> RequeueExitHold=201-255
> RequeueExit=200
>
> Slurmctld.log
> [2019-10-07T13:38:47.724] debug: sched: Running job scheduler
> [2019-10-07T13:38:49.254] error: slurm_auth_get_host: Lookup failed:
> Unknown host
> [2019-10-07T13:38:49.255] sched: _slurm_rpc_allocate_resources
> JobId=19 NodeList=piglet-18 usec=959
> [2019-10-07T13:38:49.259] debug: laying out the 1 tasks on 1 hosts
> piglet-18 dist 2
> [2019-10-07T13:38:49.260] error: slurm_cred_create: getpwuid failed
> for uid=1000
> [2019-10-07T13:38:49.260] error: slurm_cred_create error
> [2019-10-07T13:38:49.262] _job_complete: JobId=19 WTERMSIG 1
> [2019-10-07T13:38:49.265] _job_complete: JobId=19 done
> [2019-10-07T13:38:49.270] debug: sched: Running job scheduler
> [2019-10-07T13:38:56.823] debug: sched: Running job scheduler
> [2019-10-07T13:39:13.504] debug: backfill: beginning
> [2019-10-07T13:39:13.504] debug: backfill: no jobs to backfill
> [2019-10-07T13:39:40.871] debug: Spawning ping agent for piglet-19
> [2019-10-07T13:39:43.504] debug: backfill: beginning
> [2019-10-07T13:39:43.504] debug: backfill: no jobs to backfill
> [2019-10-07T13:39:46.999] error: slurm_auth_get_host: Lookup failed:
> Unknown host
> [2019-10-07T13:39:47.001] sched: _slurm_rpc_allocate_resources
> JobId=20 NodeList=piglet-18 usec=979
> [2019-10-07T13:39:47.005] debug: laying out the 1 tasks on 1 hosts
> piglet-18 dist 2
> [2019-10-07T13:39:47.144] _job_complete: JobId=20 WEXITSTATUS 0
> [2019-10-07T13:39:47.147] _job_complete: JobId=20 done
> [2019-10-07T13:39:47.158] debug: sched: Running job scheduler
> [2019-10-07T13:39:48.428] error: slurm_auth_get_host: Lookup failed:
> Unknown host
> [2019-10-07T13:39:48.429] sched: _slurm_rpc_allocate_resources
> JobId=21 NodeList=piglet-18 usec=1114
> [2019-10-07T13:39:48.434] debug: laying out the 1 tasks on 1 hosts
> piglet-18 dist 2
> [2019-10-07T13:39:48.559] _job_complete: JobId=21 WEXITSTATUS 0
> [2019-10-07T13:39:48.560] _job_complete: JobId=21 done
>
> slurmd.log on piglet-18
> [2019-10-07T13:38:42.746] debug: _rpc_terminate_job, uid = 3001
> [2019-10-07T13:38:42.747] debug: credential for job 17 revoked
> [2019-10-07T13:38:47.721] debug: _rpc_terminate_job, uid = 3001
> [2019-10-07T13:38:47.722] debug: credential for job 18 revoked
> [2019-10-07T13:38:49.267] debug: _rpc_terminate_job, uid = 3001
> [2019-10-07T13:38:49.268] debug: credential for job 19 revoked
> [2019-10-07T13:39:47.014] launch task 20.0 request from UID:0 GID:0
> HOST:10.15.2.19 PORT:62137
> [2019-10-07T13:39:47.014] debug: Checking credential with 404 bytes
> of sig data
> [2019-10-07T13:39:47.016] _run_prolog: run job script took usec=7
> [2019-10-07T13:39:47.016] _run_prolog: prolog with lock for job 20 ran
> for 0 seconds
> [2019-10-07T13:39:47.026] debug: AcctGatherEnergy NONE plugin loaded
> [2019-10-07T13:39:47.026] debug: AcctGatherProfile NONE plugin loaded
> [2019-10-07T13:39:47.026] debug: AcctGatherInterconnect NONE plugin
> loaded
> [2019-10-07T13:39:47.026] debug: AcctGatherFilesystem NONE plugin loaded
> [2019-10-07T13:39:47.026] debug: switch NONE plugin loaded
> [2019-10-07T13:39:47.028] [20.0] debug: CPUs:28 Boards:1 Sockets:2
> CoresPerSocket:14 ThreadsPerCore:1
> [2019-10-07T13:39:47.028] [20.0] debug: Job accounting gather cgroup
> plugin loaded
> [2019-10-07T13:39:47.028] [20.0] debug: cont_id hasn't been set yet
> not running poll
> [2019-10-07T13:39:47.029] [20.0] debug: Message thread started pid =
> 30331
> [2019-10-07T13:39:47.030] [20.0] debug: task/cgroup: now constraining
> jobs allocated cores
> [2019-10-07T13:39:47.030] [20.0] debug: task/cgroup: loaded
> [2019-10-07T13:39:47.030] [20.0] debug: Checkpoint plugin loaded:
> checkpoint/none
> [2019-10-07T13:39:47.030] [20.0] Munge credential signature plugin loaded
> [2019-10-07T13:39:47.031] [20.0] debug: job_container none plugin loaded
> [2019-10-07T13:39:47.031] [20.0] debug: mpi type = none
> [2019-10-07T13:39:47.031] [20.0] debug: xcgroup_instantiate: cgroup
> '/sys/fs/cgroup/freezer/slurm' already exists
> [2019-10-07T13:39:47.031] [20.0] debug: spank: opening plugin stack
> /etc/slurm/plugstack.conf
> [2019-10-07T13:39:47.031] [20.0] debug: mpi type = (null)
> [2019-10-07T13:39:47.031] [20.0] debug: mpi/none: slurmstepd prefork
> [2019-10-07T13:39:47.031] [20.0] debug: xcgroup_instantiate: cgroup
> '/sys/fs/cgroup/cpuset/slurm' already exists
> [2019-10-07T13:39:47.032] [20.0] debug: task/cgroup: job abstract
> cores are '2'
> [2019-10-07T13:39:47.032] [20.0] debug: task/cgroup: step abstract
> cores are '2'
> [2019-10-07T13:39:47.032] [20.0] debug: task/cgroup: job physical
> cores are '4'
> [2019-10-07T13:39:47.032] [20.0] debug: task/cgroup: step physical
> cores are '4'
> [2019-10-07T13:39:47.065] [20.0] debug level = 2
> [2019-10-07T13:39:47.065] [20.0] starting 1 tasks
> [2019-10-07T13:39:47.066] [20.0] task 0 (30336) started
> 2019-10-07T13:39:47
> [2019-10-07T13:39:47.066] [20.0] debug:
> jobacct_gather_cgroup_cpuacct_attach_task: jobid 20 stepid 0 taskid 0
> max_task_id 0
> [2019-10-07T13:39:47.066] [20.0] debug: xcgroup_instantiate: cgroup
> '/sys/fs/cgroup/cpuacct/slurm' already exists
> [2019-10-07T13:39:47.067] [20.0] debug:
> jobacct_gather_cgroup_memory_attach_task: jobid 20 stepid 0 taskid 0
> max_task_id 0
> [2019-10-07T13:39:47.067] [20.0] debug: xcgroup_instantiate: cgroup
> '/sys/fs/cgroup/memory/slurm' already exists
> [2019-10-07T13:39:47.068] [20.0] debug: IO handler started pid=30331
> [2019-10-07T13:39:47.099] [20.0] debug: jag_common_poll_data: Task 0
> pid 30336 ave_freq = 1597534 mem size/max 0/0 vmem size/max
> 210853888/210853888, disk read size/max (0/0), disk write size/max
> (0/0), time 0.000000(0+0) Energy tot/max 0/0 TotPower 0 MaxPower 0
> MinPower 0
> [2019-10-07T13:39:47.101] [20.0] debug: mpi type = (null)
> [2019-10-07T13:39:47.101] [20.0] debug: Using mpi/none
> [2019-10-07T13:39:47.102] [20.0] debug: CPUs:28 Boards:1 Sockets:2
> CoresPerSocket:14 ThreadsPerCore:1
> [2019-10-07T13:39:47.104] [20.0] debug: Sending launch resp rc=0
> [2019-10-07T13:39:47.105] [20.0] task 0 (30336) exited with exit code 0.
> [2019-10-07T13:39:47.139] [20.0] debug: step_terminate_monitor_stop
> signaling condition
> [2019-10-07T13:39:47.139] [20.0] debug: Waiting for IO
> [2019-10-07T13:39:47.140] [20.0] debug: Closing debug channel
> [2019-10-07T13:39:47.140] [20.0] debug: IO handler exited, rc=0
> [2019-10-07T13:39:47.148] [20.0] debug: Message thread exited
> [2019-10-07T13:39:47.149] [20.0] done with job
>
> I am not sure what I am missing. I hope someone can point out what I am
> doing wrong here.
> Thank you.
>
> Best regards,
> Eddy Swan
>
--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de