[slurm-users] srun: Error generating job credential

Eddy Swan eddys at prestolabs.io
Mon Oct 7 06:07:44 UTC 2019


Hi All,

I am currently testing Slurm 19.05.3-2 on CentOS 7 with one master and
three compute nodes.
I am using the same configuration that worked on 17.02.7, but for some
reason it does not work on 19.05.3-2.

$ srun hostname
srun: error: Unable to create step for job 19: Error generating job credential
srun: Force Terminated job 19

If I run it as root, it works fine.

$ sudo srun hostname
piglet-18
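
Since the failure only happens for my own account, my guess (hedged, I
have not confirmed it yet) is that credential creation involves looking
up the submitting user, and root (uid 0) is the only user every host is
guaranteed to know about. The comparison I intend to run is below,
assuming uid 1000 is my account on the submit node (it is the uid in
the slurmctld error further down):

$ id                                      # on the submit node: my account and uid
$ ssh slurm-master 'getent passwd 1000'   # does the same uid resolve on the master?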

Configuration:
$ cat /etc/slurm/slurm.conf
# Common
ControlMachine=slurm-master
ControlAddr=10.15.131.32
ClusterName=slurm-cluster
RebootProgram="/usr/sbin/reboot"

MailProg=/bin/mail
ProctrackType=proctrack/cgroup
ReturnToService=2
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/cgroup

# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/filetxt
AccountingStorageLoc=/var/log/slurm_acct/slurm_jobacct.log
JobCompLoc=/var/log/slurm_acct/slurm_jobcomp.log
JobAcctGatherType=jobacct_gather/cgroup

# RESOURCES
MemLimitEnforce=no

## Rack 1
NodeName=piglet-19 NodeAddr=10.15.2.19 RealMemory=64000 TmpDisk=512000 Sockets=2 CoresPerSocket=28 ThreadsPerCore=1 CPUSpecList=0,1 Weight=2
NodeName=piglet-18 NodeAddr=10.15.2.18 RealMemory=128000 TmpDisk=512000 Sockets=2 CoresPerSocket=14 ThreadsPerCore=1 CPUSpecList=0,1 Weight=2
NodeName=piglet-17 NodeAddr=10.15.2.17 RealMemory=64000 TmpDisk=512000 Sockets=2 CoresPerSocket=28 ThreadsPerCore=1 CPUSpecList=0,1 Weight=3

# Preempt
PreemptMode=REQUEUE
PreemptType=preempt/qos

PartitionName=batch Nodes=ALL MaxTime=2880 OverSubscribe=YES State=UP PreemptMode=REQUEUE PriorityTier=10 Default=YES

# TIMERS
KillWait=30
MinJobAge=300
MessageTimeout=3

# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
#SelectTypeParameters=CR_Core_Memory
SelectTypeParameters=CR_CPU_Memory
DefMemPerCPU=128

# Limit
MaxArraySize=201

# slurmctld
SlurmctldDebug=5
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmctldPidFile=/var/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmctldTimeout=60
SlurmUser=slurm

# slurmd
SlurmdDebug=5
SlurmdLogFile=/var/log/slurmd.log
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmdTimeout=300

# REQUEUE
#RequeueExitHold=1-199,201-255
#RequeueExit=200
RequeueExitHold=201-255
RequeueExit=200
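
In case the node definitions matter: as far as I know, running slurmd
with -C on each compute node prints the detected hardware as a
slurm.conf NodeName line, so it can be compared against the
Sockets/CoresPerSocket/ThreadsPerCore values above:

$ slurmd -C   # on each piglet node, compare with the NodeName lines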

slurmctld.log on slurm-master
[2019-10-07T13:38:47.724] debug:  sched: Running job scheduler
[2019-10-07T13:38:49.254] error: slurm_auth_get_host: Lookup failed: Unknown host
[2019-10-07T13:38:49.255] sched: _slurm_rpc_allocate_resources JobId=19 NodeList=piglet-18 usec=959
[2019-10-07T13:38:49.259] debug:  laying out the 1 tasks on 1 hosts piglet-18 dist 2
[2019-10-07T13:38:49.260] error: slurm_cred_create: getpwuid failed for uid=1000
[2019-10-07T13:38:49.260] error: slurm_cred_create error
[2019-10-07T13:38:49.262] _job_complete: JobId=19 WTERMSIG 1
[2019-10-07T13:38:49.265] _job_complete: JobId=19 done
[2019-10-07T13:38:49.270] debug:  sched: Running job scheduler
[2019-10-07T13:38:56.823] debug:  sched: Running job scheduler
[2019-10-07T13:39:13.504] debug:  backfill: beginning
[2019-10-07T13:39:13.504] debug:  backfill: no jobs to backfill
[2019-10-07T13:39:40.871] debug:  Spawning ping agent for piglet-19
[2019-10-07T13:39:43.504] debug:  backfill: beginning
[2019-10-07T13:39:43.504] debug:  backfill: no jobs to backfill
[2019-10-07T13:39:46.999] error: slurm_auth_get_host: Lookup failed: Unknown host
[2019-10-07T13:39:47.001] sched: _slurm_rpc_allocate_resources JobId=20 NodeList=piglet-18 usec=979
[2019-10-07T13:39:47.005] debug:  laying out the 1 tasks on 1 hosts piglet-18 dist 2
[2019-10-07T13:39:47.144] _job_complete: JobId=20 WEXITSTATUS 0
[2019-10-07T13:39:47.147] _job_complete: JobId=20 done
[2019-10-07T13:39:47.158] debug:  sched: Running job scheduler
[2019-10-07T13:39:48.428] error: slurm_auth_get_host: Lookup failed: Unknown host
[2019-10-07T13:39:48.429] sched: _slurm_rpc_allocate_resources JobId=21 NodeList=piglet-18 usec=1114
[2019-10-07T13:39:48.434] debug:  laying out the 1 tasks on 1 hosts piglet-18 dist 2
[2019-10-07T13:39:48.559] _job_complete: JobId=21 WEXITSTATUS 0
[2019-10-07T13:39:48.560] _job_complete: JobId=21 done
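
Reading the two errors together, both look like name-service lookups
failing on the master: slurm_auth_get_host cannot resolve the host the
request came from, and slurm_cred_create cannot resolve uid 1000 via
getpwuid. If that reading is right, the following checks (run on
slurm-master; the uid and address are taken from the logs) should come
back empty:

$ getent passwd 1000        # the user lookup slurm_cred_create does via getpwuid
$ getent hosts 10.15.2.19   # reverse lookup of the host I submit from

That would also explain why sudo works: uid 0 is in /etc/passwd on
every host.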

slurmd.log on piglet-18
[2019-10-07T13:38:42.746] debug:  _rpc_terminate_job, uid = 3001
[2019-10-07T13:38:42.747] debug:  credential for job 17 revoked
[2019-10-07T13:38:47.721] debug:  _rpc_terminate_job, uid = 3001
[2019-10-07T13:38:47.722] debug:  credential for job 18 revoked
[2019-10-07T13:38:49.267] debug:  _rpc_terminate_job, uid = 3001
[2019-10-07T13:38:49.268] debug:  credential for job 19 revoked
[2019-10-07T13:39:47.014] launch task 20.0 request from UID:0 GID:0 HOST:10.15.2.19 PORT:62137
[2019-10-07T13:39:47.014] debug:  Checking credential with 404 bytes of sig data
[2019-10-07T13:39:47.016] _run_prolog: run job script took usec=7
[2019-10-07T13:39:47.016] _run_prolog: prolog with lock for job 20 ran for 0 seconds
[2019-10-07T13:39:47.026] debug:  AcctGatherEnergy NONE plugin loaded
[2019-10-07T13:39:47.026] debug:  AcctGatherProfile NONE plugin loaded
[2019-10-07T13:39:47.026] debug:  AcctGatherInterconnect NONE plugin loaded
[2019-10-07T13:39:47.026] debug:  AcctGatherFilesystem NONE plugin loaded
[2019-10-07T13:39:47.026] debug:  switch NONE plugin loaded
[2019-10-07T13:39:47.028] [20.0] debug:  CPUs:28 Boards:1 Sockets:2 CoresPerSocket:14 ThreadsPerCore:1
[2019-10-07T13:39:47.028] [20.0] debug:  Job accounting gather cgroup plugin loaded
[2019-10-07T13:39:47.028] [20.0] debug:  cont_id hasn't been set yet not running poll
[2019-10-07T13:39:47.029] [20.0] debug:  Message thread started pid = 30331
[2019-10-07T13:39:47.030] [20.0] debug:  task/cgroup: now constraining jobs allocated cores
[2019-10-07T13:39:47.030] [20.0] debug:  task/cgroup: loaded
[2019-10-07T13:39:47.030] [20.0] debug:  Checkpoint plugin loaded: checkpoint/none
[2019-10-07T13:39:47.030] [20.0] Munge credential signature plugin loaded
[2019-10-07T13:39:47.031] [20.0] debug:  job_container none plugin loaded
[2019-10-07T13:39:47.031] [20.0] debug:  mpi type = none
[2019-10-07T13:39:47.031] [20.0] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/freezer/slurm' already exists
[2019-10-07T13:39:47.031] [20.0] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
[2019-10-07T13:39:47.031] [20.0] debug:  mpi type = (null)
[2019-10-07T13:39:47.031] [20.0] debug:  mpi/none: slurmstepd prefork
[2019-10-07T13:39:47.031] [20.0] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuset/slurm' already exists
[2019-10-07T13:39:47.032] [20.0] debug:  task/cgroup: job abstract cores are '2'
[2019-10-07T13:39:47.032] [20.0] debug:  task/cgroup: step abstract cores are '2'
[2019-10-07T13:39:47.032] [20.0] debug:  task/cgroup: job physical cores are '4'
[2019-10-07T13:39:47.032] [20.0] debug:  task/cgroup: step physical cores are '4'
[2019-10-07T13:39:47.065] [20.0] debug level = 2
[2019-10-07T13:39:47.065] [20.0] starting 1 tasks
[2019-10-07T13:39:47.066] [20.0] task 0 (30336) started 2019-10-07T13:39:47
[2019-10-07T13:39:47.066] [20.0] debug:  jobacct_gather_cgroup_cpuacct_attach_task: jobid 20 stepid 0 taskid 0 max_task_id 0
[2019-10-07T13:39:47.066] [20.0] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuacct/slurm' already exists
[2019-10-07T13:39:47.067] [20.0] debug:  jobacct_gather_cgroup_memory_attach_task: jobid 20 stepid 0 taskid 0 max_task_id 0
[2019-10-07T13:39:47.067] [20.0] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm' already exists
[2019-10-07T13:39:47.068] [20.0] debug:  IO handler started pid=30331
[2019-10-07T13:39:47.099] [20.0] debug:  jag_common_poll_data: Task 0 pid 30336 ave_freq = 1597534 mem size/max 0/0 vmem size/max 210853888/210853888, disk read size/max (0/0), disk write size/max (0/0), time 0.000000(0+0) Energy tot/max 0/0 TotPower 0 MaxPower 0 MinPower 0
[2019-10-07T13:39:47.101] [20.0] debug:  mpi type = (null)
[2019-10-07T13:39:47.101] [20.0] debug:  Using mpi/none
[2019-10-07T13:39:47.102] [20.0] debug:  CPUs:28 Boards:1 Sockets:2 CoresPerSocket:14 ThreadsPerCore:1
[2019-10-07T13:39:47.104] [20.0] debug:  Sending launch resp rc=0
[2019-10-07T13:39:47.105] [20.0] task 0 (30336) exited with exit code 0.
[2019-10-07T13:39:47.139] [20.0] debug:  step_terminate_monitor_stop signaling condition
[2019-10-07T13:39:47.139] [20.0] debug:  Waiting for IO
[2019-10-07T13:39:47.140] [20.0] debug:  Closing debug channel
[2019-10-07T13:39:47.140] [20.0] debug:  IO handler exited, rc=0
[2019-10-07T13:39:47.148] [20.0] debug:  Message thread exited
[2019-10-07T13:39:47.149] [20.0] done with job
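
Munge itself appears healthy, since the credential for root's job 20
validates on piglet-18. For completeness, the standard round-trip test
from the munge documentation is:

$ munge -n | unmunge                 # local encode/decode
$ munge -n | ssh piglet-18 unmunge   # encode on one host, decode on another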

I am not sure what I am missing. I hope someone can point out what I am
doing wrong here.
Thank you.

Best regards,
Eddy Swan

