[slurm-users] Restarted job fails immediately
Koso Kashima
kashima.biz at gmail.com
Thu Nov 28 00:48:39 UTC 2019
Hi All,
I want to preempt a job from a low-priority partition and restart it when
resources are available again, but a restarted job fails immediately.
Are there any manners or configurations for job preemption?
I used slurm-docker-cluster to build the slurm cluster for testing. The
same problem occurred in other clusters deployed with NVIDA/DeepOps.
The relevant configurations are as follows:
SLURM_VERSION = 19.05.1-2
SchedulerType = sched/backfill
SelectType = select/cons_res
SelectTypeParameters = CR_CPU_MEMORY
PreemptMode = REQUEUE
PreemptType = preempt/partition_prio
Low-priority partition info:
PartitionName=common.q
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0
Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO
MaxCPUsPerNode=UNLIMITED
Nodes=c[1-2]
PriorityJobFactor=1 PriorityTier=10 RootOnly=NO ReqResv=NO
OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=REQUEUE
State=UP TotalCPUs=2 TotalNodes=2 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerCPU=UNLIMITED MaxMemPerNode=UNLIMITED
High-priority partition info:
PartitionName=sepang.q
AllowGroups=ALL AllowAccounts=sepang AllowQos=ALL
AllocNodes=ALL Default=NO QoS=sepang
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0
Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO
MaxCPUsPerNode=UNLIMITED
Nodes=c[1-2]
PriorityJobFactor=1 PriorityTier=20 RootOnly=NO ReqResv=NO
OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=2 TotalNodes=2 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerCPU=UNLIMITED MaxMemPerNode=UNLIMITED
I executed following script.
The first job was preempted by the third job and requeued.
The requeued job restarted but sometimes failed immediately.
```
#!/usr/bin/env bash
set -ex
time=130
sbatch -p common.q -n 1 --wrap="srun -n 1 sleep $time"
sleep 5
sbatch -p common.q -n 1 --wrap="srun -n 1 sleep $time"
sleep 5
sbatch -p sepang.q -n 1 --wrap="srun -n 1 sleep $time"
sleep 1
```
It seems that jobs that restarted by the main scheduling loop or events
fail immediately,
and jobs that restarted by the backfill scheduler complete successfully.
What is the difference between "sched" and "backfill".
The following JobId 117 and 120 jobs were both preempted and requeued.
JobId 117 was restarted by sched, but jobid 120 by backfill.
If the script sleep for 120 seconds, the backfill will tend to restart the
job, and if 130 seconds, sched will restart.
JobId 117-119 are the former and jobid 120-122 are the latter.
```
# sacct -X -o jobid,jobname,partition,alloccpus,state,exitcode
JobID JobName Partition AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- --------
118 wrap common.q 1 COMPLETED 0:0
117 wrap common.q 1 COMPLETED 0:0
119 wrap sepang.q 1 COMPLETED 0:0
121 wrap common.q 1 COMPLETED 0:0
122 wrap sepang.q 1 COMPLETED 0:0
120 wrap common.q 1 FAILED 1:0
```
slurmctld.log:
```
# FAILED case
[2019-11-26T10:07:35.322] sched: Allocate JobId=120 NodeList=c2 #CPUs=1
Partition=common.q
[2019-11-26T10:07:35.322] debug2: _group_cache_lookup_internal: found valid
entry for tanaka
[2019-11-26T10:07:35.323] debug2: Spawning RPC agent for msg_type
REQUEST_BATCH_JOB_LAUNCH
[2019-11-26T10:07:35.323] debug2: Tree head got back 0 looking for 1
[2019-11-26T10:07:35.324] debug2: Tree head got back 1
[2019-11-26T10:07:35.328] debug2: node_did_resp c2
[2019-11-26T10:07:35.374] debug2: Processing RPC:
REQUEST_JOB_PACK_ALLOC_INFO from uid=1016
[2019-11-26T10:07:35.374] debug: _slurm_rpc_job_pack_alloc_info: JobId=120
NodeList=c2 usec=81
[2019-11-26T10:07:35.385] debug2: Processing RPC:
REQUEST_COMPLETE_BATCH_SCRIPT from uid=0 JobId=120
[2019-11-26T10:07:35.386] _job_complete: JobId=120 WEXITSTATUS 1
# COMPLETED case
[2019-11-26T09:57:39.243] backfill: Started JobId=117 in common.q on c1
[2019-11-26T09:57:39.243] debug2: _group_cache_lookup_internal: found valid
entry for tanaka
[2019-11-26T09:57:39.243] debug2: altering JobId=117 QOS normal got
31536000 just removed 257698037700 and added 31536000
[2019-11-26T09:57:39.243] debug2: altering JobId=117 QOS normal got
31536000 just removed 257698037700 and added 31536000
[2019-11-26T09:57:39.244] debug2: altering JobId=117 QOS normal got
31536000 just removed 257698037700 and added 31536000
[2019-11-26T09:57:39.244] debug2: altering JobId=117 assoc
12(sepang/tanaka/(null)) got 31536000 just removed 257698037700 and added
31536000
[2019-11-26T09:57:39.244] debug2: altering JobId=117 assoc
12(sepang/tanaka/(null)) got 31536000 just removed 257698037700 and added
31536000
[2019-11-26T09:57:39.244] debug2: altering JobId=117 assoc
12(sepang/tanaka/(null)) got 31536000 just removed 257698037700 and added
31536000
[2019-11-26T09:57:39.244] debug2: altering JobId=117 assoc
5(sepang/(null)/(null)) got 31536000 just removed 257698037700 and added
31536000
[2019-11-26T09:57:39.244] debug2: altering JobId=117 assoc
5(sepang/(null)/(null)) got 31536000 just removed 257698037700 and added
31536000
[2019-11-26T09:57:39.244] debug2: altering JobId=117 assoc
5(sepang/(null)/(null)) got 31536000 just removed 257698037700 and added
31536000
[2019-11-26T09:57:39.244] debug2: altering JobId=117 assoc
1(root/(null)/(null)) got 31536000 just removed 257698037700 and added
31536000
[2019-11-26T09:57:39.244] debug2: altering JobId=117 assoc
1(root/(null)/(null)) got 31536000 just removed 257698037700 and added
31536000
[2019-11-26T09:57:39.244] debug2: altering JobId=117 assoc
1(root/(null)/(null)) got 31536000 just removed 257698037700 and added
31536000
[2019-11-26T09:57:39.244] debug2: Spawning RPC agent for msg_type
REQUEST_BATCH_JOB_LAUNCH
[2019-11-26T09:57:39.245] debug2: Tree head got back 0 looking for 1
[2019-11-26T09:57:39.246] debug2: Tree head got back 1
[2019-11-26T09:57:39.250] debug2: node_did_resp c1
[2019-11-26T09:57:39.277] debug2: Processing RPC:
REQUEST_JOB_PACK_ALLOC_INFO from uid=1016
[2019-11-26T09:57:39.277] debug: _slurm_rpc_job_pack_alloc_info: JobId=117
NodeList=c1 usec=78
[2019-11-26T09:57:39.278] debug: laying out the 1 tasks on 1 hosts c1 dist
2
[2019-11-26T09:57:39.279] debug2: _group_cache_lookup_internal: found valid
entry for tanaka
[2019-11-26T09:58:02.533] debug2: Testing job time limits and checkpoints
[2019-11-26T09:58:09.244] debug: backfill: beginning
[2019-11-26T09:58:09.244] debug: backfill: no jobs to backfill
[2019-11-26T09:58:25.559] debug2: Performing purge of old job records
[2019-11-26T09:58:25.559] debug2: purge_old_job: purged 1 old job records
[2019-11-26T09:58:25.559] debug2: _purge_files_thread: starting, 1 jobs to
purge
[2019-11-26T09:58:25.559] debug2: _purge_files_thread: purging files from
JobId=115
[2019-11-26T09:58:25.559] debug: sched: Running job scheduler
[2019-11-26T09:58:32.567] debug2: Testing job time limits and checkpoints
[2019-11-26T09:58:39.245] debug: backfill: beginning
[2019-11-26T09:58:39.245] debug: backfill: no jobs to backfill
[2019-11-26T09:59:02.601] debug2: Testing job time limits and checkpoints
[2019-11-26T09:59:25.627] debug2: Performing purge of old job records
[2019-11-26T09:59:25.627] debug: sched: Running job scheduler
[2019-11-26T09:59:32.635] debug2: Testing job time limits and checkpoints
[2019-11-26T09:59:39.324] debug2: full switch release for JobId=117
StepId=1, nodes c1
[2019-11-26T09:59:39.342] debug2: Processing RPC:
REQUEST_COMPLETE_BATCH_SCRIPT from uid=0 JobId=117
[2019-11-26T09:59:39.342] _job_complete: JobId=117 WEXITSTATUS 0
[2019-11-26T09:59:39.342] _job_complete: JobId=117 done
```
slurmd.log:
```
# FAILED case
[2019-11-26T10:07:35.341] [120.batch] task 0 (1108) started
2019-11-26T10:07:35
[2019-11-26T10:07:35.342] [120.batch] debug: task_p_pre_launch_priv:
120.4294967294
[2019-11-26T10:07:35.342] [120.batch] debug2: adding task 0 pid 1108 on
node 0 to jobacct
[2019-11-26T10:07:35.344] [120.batch] debug2: _get_precs: energy = 0 watts
= 0
[2019-11-26T10:07:35.356] [120.batch] debug2: xcgroup_load: unable to get
cgroup '(null)/cpuset' entry '(null)/cpuset/system' properties: No such
file or directory
[2019-11-26T10:07:35.356] [120.batch] debug2: xcgroup_load: unable to get
cgroup '(null)/memory' entry '(null)/memory/system' properties: No such
file or directory
[2019-11-26T10:07:35.358] [120.batch] debug: task_p_pre_launch:
120.4294967294, task 0
[2019-11-26T10:07:35.358] [120.batch] debug2: _set_limit: conf setrlimit
RLIMIT_CPU no change in value: 18446744073709551615
[2019-11-26T10:07:35.358] [120.batch] debug2: _set_limit: conf setrlimit
RLIMIT_FSIZE no change in value: 18446744073709551615
[2019-11-26T10:07:35.358] [120.batch] debug2: _set_limit: conf setrlimit
RLIMIT_DATA no change in value: 18446744073709551615
[2019-11-26T10:07:35.358] [120.batch] debug2: _set_limit: conf setrlimit
RLIMIT_STACK no change in value: 8388608
[2019-11-26T10:07:35.358] [120.batch] debug2: _set_limit: conf setrlimit
RLIMIT_CORE no change in value: 18446744073709551615
[2019-11-26T10:07:35.358] [120.batch] debug2: _set_limit: conf setrlimit
RLIMIT_RSS no change in value: 18446744073709551615
[2019-11-26T10:07:35.358] [120.batch] debug2: _set_limit: RLIMIT_NPROC :
max:inf cur:inf req:4096
[2019-11-26T10:07:35.358] [120.batch] debug2: _set_limit: conf setrlimit
RLIMIT_NPROC succeeded
[2019-11-26T10:07:35.358] [120.batch] debug2: _set_limit: conf setrlimit
RLIMIT_NOFILE no change in value: 1048576
[2019-11-26T10:07:35.358] [120.batch] debug2: _set_limit: conf setrlimit
RLIMIT_MEMLOCK no change in value: 65536
[2019-11-26T10:07:35.358] [120.batch] debug2: _set_limit: conf setrlimit
RLIMIT_AS no change in value: 18446744073709551615
[2019-11-26T10:07:35.379] [120.batch] debug2: _get_precs: energy = 0 watts
= 0
[2019-11-26T10:07:35.379] [120.batch] debug2: removing task 0 pid
1108 from jobacct
[2019-11-26T10:07:35.380] [120.batch] task 0 (1108) exited with exit code 1.
```
```
# sacct -o jobid,jobname,partition,alloccpus,state,exitcode,submit,start,end,elapsed
-j 117,118,119,120,121,122
JobID JobName Partition AllocCPUS State ExitCode
Submit Start End Elapsed
------------ ---------- ---------- ---------- ---------- --------
------------------- ------------------- ------------------- ----------
118 wrap common.q 1 COMPLETED 0:0
2019-11-26T09:55:29 2019-11-26T09:55:30 2019-11-26T09:57:30 00:02:00
118.batch batch 1 COMPLETED 0:0
2019-11-26T09:55:30 2019-11-26T09:55:30 2019-11-26T09:57:30 00:02:00
118.0 sleep 1 COMPLETED 0:0
2019-11-26T09:55:30 2019-11-26T09:55:30 2019-11-26T09:57:30 00:02:00
117 wrap common.q 1 COMPLETED 0:0
2019-11-26T09:55:34 2019-11-26T09:57:39 2019-11-26T09:59:39 00:02:00
117.batch batch 1 COMPLETED 0:0
2019-11-26T09:57:39 2019-11-26T09:57:39 2019-11-26T09:59:39 00:02:00
117.1 sleep 1 COMPLETED 0:0
2019-11-26T09:57:39 2019-11-26T09:57:39 2019-11-26T09:59:39 00:02:00
119 wrap sepang.q 1 COMPLETED 0:0
2019-11-26T09:55:34 2019-11-26T09:55:34 2019-11-26T09:57:34 00:02:00
119.batch batch 1 COMPLETED 0:0
2019-11-26T09:55:34 2019-11-26T09:55:34 2019-11-26T09:57:34 00:02:00
119.0 sleep 1 COMPLETED 0:0
2019-11-26T09:55:34 2019-11-26T09:55:34 2019-11-26T09:57:34 00:02:00
121 wrap common.q 1 COMPLETED 0:0
2019-11-26T10:05:24 2019-11-26T10:05:25 2019-11-26T10:07:35 00:02:10
121.batch batch 1 COMPLETED 0:0
2019-11-26T10:05:25 2019-11-26T10:05:25 2019-11-26T10:07:35 00:02:10
121.0 sleep 1 COMPLETED 0:0
2019-11-26T10:05:25 2019-11-26T10:05:25 2019-11-26T10:07:35 00:02:10
122 wrap sepang.q 1 COMPLETED 0:0
2019-11-26T10:05:29 2019-11-26T10:05:30 2019-11-26T10:07:40 00:02:10
122.batch batch 1 COMPLETED 0:0
2019-11-26T10:05:30 2019-11-26T10:05:30 2019-11-26T10:07:40 00:02:10
122.0 sleep 1 COMPLETED 0:0
2019-11-26T10:05:30 2019-11-26T10:05:30 2019-11-26T10:07:40 00:02:10
120 wrap common.q 1 FAILED 1:0
2019-11-26T10:05:30 2019-11-26T10:07:35 2019-11-26T10:07:35 00:00:00
120.batch batch 1 FAILED 1:0
2019-11-26T10:07:35 2019-11-26T10:07:35 2019-11-26T10:07:35 00:00:00
```
Thanks,
Koso Kashima
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20191128/82b2ef1a/attachment-0001.htm>
More information about the slurm-users
mailing list