[slurm-users] Running job is canceled when starting a new job from the queue

Uwe Seher uwe.seher at gmail.com
Mon Oct 28 12:33:12 UTC 2019


Hello group!
While running our first jobs I hit a strange issue when running multiple
jobs on a single partition.
The partition is a single node with 32 cores and 128 GB of memory. There is a
queue with three jobs, each of which should use 15 cores; memory usage is not
important. As planned, two jobs are running and share the node as expected
(jobs 160 and 161 in the log below), and one is waiting (168). After the first
job completes, job 168 starts as expected. But right after that the other
running job, 161, is terminated with exit code 9 ("Ran out of CPU time"), and
in the end the newly started job 168 is terminated the same way. On another
node the same thing happens, except that there the newly started job keeps
running as expected.
I suspect there is a problem with freeing the resources (here: the cores),
but I have no clue how to avoid the issue. The slurmctld.log excerpt, the
slurmd.log of the node, and the relevant part of slurm.conf are below.

Thank you in advance
Uwe Seher


slurmctld.log:
[2019-10-27T06:33:27.735] debug:  sched: Running job scheduler
[2019-10-27T06:33:54.970] debug:  backfill: beginning
[2019-10-27T06:33:54.970] debug:  backfill: 1 jobs to backfill
[2019-10-27T06:34:02.328] _job_complete: JobID=160 State=0x1 NodeCnt=1 WEXITSTATUS 0
[2019-10-27T06:34:02.328] email msg to bla at blubb.com: SLURM Job_id=160 Name=1805-Modell-v201 Ended, Run time 1-17:46:26, COMPLETED, ExitCode 0
[2019-10-27T06:34:02.331] _job_complete: JobID=160 State=0x8003 NodeCnt=1 done
[2019-10-27T06:34:03.665] debug:  sched: Running job scheduler
[2019-10-27T06:34:03.665] email msg to bla at blubb.com: SLURM Job_id=168 Name=1805-Modell-v206 Began, Queued time 1-17:00:19
[2019-10-27T06:34:03.667] sched: Allocate JobID=168 NodeList=vhost-2 #CPUs=15 Partition=vh2
[2019-10-27T06:34:03.708] _job_complete: JobID=161 State=0x1 NodeCnt=1 WTERMSIG 9
[2019-10-27T06:34:03.709] email msg to bla at blubb.com: SLURM Job_id=161 Name=1805-Modell-v202 Failed, Run time 1-17:46:15, FAILED
[2019-10-27T06:34:03.710] _job_complete: JobID=161 State=0x8005 NodeCnt=1 done
[2019-10-27T06:34:06.999] debug:  sched: Running job scheduler
[2019-10-27T06:34:06.999] _job_complete: JobID=168 State=0x1 NodeCnt=1 WTERMSIG 9
[2019-10-27T06:34:07.000] email msg to bla at blubb.com: SLURM Job_id=168 Name=1805-Modell-v206 Failed, Run time 00:00:03, FAILED
[2019-10-27T06:34:07.001] _job_complete: JobID=168 State=0x8005 NodeCnt=1 done

slurm.conf:
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SelectType=select/linear
SelectType=select/cons_res
SelectTypeParameters=CR_Core
FastSchedule=1
OverTimeLimit=UNLIMITED
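
These are the values from slurm.conf; to double-check what the running
slurmctld actually uses, I would run something like this (sketch only, output
not included):

# Ask the controller which scheduler/selection settings are active
scontrol show config | grep -E 'SchedulerType|SelectType|FastSchedule|OverTimeLimit'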


slurmd.log (vhost-2)
[2019-10-27T06:33:24.468] [160.0] task 0 (41992) exited with exit code 0.
[2019-10-27T06:33:24.469] [160.0] debug:  task_p_post_term: 160.0, task 0
[2019-10-27T06:33:24.480] [160.0] debug:  step_terminate_monitor_stop signaling condition
[2019-10-27T06:33:24.481] [160.0] debug:  Waiting for IO
[2019-10-27T06:33:24.481] [160.0] debug:  Closing debug channel
[2019-10-27T06:33:24.481] [160.0] debug:  IO handler exited, rc=0
[2019-10-27T06:33:24.486] [160.0] debug:  Message thread exited
[2019-10-27T06:33:24.487] [160.0] done with job
[2019-10-27T06:33:36.834] [160.batch] debug:  jag_common_poll_data: Task average frequency = 1389 pid 41873 mem size 119852 1676580 time 19.880000(15+3)
[2019-10-27T06:33:40.013] [161.0] debug:  jag_common_poll_data: Task average frequency = 0 pid 42216 mem size 26488 371144 time 7.680000(2+5)
[2019-10-27T06:33:48.393] [161.batch] debug:  jag_common_poll_data: Task average frequency = 872 pid 41926 mem size 52648 1676664 time 11.120000(5+5)
[2019-10-27T06:34:02.316] [160.batch] task 0 (41873) exited with exit code 0.
[2019-10-27T06:34:02.317] [160.batch] debug:  task_p_post_term: 160.4294967294, task 0
[2019-10-27T06:34:02.324] [160.batch] debug:  step_terminate_monitor_stop signaling condition
[2019-10-27T06:34:02.325] [160.batch] job 160 completed with slurm_rc = 0, job_rc = 0
[2019-10-27T06:34:02.325] [160.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
[2019-10-27T06:34:02.332] [160.batch] debug:  Message thread exited
[2019-10-27T06:34:02.332] [160.batch] done with job
[2019-10-27T06:34:02.332] debug:  _rpc_terminate_job, uid = 400
[2019-10-27T06:34:02.332] debug:  task_p_slurmd_release_resources: 160
[2019-10-27T06:34:02.332] debug:  credential for job 160 revoked
[2019-10-27T06:34:02.335] debug:  Waiting for job 160's prolog to complete
[2019-10-27T06:34:02.335] debug:  Finished wait for job 160's prolog to complete
[2019-10-27T06:34:02.335] debug:  [job 160] attempting to run epilog [/etc/slurm/slurm.epilog.clean]
[2019-10-27T06:34:03.617] [161.batch] task 0 (41926) exited. Killed by signal 9.
[2019-10-27T06:34:03.618] [161.batch] debug:  task_p_post_term: 161.4294967294, task 0
[2019-10-27T06:34:03.625] [161.batch] debug:  step_terminate_monitor_stop signaling condition
[2019-10-27T06:34:03.626] [161.batch] error: get_exit_code task 0 died by signal
[2019-10-27T06:34:03.626] [161.batch] job 161 completed with slurm_rc = 0, job_rc = 9
[2019-10-27T06:34:03.626] [161.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 9
[2019-10-27T06:34:03.636] debug:  completed epilog for jobid 160
[2019-10-27T06:34:03.650] [161.0] task 0 (42216) exited. Killed by signal 9.
[2019-10-27T06:34:03.651] [161.0] debug:  task_p_post_term: 161.0, task 0
[2019-10-27T06:34:03.655] [161.0] debug:  step_terminate_monitor_stop signaling condition
[2019-10-27T06:34:03.655] [161.0] debug:  Waiting for IO
[2019-10-27T06:34:03.655] [161.0] debug:  Closing debug channel
[2019-10-27T06:34:03.655] [161.0] debug:  IO handler exited, rc=0
[2019-10-27T06:34:03.655] [161.0] debug:  _send_srun_resp_msg: 0/5 failed to send msg type 6003: Connection refused
[2019-10-27T06:34:03.663] debug:  Job 160: sent epilog complete msg: rc = 0
[2019-10-27T06:34:03.711] [161.batch] debug:  Message thread exited
[2019-10-27T06:34:03.711] [161.batch] done with job
[2019-10-27T06:34:03.712] debug:  task_p_slurmd_batch_request: 168
[2019-10-27T06:34:03.713] _run_prolog: run job script took usec=26
[2019-10-27T06:34:03.713] _run_prolog: prolog with lock for job 168 ran for 0 seconds
[2019-10-27T06:34:03.713] Launching batch job 168 for UID 65537
[2019-10-27T06:34:03.716] debug:  _rpc_terminate_job, uid = 400
[2019-10-27T06:34:03.716] debug:  task_p_slurmd_release_resources: 161
[2019-10-27T06:34:03.717] debug:  credential for job 161 revoked
[2019-10-27T06:34:03.717] [161.0] debug:  Handling REQUEST_SIGNAL_CONTAINER
[2019-10-27T06:34:03.717] [161.0] debug:  _handle_signal_container for step=161.0 uid=0 signal=999
[2019-10-27T06:34:03.719] [161.0] debug:  Handling REQUEST_SIGNAL_CONTAINER
[2019-10-27T06:34:03.719] [161.0] debug:  _handle_signal_container for step=161.0 uid=0 signal=18
[2019-10-27T06:34:03.719] [161.0] Error sending signal 18 to 161.0: No such process
[2019-10-27T06:34:03.720] [161.0] debug:  Handling REQUEST_SIGNAL_CONTAINER
[2019-10-27T06:34:03.720] [161.0] debug:  _handle_signal_container for step=161.0 uid=0 signal=15
[2019-10-27T06:34:03.720] [161.0] Error sending signal 15 to 161.0: No such process
[2019-10-27T06:34:03.721] [161.0] debug:  Handling REQUEST_STATE
[2019-10-27T06:34:03.755] [161.0] debug:  _send_srun_resp_msg: 1/5 failed to send msg type 6003: Connection refused
[2019-10-27T06:34:03.837] debug:  switch NONE plugin loaded
[2019-10-27T06:34:03.866] [168.batch] debug:  AcctGatherProfile NONE plugin loaded
[2019-10-27T06:34:03.869] [168.batch] debug:  AcctGatherEnergy NONE plugin loaded
[2019-10-27T06:34:03.871] [168.batch] debug:  AcctGatherInterconnect NONE plugin loaded
[2019-10-27T06:34:03.872] [168.batch] debug:  AcctGatherFilesystem NONE plugin loaded
[2019-10-27T06:34:03.883] [168.batch] debug:  Job accounting gather LINUX plugin loaded
[2019-10-27T06:34:03.884] [168.batch] debug:  cont_id hasn't been set yet not running poll
[2019-10-27T06:34:03.884] [168.batch] debug:  laying out the 15 tasks on 1 hosts vhost-2 dist 2
[2019-10-27T06:34:03.884] [168.batch] debug:  Message thread started pid = 95376
[2019-10-27T06:34:03.893] [168.batch] debug:  task NONE plugin loaded
[2019-10-27T06:34:03.899] [168.batch] debug:  Checkpoint plugin loaded: checkpoint/none
[2019-10-27T06:34:03.900] [168.batch] Munge cryptographic signature plugin loaded
[2019-10-27T06:34:03.902] [168.batch] debug:  job_container none plugin loaded
[2019-10-27T06:34:03.902] [168.batch] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
[2019-10-27T06:34:03.905] [168.batch] debug level = 2
[2019-10-27T06:34:03.905] [168.batch] starting 1 tasks
[2019-10-27T06:34:03.906] [168.batch] task 0 (95385) started 2019-10-27T06:34:03
[2019-10-27T06:34:03.906] [168.batch] debug:  task_p_pre_launch_priv: 168.4294967294
[2019-10-27T06:34:03.911] [168.batch] debug:  jag_common_poll_data: Task average frequency = 2400 pid 95385 mem size 1128 204868 time 0.000000(0+0)
[2019-10-27T06:34:03.911] [168.batch] debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
[2019-10-27T06:34:03.912] [168.batch] debug:  task_p_pre_launch: 168.4294967294, task 0
[2019-10-27T06:34:03.912] [168.batch] debug:  Couldn't find SLURM_RLIMIT_MEMLOCK in environment
[2019-10-27T06:34:03.955] [161.0] debug:  _send_srun_resp_msg: 2/5 failed to send msg type 6003: Connection refused
[2019-10-27T06:34:04.356] [161.0] debug:  _send_srun_resp_msg: 3/5 failed to send msg type 6003: Connection refused
[2019-10-27T06:34:04.722] [161.0] debug:  Handling REQUEST_STATE
[2019-10-27T06:34:05.156] [161.0] debug:  _send_srun_resp_msg: 4/5 failed to send msg type 6003: Connection refused
[2019-10-27T06:34:05.723] [161.0] debug:  Handling REQUEST_STATE
[2019-10-27T06:34:05.956] [161.0] debug:  _send_srun_resp_msg: 5/5 failed to send msg type 6003: Connection refused
[2019-10-27T06:34:05.956] [161.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
[2019-10-27T06:34:05.956] [161.0] error: get_exit_code task 0 died by signal
[2019-10-27T06:34:05.960] [161.0] debug:  Message thread exited
[2019-10-27T06:34:05.960] [161.0] done with job
[2019-10-27T06:34:06.724] debug:  Waiting for job 161's prolog to complete
[2019-10-27T06:34:06.724] debug:  Finished wait for job 161's prolog to complete
[2019-10-27T06:34:06.724] debug:  [job 161] attempting to run epilog [/etc/slurm/slurm.epilog.clean]
[2019-10-27T06:34:06.990] [168.batch] task 0 (95385) exited. Killed by signal 9.
[2019-10-27T06:34:06.991] [168.batch] debug:  task_p_post_term: 168.4294967294, task 0
[2019-10-27T06:34:06.994] [168.batch] debug:  step_terminate_monitor_stop signaling condition
[2019-10-27T06:34:06.994] [168.batch] error: get_exit_code task 0 died by signal
[2019-10-27T06:34:06.994] [168.batch] job 168 completed with slurm_rc = 0, job_rc = 9
[2019-10-27T06:34:06.994] [168.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 9
[2019-10-27T06:34:06.996] debug:  completed epilog for jobid 161
[2019-10-27T06:34:06.997] debug:  Job 161: sent epilog complete msg: rc = 0
[2019-10-27T06:34:07.001] [168.batch] debug:  Message thread exited
[2019-10-27T06:34:07.001] [168.batch] done with job
[2019-10-27T06:34:07.002] debug:  _rpc_terminate_job, uid = 400
[2019-10-27T06:34:07.002] debug:  task_p_slurmd_release_resources: 168
[2019-10-27T06:34:07.002] debug:  credential for job 168 revoked
[2019-10-27T06:34:07.003] debug:  Waiting for job 168's prolog to complete
[2019-10-27T06:34:07.003] debug:  Finished wait for job 168's prolog to complete
[2019-10-27T06:34:07.003] debug:  [job 168] attempting to run epilog [/etc/slurm/slurm.epilog.clean]
[2019-10-27T06:34:07.116] debug:  completed epilog for jobid 168
[2019-10-27T06:34:07.117] debug:  Job 168: sent epilog complete msg: rc = 0


