[slurm-users] Running job is canceled when starting a new job from queue

Mon Oct 28 13:03:50 UTC 2019

Hello Uwe,

When the requested time limit of a job runs out, the job is cancelled: it is sent SIGTERM (15) first and, if it has still not exited, SIGKILL (9) later on. The job then gets the state "TIMEOUT".
However, job 161 was killed immediately by SIGKILL and got the state "FAILED". That suggests that it wasn't due to a timeout but to something external. It might have been an out-of-memory kill by the system. Did the syslog contain any clues?
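
To make the distinction between a signal and an exit code easier to see in the controller log, here is a minimal Python sketch. It assumes a POSIX host and that the "_job_complete" lines look exactly like the ones quoted below; the function name describe_job_end and the regular expression are my own and not part of Slurm.

```python
import re
import signal

# Assumed layout of the "_job_complete" lines, taken from the quoted slurmctld.log.
# WEXITSTATUS means the job exited on its own; WTERMSIG means a signal ended it.
PATTERN = re.compile(
    r"_job_complete: JobID=(\d+) State=\S+ NodeCnt=\d+ (WEXITSTATUS|WTERMSIG) (\d+)"
)


def describe_job_end(line):
    """Return a short description of how a job ended, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    job_id, kind, value = match.group(1), match.group(2), int(match.group(3))
    if kind == "WTERMSIG":
        try:
            name = signal.Signals(value).name  # signal number to name on this host
        except ValueError:
            name = "unknown"
        return f"job {job_id}: killed by signal {value} ({name})"
    return f"job {job_id}: exited with code {value}"


# Lines copied from the log in the original message.
LOG = """\
[2019-10-27T06:34:02.328] _job_complete: JobID=160 State=0x1 NodeCnt=1 WEXITSTATUS 0
[2019-10-27T06:34:03.708] _job_complete: JobID=161 State=0x1 NodeCnt=1 WTERMSIG 9
[2019-10-27T06:34:06.999] _job_complete: JobID=168 State=0x1 NodeCnt=1 WTERMSIG 9
"""

for log_line in LOG.splitlines():
    description = describe_job_end(log_line)
    if description is not None:
        print(description)
```

On a Linux host this should report job 160 as a normal exit and jobs 161 and 168 as killed by signal 9, i.e. the 9 in the log is a signal number (SIGKILL) rather than an exit code.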

Kind regards,
Lech

> On 28.10.2019 at 13:33, Uwe Seher <uwe.seher at gmail.com> wrote:
> 
> Hello group!
> While running our first jobs I got a strange issue when running multiple jobs on a single partition.
> The partition is a single node with 32 cores and 128 GB of memory. There is a queue with three jobs, each of which should use 15 cores; memory usage is not important. As planned, two jobs are running and sharing the node (jobs 160 and 161 in the log below) and one is waiting (168). After the first job has completed, job 168 starts as expected. But after that the other running job, 161, is terminated with exit code 9 ( Ran out of CPU time ). In the end the newly started job 168 is also terminated with exit code 9. On another node the same happens, but there the newly started job runs as expected.
> I suspect that there is a problem with freeing the resources (here: cores), but I have no clue how to avoid this issue. The logs below and the slurmd.log of the node are also in the attachment.
> 
> Thank you in advance
> Uwe Seher
> 
> 
> slurmctld.log:
> [2019-10-27T06:33:27.735] debug:  sched: Running job scheduler
> [2019-10-27T06:33:54.970] debug:  backfill: beginning
> [2019-10-27T06:33:54.970] debug:  backfill: 1 jobs to backfill
> [2019-10-27T06:34:02.328] _job_complete: JobID=160 State=0x1 NodeCnt=1 WEXITSTATUS 0
> [2019-10-27T06:34:02.328] email msg to bla at blubb.com: SLURM Job_id=160 Name=1805-Modell-v201 Ended, Run time 1-17:46:26, COMPLETED, ExitCode 0
> [2019-10-27T06:34:02.331] _job_complete: JobID=160 State=0x8003 NodeCnt=1 done
> [2019-10-27T06:34:03.665] debug:  sched: Running job scheduler
> [2019-10-27T06:34:03.665] email msg to bla at blubb.com: SLURM Job_id=168 Name=1805-Modell-v206 Began, Queued time 1-17:00:19
> [2019-10-27T06:34:03.667] sched: Allocate JobID=168 NodeList=vhost-2 #CPUs=15 Partition=vh2
> [2019-10-27T06:34:03.708] _job_complete: JobID=161 State=0x1 NodeCnt=1 WTERMSIG 9
> [2019-10-27T06:34:03.709] email msg to bla at blubb.com: SLURM Job_id=161 Name=1805-Modell-v202 Failed, Run time 1-17:46:15, FAILED
> [2019-10-27T06:34:03.710] _job_complete: JobID=161 State=0x8005 NodeCnt=1 done
> [2019-10-27T06:34:06.999] debug:  sched: Running job scheduler
> [2019-10-27T06:34:06.999] _job_complete: JobID=168 State=0x1 NodeCnt=1 WTERMSIG 9
> [2019-10-27T06:34:07.000] email msg to bla at blubb.com: SLURM Job_id=168 Name=1805-Modell-v206 Failed, Run time 00:00:03, FAILED
> [2019-10-27T06:34:07.001] _job_complete: JobID=168 State=0x8005 NodeCnt=1 done
> 
> slurm.conf:
> # SCHEDULING
> SchedulerType=sched/backfill
> #SchedulerAuth=
> #SelectType=select/linear
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core
> FastSchedule=1
> OverTimeLimit=UNLIMITED
> 
> <2019-10-28_logfiles.txt>