[slurm-users] Running job is canceled when starting a new job from queue

Mon Oct 28 14:47:07 UTC 2019

Hello!
I cannot find any hints of OOM kills, but since this is systemd I may need a
little more time searching. We have 128 GB of memory on the node and, as far
as we know, the tasks do not come close to that limit; job dependencies have
also worked fine with the same tasks. Monitoring does not show any problems
with memory. The tasks run without a time limit, so that should not be the
reason either.
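
For the search itself, a minimal sketch of a filter for OOM-killer messages in
kernel log text. The pattern list is an assumption (the exact wording differs
between kernel versions), and the sample lines are made up for illustration;
they are not from our node.

```python
import re

# Substrings the Linux kernel commonly logs around an OOM kill. The wording
# varies between kernel versions, so this list is an assumption.
OOM_PATTERN = re.compile(r"oom-killer|out of memory|killed process", re.IGNORECASE)


def find_oom_lines(log_text):
    """Return the lines of log_text that look like OOM-killer messages."""
    return [line for line in log_text.splitlines() if OOM_PATTERN.search(line)]


# Made-up sample lines, for illustration only.
SAMPLE = """\
Jan  1 00:00:00 examplehost kernel: worker invoked oom-killer: order=0
Jan  1 00:00:00 examplehost kernel: Out of memory: Killed process 4242 (worker)
Jan  1 00:00:01 examplehost systemd[1]: Started Session 12 of user example.
"""

if __name__ == "__main__":
    for line in find_oom_lines(SAMPLE):
        print(line)
```

The function can be fed the saved output of `journalctl -k` or `dmesg` from
the node instead of SAMPLE.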

Thank you for now; when I get some more information I'll get back
here.
Uwe Seher

On Mon, 28 Oct 2019 at 14:06, Lech Nieroda <lech.nieroda at uni-koeln.de> wrote:

> Hello Uwe,
>
> when the requested time limit of a job runs out, the job is cancelled: it
> is sent SIGTERM (15) and, if that fails, SIGKILL (9) later on, and the job
> gets the state "TIMEOUT".
> However, job 161 was killed immediately by SIGKILL and got the state
> "FAILED". That suggests that it wasn't due to a timeout but to something
> external. It might have been an out-of-memory kill by the system. Did the
> syslog contain any clues?
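
The difference between a normal exit and a kill by signal can be reproduced
outside Slurm. The following is a minimal sketch, assuming a POSIX system; the
function names are made up for illustration and the output only mimics the
WEXITSTATUS/WTERMSIG wording of the slurmctld.log quoted below.

```python
import signal
import subprocess
import sys


def describe_returncode(rc):
    """Describe a subprocess return code in WEXITSTATUS/WTERMSIG terms."""
    if rc < 0:
        # On POSIX, subprocess reports death by signal as the negated signal number.
        sig = signal.Signals(-rc)
        return f"WTERMSIG {sig.value} ({sig.name})"
    return f"WEXITSTATUS {rc}"


def run_and_kill():
    """Start a sleeping child, SIGKILL it at once, and return its return code."""
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    child.kill()  # sends SIGKILL on POSIX
    return child.wait()


if __name__ == "__main__":
    print(describe_returncode(0))
    print(describe_returncode(run_and_kill()))
```

A process that receives SIGKILL cannot catch it or clean up, which is why the
job ends without a regular exit code.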
>
> Kind regards,
> Lech
>
> > On 28.10.2019 at 13:33, Uwe Seher <uwe.seher at gmail.com> wrote:
> >
> > Hello group!
> > While running our first jobs I hit a strange issue when running multiple
> > jobs on a single partition.
> > The partition is a single node with 32 cores and 128 GB of memory. There
> > is a queue with three jobs, each of which should use 15 cores; memory
> > usage is not important. As planned, two jobs are running and sharing the
> > node as expected (jobs 160 and 161 in the log below) and one is waiting
> > (168). After the first job completes, job 168 starts as expected. But
> > after that the other running job, 161, is terminated with exit code 9
> > ("Ran out of CPU time"). In the end the newly started job 168 is also
> > terminated with exit code 9. On another node the same happens, but there
> > the newly started job runs as expected.
> > I suspect that there is a problem in freeing the resources (here: cores),
> > but I have no clue how to avoid this issue. The logs below and the
> > slurmd.log of the node are also in the attachment.
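
The expected placement follows from simple core arithmetic. A rough sketch,
counting cores only and ignoring memory and any other resource the scheduler
may track (the function name is made up for illustration):

```python
def jobs_that_fit(total_cores, cores_per_job):
    """Number of equally sized jobs that fit on the node at the same time."""
    return total_cores // cores_per_job


TOTAL_CORES = 32    # cores on the node, as stated in the post
CORES_PER_JOB = 15  # cores requested by each of the three jobs

running = jobs_that_fit(TOTAL_CORES, CORES_PER_JOB)
idle_cores = TOTAL_CORES - running * CORES_PER_JOB

if __name__ == "__main__":
    print(f"{running} jobs run at once, {idle_cores} cores stay idle")
```

With these numbers the third job can only start once one of the two running
jobs has released its 15 cores.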
> >
> > Thank you in advance
> > Uwe Seher
> >
> >
> > slurmctld.log:
> > [2019-10-27T06:33:27.735] debug:  sched: Running job scheduler
> > [2019-10-27T06:33:54.970] debug:  backfill: beginning
> > [2019-10-27T06:33:54.970] debug:  backfill: 1 jobs to backfill
> > [2019-10-27T06:34:02.328] _job_complete: JobID=160 State=0x1 NodeCnt=1 WEXITSTATUS 0
> > [2019-10-27T06:34:02.328] email msg to bla at blubb.com: SLURM Job_id=160 Name=1805-Modell-v201 Ended, Run time 1-17:46:26, COMPLETED, ExitCode 0
> > [2019-10-27T06:34:02.331] _job_complete: JobID=160 State=0x8003 NodeCnt=1 done
> > [2019-10-27T06:34:03.665] debug:  sched: Running job scheduler
> > [2019-10-27T06:34:03.665] email msg to bla at blubb.com: SLURM Job_id=168 Name=1805-Modell-v206 Began, Queued time 1-17:00:19
> > [2019-10-27T06:34:03.667] sched: Allocate JobID=168 NodeList=vhost-2 #CPUs=15 Partition=vh2
> > [2019-10-27T06:34:03.708] _job_complete: JobID=161 State=0x1 NodeCnt=1 WTERMSIG 9
> > [2019-10-27T06:34:03.709] email msg to bla at blubb.com: SLURM Job_id=161 Name=1805-Modell-v202 Failed, Run time 1-17:46:15, FAILED
> > [2019-10-27T06:34:03.710] _job_complete: JobID=161 State=0x8005 NodeCnt=1 done
> > [2019-10-27T06:34:06.999] debug:  sched: Running job scheduler
> > [2019-10-27T06:34:06.999] _job_complete: JobID=168 State=0x1 NodeCnt=1 WTERMSIG 9
> > [2019-10-27T06:34:07.000] email msg to bla at blubb.com: SLURM Job_id=168 Name=1805-Modell-v206 Failed, Run time 00:00:03, FAILED
> > [2019-10-27T06:34:07.001] _job_complete: JobID=168 State=0x8005 NodeCnt=1 done
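
To see at a glance how each job ended, the `_job_complete` lines can be pulled
out of the log. This is a minimal sketch: the regular expression is an
assumption based only on the lines quoted in this thread, and lines wrapped by
the mail client have to be rejoined before parsing.

```python
import re

# Matches lines such as
# "[2019-10-27T06:34:03.708] _job_complete: JobID=161 State=0x1 NodeCnt=1 WTERMSIG 9"
LINE = re.compile(
    r"\[(?P<time>[^\]]+)\] _job_complete: JobID=(?P<job>\d+) .*?"
    r"(?P<kind>WEXITSTATUS|WTERMSIG) (?P<value>\d+)"
)


def job_endings(log_text):
    """Return (time, job id, kind, value) for every job-ending log line."""
    return [
        (m.group("time"), int(m.group("job")), m.group("kind"), int(m.group("value")))
        for m in LINE.finditer(log_text)
    ]


# Lines copied from the slurmctld.log in this thread, with wraps rejoined.
LOG = """\
[2019-10-27T06:34:02.328] _job_complete: JobID=160 State=0x1 NodeCnt=1 WEXITSTATUS 0
[2019-10-27T06:34:02.331] _job_complete: JobID=160 State=0x8003 NodeCnt=1 done
[2019-10-27T06:34:03.708] _job_complete: JobID=161 State=0x1 NodeCnt=1 WTERMSIG 9
[2019-10-27T06:34:03.710] _job_complete: JobID=161 State=0x8005 NodeCnt=1 done
[2019-10-27T06:34:06.999] _job_complete: JobID=168 State=0x1 NodeCnt=1 WTERMSIG 9
[2019-10-27T06:34:07.001] _job_complete: JobID=168 State=0x8005 NodeCnt=1 done
"""

if __name__ == "__main__":
    for time, job, kind, value in job_endings(LOG):
        print(f"{time} job {job}: {kind} {value}")
```

The "done" lines carry no WEXITSTATUS or WTERMSIG field and are skipped, so
each job appears once.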
> >
> > slurm.conf:
> > # SCHEDULING
> > SchedulerType=sched/backfill
> > #SchedulerAuth=
> > #SelectType=select/linear
> > SelectType=select/cons_res
> > SelectTypeParameters=CR_Core
> > FastSchedule=1
> > OverTimeLimit=UNLIMITED
> >
> > <2019-10-28_logfiles.txt>
>
>
>