[slurm-users] Running job is canceled when starting a new job from queue

Uwe Seher uwe.seher at gmail.com
Tue Oct 29 10:24:01 UTC 2019


Hi all!

I think I solved the problem.
The system is an openSUSE Leap 15 installation and Slurm comes from the
repository. By default a slurm.epilog.clean script is installed which kills
everything that belongs to the user when a job finishes, including other
jobs, SSH sessions and so on. I do not know whether other distributions do
the same or whether the script is broken, but removing it solved the problem.
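For reference, a safer epilog would only purge a user's leftover processes when that user has no other jobs left on the node. The sketch below illustrates that guard; it is not the shipped slurm.epilog.clean, and the `cleanup_is_safe` helper and the commented epilog usage are illustrative assumptions.

```shell
#!/bin/bash
# Hedged sketch of an epilog guard: only allow killing a user's stray
# processes when they have no other jobs remaining on this node.
# This is NOT the shipped slurm.epilog.clean; names are illustrative.

cleanup_is_safe() {
    remaining_jobs="$1"   # the user's other running/queued jobs on this node
    [ "$remaining_jobs" -eq 0 ]
}

# In a real epilog (running as root with SLURM_JOB_USER set) one might do:
#   remaining=$(squeue --noheader -u "$SLURM_JOB_USER" -w "$(hostname -s)" | wc -l)
#   cleanup_is_safe "$remaining" && pkill -KILL -u "$SLURM_JOB_USER"
```

Alternatively, instead of deleting the script, the `Epilog=` line in slurm.conf can be commented out so the node runs no epilog at all.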

Thank you!
Uwe Seher

Am Mo., 28. Okt. 2019 um 15:47 Uhr schrieb Uwe Seher <uwe.seher at gmail.com>:

> Hello!
> I cannot find any hints of OOM kills, but it is systemd, so I may need a
> little more time to search. We have 128 GB of memory on the node and, as
> far as we know, the tasks do not use it to the limit; dependencies have
> also worked fine with the same tasks. Monitoring does not show any memory
> problems. The tasks run without a time limit, so that should not be the
> reason.
>
> Thank you for the moment; when I get some more information I'll get back
> here.
> Uwe Seher
>
>
> Am Mo., 28. Okt. 2019 um 14:06 Uhr schrieb Lech Nieroda <
> lech.nieroda at uni-koeln.de>:
>
>> Hello Uwe,
>>
>> when the requested time limit of a job runs out, the job is cancelled:
>> it is terminated with SIGTERM (15) and later with SIGKILL (9) should that
>> fail, and the job gets the state „TIMEOUT“.
>> However, job 161 was killed immediately by SIGKILL and got the state
>> „FAILED“. That suggests it was not due to a timeout but to something
>> external. It might have been an out-of-memory kill by the system. Did the
>> syslog contain any clues?
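Checking the kernel log for OOM-killer activity means grepping for its characteristic messages; the small filter below is a sketch (the `oom_lines` helper is illustrative, and the exact message wording varies by kernel version):

```shell
#!/bin/bash
# Filter log lines for typical kernel OOM-killer messages.
# The patterns are an assumption; exact wording varies by kernel version.
oom_lines() {
    grep -iE 'out of memory|oom-killer|killed process'
}

# Typical usage on the compute node (commands assume root access):
#   dmesg -T | oom_lines
#   journalctl -k | oom_lines
#   oom_lines < /var/log/messages
```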
>>
>> Kind regards,
>> Lech
>>
>> > Am 28.10.2019 um 13:33 schrieb Uwe Seher <uwe.seher at gmail.com>:
>> >
>> > Hello group!
>> > While running our first jobs I got a strange issue while running
>> > multiple jobs on a single partition.
>> > The partition is a single node with 32 cores and 128 GB of memory.
>> > There is a queue with three jobs, each of which should use 15 cores;
>> > memory usage is not important. As planned, two jobs are running and
>> > sharing the node as expected (jobs 160 and 161 in the log below) and
>> > one is waiting (168). After the first job completes, job 168 starts as
>> > expected. But after that the other running job, 161, is terminated with
>> > exit code 9 (ran out of CPU time). In the end the newly started job 168
>> > is also terminated with exit code 9. On another node the same happens,
>> > but there the newly started job runs as expected.
>> > I suspect a problem in freeing the resources (here: cores), but I have
>> > no clue how to avoid this issue. The logs from below and the slurmd.log
>> > of the node are also in the attachment.
>> >
>> > Thank you in advance
>> > Uwe Seher
>> >
>> >
>> > slurmctld.log:
>> > [2019-10-27T06:33:27.735] debug:  sched: Running job scheduler
>> > [2019-10-27T06:33:54.970] debug:  backfill: beginning
>> > [2019-10-27T06:33:54.970] debug:  backfill: 1 jobs to backfill
>> > [2019-10-27T06:34:02.328] _job_complete: JobID=160 State=0x1 NodeCnt=1
>> WEXITSTATUS 0
>> > [2019-10-27T06:34:02.328] email msg to bla at blubb.com: SLURM Job_id=160
>> Name=1805-Modell-v201 Ended, Run time 1-17:46:26, COMPLETED, ExitCode 0
>> > [2019-10-27T06:34:02.331] _job_complete: JobID=160 State=0x8003
>> NodeCnt=1 done
>> > [2019-10-27T06:34:03.665] debug:  sched: Running job scheduler
>> > [2019-10-27T06:34:03.665] email msg to bla at blubb.com: SLURM Job_id=168
>> Name=1805-Modell-v206 Began, Queued time 1-17:00:19
>> > [2019-10-27T06:34:03.667] sched: Allocate JobID=168 NodeList=vhost-2
>> #CPUs=15 Partition=vh2
>> > [2019-10-27T06:34:03.708] _job_complete: JobID=161 State=0x1 NodeCnt=1
>> WTERMSIG 9
>> > [2019-10-27T06:34:03.709] email msg to bla at blubb.com: SLURM Job_id=161
>> Name=1805-Modell-v202 Failed, Run time 1-17:46:15, FAILED
>> > [2019-10-27T06:34:03.710] _job_complete: JobID=161 State=0x8005
>> NodeCnt=1 done
>> > [2019-10-27T06:34:06.999] debug:  sched: Running job scheduler
>> > [2019-10-27T06:34:06.999] _job_complete: JobID=168 State=0x1 NodeCnt=1
>> WTERMSIG 9
>> > [2019-10-27T06:34:07.000] email msg to bla at blubb.com: SLURM Job_id=168
>> Name=1805-Modell-v206 Failed, Run time 00:00:03, FAILED
>> > [2019-10-27T06:34:07.001] _job_complete: JobID=168 State=0x8005
>> NodeCnt=1 done
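The State=0x... values in the log combine a base job state with flag bits. The decoder below is a sketch; the state numbering follows slurm.h and is an assumption that may vary across Slurm versions.

```shell
#!/bin/bash
# Hedged sketch: decode Slurm's State=0x... values as seen in slurmctld.log.
# Base states per slurm.h (assumed; verify against your Slurm version):
#   0 PENDING, 1 RUNNING, 2 SUSPENDED, 3 COMPLETE, 4 CANCELLED,
#   5 FAILED, 6 TIMEOUT, 7 NODE_FAIL; flag 0x8000 = COMPLETING.
decode_state() {
    v=$(( $1 ))
    base=$(( v & 0xff ))
    case $base in
        0) name=PENDING ;;   1) name=RUNNING ;;   2) name=SUSPENDED ;;
        3) name=COMPLETE ;;  4) name=CANCELLED ;; 5) name=FAILED ;;
        6) name=TIMEOUT ;;   7) name=NODE_FAIL ;; *) name=UNKNOWN ;;
    esac
    if [ $(( v & 0x8000 )) -ne 0 ]; then
        echo "${name}+COMPLETING"
    else
        echo "$name"
    fi
}

# Under these assumptions, State=0x8003 for job 160 reads COMPLETE+COMPLETING,
# while State=0x8005 for jobs 161 and 168 reads FAILED+COMPLETING.
```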
>> >
>> > slurm.conf:
>> > # SCHEDULING
>> > SchedulerType=sched/backfill
>> > #SchedulerAuth=
>> > #SelectType=select/linear
>> > SelectType=select/cons_res
>> > SelectTypeParameters=CR_Core
>> > FastSchedule=1
>> > OverTimeLimit=UNLIMITED
>> >
>> > <2019-10-28_logfiles.txt>
>>
>>
>>

