<div dir="ltr"><div>Hi all!</div><div><br></div><div>I think I solved the problem.<br></div><div>The system is an openSUSE Leap 15 installation and Slurm comes from the repository. By default a slurm.epilog.clean script is installed which, when a job finishes, kills every process belonging to the user, including other running jobs, SSH sessions, and so on. I do not know whether other distributions do the same or whether the script is broken, but removing it solved the problem.</div><div><br></div><div>Thank you!</div><div>Uwe Seher <br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, 28 Oct 2019 at 15:47, Uwe Seher <<a href="mailto:uwe.seher@gmail.com">uwe.seher@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Hello!</div><div>I cannot find any hints of OOM kills, but it is systemd, so I may need a little more time to search. The node has 128 GB of memory and, as far as we know, the tasks do not use it to the limit; dependencies have also worked fine with the same tasks. Monitoring does not show any memory problems. The tasks are running without a time limit, so that should not be the reason. <br></div><div><br></div><div>Thank you for the moment; when I get some more information I'll get back here.</div><div>Uwe Seher<br></div><div> <br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, 28 Oct 2019 at 14:06, Lech Nieroda <<a href="mailto:lech.nieroda@uni-koeln.de" target="_blank">lech.nieroda@uni-koeln.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hello Uwe,<br>
<br>
when the requested time limit of a job runs out, the job is cancelled: it is first sent SIGTERM (15), then SIGKILL (9) if that fails, and the job gets the state "TIMEOUT".<br>
Job 161, however, was killed immediately with SIGKILL and got the state "FAILED". That suggests it wasn't a timeout but something external. It might have been an out-of-memory kill by the system. Did the syslog contain any clues?<br>
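A quick way to look for such OOM kills is to grep the kernel log. A minimal sketch; the sample log line below is invented for illustration, and on a real node the input would come from `dmesg -T` or `journalctl -k` instead of a variable:<br>

```shell
# Hypothetical kernel log entry (illustration only); real entries come
# from `dmesg -T` or `journalctl -k` on the affected node.
log_line="Oct 27 06:34:03 vhost-2 kernel: Out of memory: Killed process 4242 (model) total-vm:131072000kB"

# The kernel OOM killer typically logs "Out of memory" and "Killed process",
# so this pattern catches both phrasings.
echo "$log_line" | grep -i -E "out of memory|killed process"
```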
<br>
Kind regards,<br>
Lech<br>
<br>
> On 28.10.2019 at 13:33, Uwe Seher <<a href="mailto:uwe.seher@gmail.com" target="_blank">uwe.seher@gmail.com</a>> wrote:<br>
> <br>
> Hello group!<br>
> While running our first jobs I hit a strange issue with multiple jobs on a single partition.<br>
> The partition is a single node with 32 cores and 128 GB of memory. There is a queue with three jobs, each of which should use 15 cores; memory usage is not important. As planned, two jobs are running and sharing the node as expected (jobs 160 and 161 in the log below) and one is waiting (168). After the first job completes, job 168 starts as expected. But after that the other running job, 161, is terminated with exit code 9 ("Ran out of CPU time"). In the end the newly started job 168 is also terminated with exit code 9. On another node the same happens, but there the newly started job runs as expected. <br>
> I suspect a problem in freeing the resources (here: cores), but I have no clue how to avoid this issue. The logs below and the slurmd.log of the node are also in the attachment.<br>
> <br>
> Thank you in advance<br>
> Uwe Seher<br>
> <br>
> <br>
> slurmctld.log:<br>
> [2019-10-27T06:33:27.735] debug: sched: Running job scheduler<br>
> [2019-10-27T06:33:54.970] debug: backfill: beginning<br>
> [2019-10-27T06:33:54.970] debug: backfill: 1 jobs to backfill<br>
> [2019-10-27T06:34:02.328] _job_complete: JobID=160 State=0x1 NodeCnt=1 WEXITSTATUS 0<br>
> [2019-10-27T06:34:02.328] email msg to <a href="mailto:bla@blubb.com" target="_blank">bla@blubb.com</a>: SLURM Job_id=160 Name=1805-Modell-v201 Ended, Run time 1-17:46:26, COMPLETED, ExitCode 0<br>
> [2019-10-27T06:34:02.331] _job_complete: JobID=160 State=0x8003 NodeCnt=1 done<br>
> [2019-10-27T06:34:03.665] debug: sched: Running job scheduler<br>
> [2019-10-27T06:34:03.665] email msg to <a href="mailto:bla@blubb.com" target="_blank">bla@blubb.com</a>: SLURM Job_id=168 Name=1805-Modell-v206 Began, Queued time 1-17:00:19<br>
> [2019-10-27T06:34:03.667] sched: Allocate JobID=168 NodeList=vhost-2 #CPUs=15 Partition=vh2<br>
> [2019-10-27T06:34:03.708] _job_complete: JobID=161 State=0x1 NodeCnt=1 WTERMSIG 9<br>
> [2019-10-27T06:34:03.709] email msg to <a href="mailto:bla@blubb.com" target="_blank">bla@blubb.com</a>: SLURM Job_id=161 Name=1805-Modell-v202 Failed, Run time 1-17:46:15, FAILED<br>
> [2019-10-27T06:34:03.710] _job_complete: JobID=161 State=0x8005 NodeCnt=1 done<br>
> [2019-10-27T06:34:06.999] debug: sched: Running job scheduler<br>
> [2019-10-27T06:34:06.999] _job_complete: JobID=168 State=0x1 NodeCnt=1 WTERMSIG 9<br>
> [2019-10-27T06:34:07.000] email msg to <a href="mailto:bla@blubb.com" target="_blank">bla@blubb.com</a>: SLURM Job_id=168 Name=1805-Modell-v206 Failed, Run time 00:00:03, FAILED<br>
> [2019-10-27T06:34:07.001] _job_complete: JobID=168 State=0x8005 NodeCnt=1 done<br>
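As an aside, the WTERMSIG 9 in these log lines is the number of the signal that terminated the job; the shell can map it to a name. A tiny sketch, nothing Slurm-specific:<br>

```shell
# Map signal number 9 (as reported by "WTERMSIG 9" in the slurmctld log
# above) to its symbolic name using the shell's kill builtin.
kill -l 9   # prints: KILL
```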
> <br>
> slurm.conf:<br>
> # SCHEDULING<br>
> SchedulerType=sched/backfill<br>
> #SchedulerAuth=<br>
> #SelectType=select/linear<br>
> SelectType=select/cons_res<br>
> SelectTypeParameters=CR_Core<br>
> FastSchedule=1<br>
> OverTimeLimit=UNLIMITED<br>
> <br>
> <2019-10-28_logfiles.txt><br>
<br>
<br>
</blockquote></div>
</blockquote></div>