[slurm-users] WTERMSIG 15
Yair Yarom
irush at cs.huji.ac.il
Wed Dec 1 09:25:13 UTC 2021
I guess they won't be killed, but having them there could cause other
issues, e.g. any limit that systemd places on the slurmd service will also be
applied to the jobs, and probably cumulatively.
Do you use cgroups for Slurm resource management (the TaskPlugin)? If so,
it means this is not working properly.
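For reference, "using cgroup" here usually means something along these lines
in slurm.conf (only a sketch; the exact plugins and constraints depend on the
site):

    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup,task/affinity

plus the matching constraints (ConstrainCores=yes, ConstrainRAMSpace=yes, ...)
in cgroup.conf.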
We have a lot of customization here, so I can't be sure what change you
need exactly. We have the default KillMode (control-group), and
Delegate=true.
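In unit terms that is roughly (a sketch; the packaged slurmd.service on your
distributions may differ):

    [Service]
    Delegate=yes
    # KillMode not set, i.e. systemd's default (control-group) applies

Delegate=yes tells systemd not to touch the cgroup sub-hierarchy that slurmd
sets up for the jobs.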
On Tue, Nov 30, 2021 at 2:00 PM LEROY Christine 208562 <
Christine.LEROY2 at cea.fr> wrote:
> Hi,
>
>
>
> Thanks for your feedback.
>
> It seems we are in the first case, but looking deeper: on the SL7 nodes we
> didn’t encounter the problem, thanks to this service configuration (*).
>
> So the solution seems to be to configure KillMode=process as mentioned
> there (**): we will still have jobs listed when doing a 'systemctl status
> slurmd.service', but they won’t be killed; is that right? (A sketch of such
> an override is further below, after the grep output.)
>
>
>
> Thanks in advance,
>
> Christine
>
> (**)
>
> https://slurm.schedmd.com/programmer_guide.html
>
> (*)
>
> grep -i killmode /lib/systemd/system/slurmd.service
>
> KillMode=process
>
>
>
> Instead of (on the Ubuntu nodes):
>
> KillMode=control-group
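>
> In case it is useful: one way to get KillMode=process on the Ubuntu nodes
> without editing the packaged unit would be a systemd drop-in, roughly
> (untested sketch):
>
> # systemctl edit slurmd
>
> and in the override.conf it opens, add:
>
> [Service]
> KillMode=process
>
> (followed by systemctl daemon-reload if it isn't reloaded automatically).
> With that, on a restart only the slurmd process itself is signalled and the
> job processes are left alone.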
>
>
>
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> *On behalf of*
> Yair Yarom
> *Sent:* Tuesday, November 30, 2021 08:50
> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
> *Subject:* Re: [slurm-users] WTERMSIG 15
>
>
>
> Hi,
>
>
>
> There were two cases where this happened to us as well:
>
> 1. The systemd slurmd.service wasn't configured properly, so the jobs ran
> under the slurmd service's cgroup. When slurmd is restarted, systemd then
> sends a signal to all of those processes. You can check whether this is the
> case with 'systemctl status slurmd.service' - the jobs shouldn't be listed
> there.
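>
> A quick extra check (just a sketch): take the PID of a process belonging to
> a running job on the node and look at its cgroup, e.g.
>
> cat /proc/<job-pid>/cgroup
>
> If it points at slurmd.service, the job is inside the service's control
> group and a restart with KillMode=control-group will signal it; as above,
> with a proper setup the jobs shouldn't appear under slurmd.service.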
>
> 2. Changing the partitions: since jobs here are submitted to most partitions
> by default, removing partitions, or removing nodes from partitions, can cause
> the jobs in the affected partitions to be killed.
>
>
>
> HTH,
>
>
>
>
>
> On Mon, Nov 29, 2021 at 6:46 PM LEROY Christine 208562 <
> Christine.LEROY2 at cea.fr> wrote:
>
> Hello all,
>
>
>
> I made some modifications to my slurm.conf and then restarted slurmctld on
> the master and slurmd on the nodes.
>
> During this process I lost some jobs (*); curiously, all of these jobs were
> on Ubuntu nodes.
>
> These jobs were fine with respect to their consumed resources (**).
>
>
>
> Any idea what could be the problem?
>
> Thanks in advance
>
> Best regards,
>
> Christine Leroy
>
>
>
>
>
> (*)
>
> [2021-11-29T14:17:09.205] error: Node xxx appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
> [2021-11-29T14:17:10.162] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
> [2021-11-29T14:17:42.223] _job_complete: JobId=4546 WTERMSIG 15
> [2021-11-29T14:17:42.223] _job_complete: JobId=4546 done
> [2021-11-29T14:17:42.224] _job_complete: JobId=4666 WTERMSIG 15
> [2021-11-29T14:17:42.224] _job_complete: JobId=4666 done
> [2021-11-29T14:17:42.236] _job_complete: JobId=4665 WTERMSIG 15
> [2021-11-29T14:17:42.236] _job_complete: JobId=4665 done
> [2021-11-29T14:17:46.072] _job_complete: JobId=4533 WTERMSIG 15
> [2021-11-29T14:17:46.072] _job_complete: JobId=4533 done
> [2021-11-29T14:17:59.005] _job_complete: JobId=4664 WTERMSIG 15
> [2021-11-29T14:17:59.005] _job_complete: JobId=4664 done
> [2021-11-29T14:17:59.006] _job_complete: JobId=4663 WTERMSIG 15
> [2021-11-29T14:17:59.007] _job_complete: JobId=4663 done
> [2021-11-29T14:17:59.021] _job_complete: JobId=4539 WTERMSIG 15
> [2021-11-29T14:17:59.021] _job_complete: JobId=4539 done
>
>
>
>
>
> (**)
>
> # sacct --format=JobID,JobName,ReqCPUS,ReqMem,Start,State,CPUTime,MaxRSS | grep -f /tmp/job15
>
> JobID        JobName    ReqCPUS ReqMem Start               State     CPUTime     MaxRSS
> 4533         xterm            1 16Gn   2021-11-24T16:31:32 FAILED    4-21:46:14
> 4533.batch   batch            1 16Gn   2021-11-24T16:31:32 CANCELLED 4-21:46:14  8893664K
> 4533.extern  extern           1 16Gn   2021-11-24T16:31:32 COMPLETED 4-21:46:11  0
> 4539         xterm           16 16Gn   2021-11-24T16:34:25 FAILED    78-11:37:04
> 4539.batch   batch           16 16Gn   2021-11-24T16:34:25 CANCELLED 78-11:37:04 23781384K
> 4539.extern  extern          16 16Gn   2021-11-24T16:34:25 COMPLETED 78-11:32:48 0
> 4546         xterm           16 16Gn   2021-11-24T17:17:54 FAILED    77-23:56:48
> 4546.batch   batch           16 16Gn   2021-11-24T17:17:54 CANCELLED 77-23:56:48 18541468K
> 4546.extern  extern          16 16Gn   2021-11-24T17:17:54 COMPLETED 77-23:56:00 0
> 4663         xterm            1 12Gn   2021-11-26T16:51:12 FAILED    2-21:26:47
> 4663.batch   batch            1 12Gn   2021-11-26T16:51:12 CANCELLED 2-21:26:47  2275232K
> 4663.extern  extern           1 12Gn   2021-11-26T16:51:12 COMPLETED 2-21:26:34  0
> 4664         xterm            1 12Gn   2021-11-26T17:13:42 FAILED    2-21:04:17
> 4664.batch   batch            1 12Gn   2021-11-26T17:13:42 CANCELLED 2-21:04:17  1484036K
> 4664.extern  extern           1 12Gn   2021-11-26T17:13:42 COMPLETED 2-21:04:17  0
> 4665         xterm            1 8Gn    2021-11-26T17:18:12 FAILED    2-20:59:30
> 4665.batch   batch            1 8Gn    2021-11-26T17:18:12 CANCELLED 2-20:59:30  1159140K
> 4665.extern  extern           1 8Gn    2021-11-26T17:18:12 COMPLETED 2-20:59:27  0
> 4666         xterm            1 8Gn    2021-11-26T17:22:12 FAILED    2-20:55:30
> 4666.batch   batch            1 8Gn    2021-11-26T17:22:12 CANCELLED 2-20:55:30  2090708K
> 4666.extern  extern           1 8Gn    2021-11-26T17:22:12 COMPLETED 2-20:55:27  0
> 4711         xterm            4 3Gn    2021-11-29T14:47:09 FAILED    00:20:08
> 4711.batch   batch            4 3Gn    2021-11-29T14:47:09 CANCELLED 00:20:08    37208K
> 4711.extern  extern           4 3Gn    2021-11-29T14:47:09 COMPLETED 00:20:00    0
> 4714         deckbuild       10 30Gn   2021-11-29T14:51:46 FAILED    00:05:20
> 4714.batch   batch           10 30Gn   2021-11-29T14:51:46 CANCELLED 00:05:20    4036K
> 4714.extern  extern          10 30Gn   2021-11-29T14:51:46 COMPLETED 00:05:10    0
>
>
>
> --
>
> /| |
>
> \/ | *Yair Yarom *| System Group (DevOps)
>
> [] | *The Rachel and Selim Benin School*
>
> [] /\ | *of Computer Science and Engineering*
>
> []//\\/ | The Hebrew University of Jerusalem
>
> [// \\ | T +972-2-5494522 | F +972-2-5494522
>
> // \ | irush at cs.huji.ac.il
>
> // |
>
>
--
/| |
\/ | Yair Yarom | System Group (DevOps)
[] | The Rachel and Selim Benin School
[] /\ | of Computer Science and Engineering
[]//\\/ | The Hebrew University of Jerusalem
[// \\ | T +972-2-5494522 | F +972-2-5494522
// \ | irush at cs.huji.ac.il
// |