[slurm-users] WTERMSIG 15

LEROY Christine 208562 Christine.LEROY2 at cea.fr
Tue Nov 30 11:57:53 UTC 2021


Hi,

Thanks for your feedback.
It seems we are in the first case. Looking deeper, though: on the SL7 nodes we didn't encounter the problem, thanks to this service configuration (*).
So the solution seems to be to configure KillMode=process as mentioned there (**): we will still have jobs listed when doing a 'systemctl status slurmd.service', but they won't be killed; is that right?

Thanks in advance,
Christine
(**)
https://slurm.schedmd.com/programmer_guide.html
(*)
grep -i killmode /lib/systemd/system/slurmd.service
KillMode=process

Instead of (on the Ubuntu nodes):
KillMode=control-group
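
For anyone finding this thread later: one way to apply that setting on the Ubuntu nodes without editing the packaged unit file would be a systemd drop-in. This is only a sketch; the drop-in file name killmode.conf, the DROPIN_DIR variable and the function name are illustrative choices, not something taken from the Slurm packaging.

```shell
# Sketch: write a drop-in that overrides KillMode for slurmd.service.
# DROPIN_DIR, the file name and the function name are illustrative assumptions.
DROPIN_DIR="${DROPIN_DIR:-/etc/systemd/system/slurmd.service.d}"

write_killmode_dropin() {
    mkdir -p "$DROPIN_DIR" || return 1
    # With KillMode=process, systemd only signals the main slurmd process
    # on stop/restart, not every process in the unit's control group.
    cat > "$DROPIN_DIR/killmode.conf" <<'EOF'
[Service]
KillMode=process
EOF
}
```

Run as root, this would be used as 'write_killmode_dropin && systemctl daemon-reload'; afterwards 'systemctl show -p KillMode slurmd.service' should report the new value.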

From: slurm-users <slurm-users-bounces at lists.schedmd.com> On behalf of Yair Yarom
Sent: Tuesday, November 30, 2021 08:50
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] WTERMSIG 15

Hi,

There were two cases where this happened to us as well:
1. The systemd slurmd.service wasn't configured properly, so the jobs ran under the slurmd.slice. When slurmd is restarted, systemd then sends a signal to all of those processes. You can check whether this is the case with 'systemctl status slurmd.service': the jobs shouldn't be listed there.
2. When changing the partitions: since jobs here are sent to most partitions by default, removing partitions, or nodes from partitions, might cause the jobs in the relevant partitions to be killed.
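
The check in case 1 can be scripted, as a rough sketch: count the slurmstepd processes that show up in the status output. The function name is made up for illustration, and it assumes running job steps appear as slurmstepd processes in that listing.

```shell
# Sketch: count lines mentioning slurmstepd in text read from stdin.
# The function name is hypothetical; it just wraps grep.
count_stepd() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so force a zero exit status for use under 'set -e'.
    grep -c 'slurmstepd' || true
}
```

For example, 'systemctl status slurmd.service --no-pager --full | count_stepd' printing anything other than 0 would suggest that jobs are running inside the slurmd control group.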

HTH,


On Mon, Nov 29, 2021 at 6:46 PM LEROY Christine 208562 <Christine.LEROY2 at cea.fr> wrote:
Hello all,

I made some modifications to my slurm.conf, then restarted slurmctld on the master and slurmd on the nodes.
During this process I lost some jobs (*); curiously, all of these jobs were on Ubuntu nodes.
These jobs were fine in terms of consumed resources (**).

Any idea what the problem could be?
Thanks in advance
Best regards,
Christine Leroy


(*)
[2021-11-29T14:17:09.205] error: Node xxx appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2021-11-29T14:17:10.162] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2021-11-29T14:17:42.223] _job_complete: JobId=4546 WTERMSIG 15
[2021-11-29T14:17:42.223] _job_complete: JobId=4546 done
[2021-11-29T14:17:42.224] _job_complete: JobId=4666 WTERMSIG 15
[2021-11-29T14:17:42.224] _job_complete: JobId=4666 done
[2021-11-29T14:17:42.236] _job_complete: JobId=4665 WTERMSIG 15
[2021-11-29T14:17:42.236] _job_complete: JobId=4665 done
[2021-11-29T14:17:46.072] _job_complete: JobId=4533 WTERMSIG 15
[2021-11-29T14:17:46.072] _job_complete: JobId=4533 done
[2021-11-29T14:17:59.005] _job_complete: JobId=4664 WTERMSIG 15
[2021-11-29T14:17:59.005] _job_complete: JobId=4664 done
[2021-11-29T14:17:59.006] _job_complete: JobId=4663 WTERMSIG 15
[2021-11-29T14:17:59.007] _job_complete: JobId=4663 done
[2021-11-29T14:17:59.021] _job_complete: JobId=4539 WTERMSIG 15
[2021-11-29T14:17:59.021] _job_complete: JobId=4539 done
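
Signal 15 is SIGTERM, so these jobs were terminated from outside rather than crashing on their own. To collect the affected job IDs from the slurmctld log (for example into the /tmp/job15 file used in the sacct command), something like the following sketch would do. The function name and the log path in the usage line are assumptions; adjust them to your site.

```shell
# Sketch: print the unique job IDs that ended with WTERMSIG 15,
# reading slurmctld log lines from stdin. Function name is hypothetical.
sig15_jobs() {
    sed -n 's/.*_job_complete: JobId=\([0-9][0-9]*\) WTERMSIG 15$/\1/p' | sort -un
}
```

Usage would be along the lines of 'sig15_jobs < /var/log/slurm/slurmctld.log > /tmp/job15'.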


(**)
# sacct --format=JobID,JobName,ReqCPUS,ReqMem,Start,State,CPUTime,MaxRSS | grep -f /tmp/job15
4533              xterm        1       16Gn 2021-11-24T16:31:32     FAILED 4-21:46:14
4533.batch        batch        1       16Gn 2021-11-24T16:31:32  CANCELLED 4-21:46:14   8893664K
4533.extern      extern        1       16Gn 2021-11-24T16:31:32  COMPLETED 4-21:46:11          0
4539              xterm       16       16Gn 2021-11-24T16:34:25     FAILED 78-11:37:04
4539.batch        batch       16       16Gn 2021-11-24T16:34:25  CANCELLED 78-11:37:04  23781384K
4539.extern      extern       16       16Gn 2021-11-24T16:34:25  COMPLETED 78-11:32:48          0
4546              xterm       16       16Gn 2021-11-24T17:17:54     FAILED 77-23:56:48
4546.batch        batch       16       16Gn 2021-11-24T17:17:54  CANCELLED 77-23:56:48  18541468K
4546.extern      extern       16       16Gn 2021-11-24T17:17:54  COMPLETED 77-23:56:00          0
4663              xterm        1       12Gn 2021-11-26T16:51:12     FAILED 2-21:26:47
4663.batch        batch        1       12Gn 2021-11-26T16:51:12  CANCELLED 2-21:26:47   2275232K
4663.extern      extern        1       12Gn 2021-11-26T16:51:12  COMPLETED 2-21:26:34          0
4664              xterm        1       12Gn 2021-11-26T17:13:42     FAILED 2-21:04:17
4664.batch        batch        1       12Gn 2021-11-26T17:13:42  CANCELLED 2-21:04:17   1484036K
4664.extern      extern        1       12Gn 2021-11-26T17:13:42  COMPLETED 2-21:04:17          0
4665              xterm        1        8Gn 2021-11-26T17:18:12     FAILED 2-20:59:30
4665.batch        batch        1        8Gn 2021-11-26T17:18:12  CANCELLED 2-20:59:30   1159140K
4665.extern      extern        1        8Gn 2021-11-26T17:18:12  COMPLETED 2-20:59:27          0
4666              xterm        1        8Gn 2021-11-26T17:22:12     FAILED 2-20:55:30
4666.batch        batch        1        8Gn 2021-11-26T17:22:12  CANCELLED 2-20:55:30   2090708K
4666.extern      extern        1        8Gn 2021-11-26T17:22:12  COMPLETED 2-20:55:27          0
4711              xterm        4        3Gn 2021-11-29T14:47:09     FAILED   00:20:08
4711.batch        batch        4        3Gn 2021-11-29T14:47:09  CANCELLED   00:20:08     37208K
4711.extern      extern        4        3Gn 2021-11-29T14:47:09  COMPLETED   00:20:00          0
4714          deckbuild       10       30Gn 2021-11-29T14:51:46     FAILED   00:05:20
4714.batch        batch       10       30Gn 2021-11-29T14:51:46  CANCELLED   00:05:20      4036K
4714.extern      extern       10       30Gn 2021-11-29T14:51:46  COMPLETED   00:05:10          0
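
Since grep -f matches the IDs anywhere on a line, an alternative is to hand the job list to sacct directly with -j, which takes a comma-separated list of job IDs. A small sketch, assuming the file holds one job ID per line; the function name is just illustrative.

```shell
# Sketch: join the job IDs in a file (one per line) with commas.
# Function name is hypothetical.
job_list() {
    paste -sd, "$1"
}
```

This could then be used as 'sacct -j "$(job_list /tmp/job15)" --format=JobID,JobName,ReqCPUS,ReqMem,Start,State,CPUTime,MaxRSS'.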


--

  /|       |
  \/       | Yair Yarom | System Group (DevOps)
  []       | The Rachel and Selim Benin School
  [] /\    | of Computer Science and Engineering
  []//\\/  | The Hebrew University of Jerusalem
  [//  \\  | T +972-2-5494522 | F +972-2-5494522
  //    \  | irush at cs.huji.ac.il
 //        |