[slurm-users] All of a user's jobs killed at the same time on all nodes

Matteo Guglielmi Matteo.Guglielmi at dalco.ch
Fri Jun 29 04:13:33 MDT 2018


Dear community,

I have a user who usually submits 36 (identical) jobs at a time using a simple for loop,
so all jobs are sbatched at the same time.

Each job requests a single core, and all jobs are independent of one another (they read
different input files and write to different output files).
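
The submission pattern is essentially the following (a minimal sketch; the script name,
input/output file names, and option values are illustrative, not the user's actual script):

  #!/bin/bash
  # Submit 36 independent single-core batch jobs, each with its own input and output file.
  for i in $(seq 1 36); do
      sbatch --ntasks=1 --output=run_${i}.out job.sh input_${i}.dat
  done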

The jobs then usually start over the next couple of hours, at somewhat random times.

What then happens is that after a certain amount of time (anywhere from 2 to 12 hours),
ALL jobs belonging to this particular user are killed by Slurm on all nodes at exactly the
same time.

One example:

### master: /var/log/slurmctld.log ###

[2018-06-28T18:43:06.871] _slurm_rpc_submit_batch_job: JobId=718560 InitPrio=4294185624 usec=255
...
[2018-06-28T19:29:04.671] backfill: Started JobID=718560 in partition on node38
...
[2018-06-28T23:37:53.471] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 718560 uid 1007
[2018-06-28T23:37:53.472] _job_signal: 9 of running JobID=718560 State=0x8004 NodeCnt=1 successful 0x8004

### node38: /var/log/slurmd.log ###

[2018-06-28T19:29:05.410] _run_prolog: prolog with lock for job 718560 ran for 0 seconds
[2018-06-28T19:29:05.410] Launching batch job 718560 for UID 1007
[2018-06-28T19:29:05.427] [718560.batch] Munge cryptographic signature plugin loaded
[2018-06-28T19:29:05.431] [718560.batch] debug level = 2
[2018-06-28T19:29:05.431] [718560.batch] starting 1 tasks
[2018-06-28T19:29:05.431] [718560.batch] task 0 (69791) started 2018-06-28T19:29:05
[2018-06-28T19:29:05.440] [718560.batch] Can't propagate RLIMIT_NOFILE of 65536 from submit host: Operation not permitted
...
[2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69794 (charmm)
[2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69792 (mpirun)
[2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69791 (slurm_script)
[2018-06-28T23:37:53.480] [718560.batch] Sent signal 18 to 718560.429496729
[2018-06-28T23:37:53.485] [718560.batch] error: *** JOB 718560 ON node38 CANCELLED AT 2018-06-28T23:37:53 ***
[2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69794 (charmm)
[2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69792 (mpirun)
[2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69791 (slurm_script)
[2018-06-28T23:37:53.488] [718560.batch] Sent signal 15 to 718560.4294967294
[2018-06-28T23:37:53.492] [718560.batch] task 0 (69791) exited. Killed by signal 15.
[2018-06-28T23:37:53.512] [718560.batch] job 718560 completed with slurm_rc = 0, job_rc = 15
[2018-06-28T23:37:53.512] [718560.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
[2018-06-28T23:37:53.516] [718560.batch] done with job
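
One thing I notice in the controller log above: the REQUEST_KILL_JOB RPC is recorded as
coming from uid 1007, the same uid the batch job was launched for, so the kill looks like a
user-level cancel request rather than a scheduler or limit enforcement action. As far as I
understand, the same log signature (SIGCONT 18 followed by SIGTERM 15 on all of the user's
jobs at once) would be produced by a plain cancel of everything the user has running, e.g.
(username is just a placeholder):

  scancel -u someuser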

The Slurm cluster has a minimal configuration:

ClusterName=cluster
ControlMachine=master
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
FastSchedule=1
SlurmUser=slurm
SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/var/spool/slurm/
SlurmdSpoolDir=/var/spool/slurm/
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/linuxproc
ReturnToService=2
PropagatePrioProcess=0
PropagateResourceLimitsExcept=MEMLOCK
TaskPlugin=task/cgroup
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
SlurmctldDebug=4
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=4
SlurmdLogFile=/var/log/slurmd.log
JobCompType=jobcomp/none
JobAcctGatherType=jobacct_gather/cgroup
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=master
AccountingStorageLoc=all
NodeName=node[01-45] Sockets=2 CoresPerSocket=10 State=UNKNOWN
PartitionName=partition Nodes=ALL Default=YES MaxTime=INFINITE State=UP
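
In case it helps, since accounting goes to slurmdbd, the record for the example job can be
pulled with something like this (a sketch; the field list is just what I would look at first):

  sacct -j 718560 --format=JobID,JobName,State,ExitCode,Start,End,NodeList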

Thank you for your help.


