<div dir="ltr"><div>Matteo, a stupid question but if these are single CPU jobs why is mpirun being used?</div><div><br></div><div>Is your user using these 36 jobs to construct a parallel job to run charmm?</div><div>If the mpirun is killed, yes all the other processes which are started by it on the other compute nodes will be killed.</div><div><br></div><div>I suspect your user is trying to do womething "smart". You should give that person an example of how to reserve 36 cores and submit a charmm job.<br></div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On 29 June 2018 at 12:13, Matteo Guglielmi <span dir="ltr"><<a href="mailto:Matteo.Guglielmi@dalco.ch" target="_blank">Matteo.Guglielmi@dalco.ch</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Dear comunity,<br>

On 29 June 2018 at 12:13, Matteo Guglielmi <Matteo.Guglielmi@dalco.ch> wrote:

Dear community,

I have a user who usually submits 36 (identical) jobs at a time using a simple
for loop, so all the jobs are sbatched at the same time.

Each job requests a single core, and all the jobs are independent of one another
(they read different input files and write to different output files).
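
The submission loop is roughly of this form (job.sh and the input file names are
placeholders, not the actual script):

for i in $(seq 1 36); do
    sbatch --ntasks=1 job.sh input_${i}.inp
done

where job.sh is the per-job batch script that ends up running charmm through
mpirun, as the slurmd log below shows.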

The jobs are then usually started over the course of the next couple of hours, at
somewhat random times.

What happens then is that after a certain amount of time (maybe 2 to 12 hours)
ALL jobs belonging to this particular user are killed by slurm on all nodes at
exactly the same time.

One example:

### master: /var/log/slurmctld.log ###

[2018-06-28T18:43:06.871] _slurm_rpc_submit_batch_job: JobId=718560 InitPrio=4294185624 usec=255
...
[2018-06-28T19:29:04.671] backfill: Started JobID=718560 in partition on node38
...
[2018-06-28T23:37:53.471] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 718560 uid 1007
[2018-06-28T23:37:53.472] _job_signal: 9 of running JobID=718560 State=0x8004 NodeCnt=1 successful 0x8004

### node38: /var/log/slurmd.log ###

[2018-06-28T19:29:05.410] _run_prolog: prolog with lock for job 718560 ran for 0 seconds
[2018-06-28T19:29:05.410] Launching batch job 718560 for UID 1007
[2018-06-28T19:29:05.427] [718560.batch] Munge cryptographic signature plugin loaded
[2018-06-28T19:29:05.431] [718560.batch] debug level = 2
[2018-06-28T19:29:05.431] [718560.batch] starting 1 tasks
[2018-06-28T19:29:05.431] [718560.batch] task 0 (69791) started 2018-06-28T19:29:05
[2018-06-28T19:29:05.440] [718560.batch] Can't propagate RLIMIT_NOFILE of 65536 from submit host: Operation not permitted
...
[2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69794 (charmm)
[2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69792 (mpirun)
[2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69791 (slurm_script)
[2018-06-28T23:37:53.480] [718560.batch] Sent signal 18 to 718560.429496729
[2018-06-28T23:37:53.485] [718560.batch] error: *** JOB 718560 ON node38 CANCELLED AT 2018-06-28T23:37:53 ***
[2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69794 (charmm)
[2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69792 (mpirun)
[2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69791 (slurm_script)
[2018-06-28T23:37:53.488] [718560.batch] Sent signal 15 to 718560.4294967294
[2018-06-28T23:37:53.492] [718560.batch] task 0 (69791) exited. Killed by signal 15.
[2018-06-28T23:37:53.512] [718560.batch] job 718560 completed with slurm_rc = 0, job_rc = 15
[2018-06-28T23:37:53.512] [718560.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
[2018-06-28T23:37:53.516] [718560.batch] done with job
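
A side note on reading these logs: the REQUEST_KILL_JOB above was issued by uid 1007,
the same uid the batch job was launched for, so the cancellation appears to come from
the user's side rather than from a scheduler limit. Something along these lines can
help correlate the events ("someuser" is a placeholder):

# list every kill request the controller has recorded
grep REQUEST_KILL_JOB /var/log/slurmctld.log

# accounting view of that user's jobs around the incident
sacct -u someuser --starttime=2018-06-28 \
      --format=JobID,JobName,State,ExitCode,Start,End
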
The slurm cluster has a minimal configuration:

ClusterName=cluster
ControlMachine=master
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
FastSchedule=1
SlurmUser=slurm
SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/var/spool/slurm/
SlurmdSpoolDir=/var/spool/slurm/
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
Proctracktype=proctrack/linuxproc
ReturnToService=2
PropagatePrioProcess=0
PropagateResourceLimitsExcept=MEMLOCK
TaskPlugin=task/cgroup
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
SlurmctldDebug=4
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=4
SlurmdLogFile=/var/log/slurmd.log
JobCompType=jobcomp/none
JobAcctGatherType=jobacct_gather/cgroup
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=master
AccountingStorageLoc=all
NodeName=node[01-45] Sockets=2 CoresPerSocket=10 State=UNKNOWN
PartitionName=partition Nodes=ALL Default=YES MaxTime=INFINITE State=UP

Thank you for your help.