[slurm-users] All user's jobs killed at the same time on all nodes

Fri Jun 29 11:34:09 MDT 2018

A couple comments/possible suggestions.

First, it looks to me as if all the jobs are run from the same directory
with the same input/output files.  Or am I missing something?

Also, what MPI library is being used?

I would suggest verifying whether any of the jobs in question are
terminating normally.  I.e., is the mysterious issue which is causing all
the user's jobs to terminate triggered by the completion of one of the jobs?
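
One way to check this is sketched below, assuming the controller log is at /var/log/slurmctld.log as in the original post and that the log lines are not wrapped the way they are in the mail.  The helper name kill_requests is hypothetical; it lists every kill request with its timestamp, job ID and the uid recorded with the request, which can then be compared against the end times of the jobs that finished normally.

```shell
#!/bin/bash
# kill_requests (hypothetical helper): read slurmctld log lines on stdin and
# print "timestamp jobid uid" for every REQUEST_KILL_JOB entry, using the
# log format quoted in the original post.
kill_requests() {
    sed -n 's/^\[\([^]]*\)] _slurm_rpc_kill_job: REQUEST_KILL_JOB job \([0-9][0-9]*\) uid \([0-9][0-9]*\).*/\1 \2 \3/p'
}

# Typical use on the master node (path taken from the original post):
#   kill_requests < /var/log/slurmctld.log
#
# Accounting view of the same jobs, to see which ones ended normally and when
# (replace <user> with the account in question):
#   sacct -X -u <user> -S 2018-06-28 --format=JobID,State,ExitCode,Start,End,NodeList
```

In the log excerpt quoted below, the kill request for job 718560 carries uid 1007, the same UID the batch job was launched for, so it may be worth checking whether something running under that account (for example the tail end of one of the job scripts) is issuing the cancel.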

I recall having an issue years ago with MPICH MPI libraries when having
multiple MPI jobs from the same user running on the same node.  IIRC, when
one job terminated (usually successfully), it would call mpdallexit, which
would happily kill all the mpds for that user on that node, making the
other MPI jobs that user had on that node quite unhappy.  The solution was
to set the environment variable MPD_CON_EXT to a unique value for each of
the jobs.  See e.g.
https://lists.mcs.anl.gov/pipermail/mpich-discuss/2008-May/003605.html
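
A minimal sketch of that workaround for a Slurm batch script, assuming an MPD-based MPICH launcher.  It derives the value from SLURM_JOB_ID, which Slurm sets inside every job; the helper name mpd_con_ext_for_job and the "slurm_" prefix are illustrative choices, not anything MPICH requires.

```shell
#!/bin/bash
# mpd_con_ext_for_job (hypothetical helper): build a per-job value for
# MPD_CON_EXT from the Slurm job ID, falling back to the shell's PID when
# the script is run outside of Slurm.
mpd_con_ext_for_job() {
    printf 'slurm_%s' "${SLURM_JOB_ID:-$$}"
}

# Export it before any mpd-related command runs in the job script, so that
# each job talks only to its own mpd.
export MPD_CON_EXT="$(mpd_con_ext_for_job)"
```

This only matters for MPD-based MPICH.  The job script quoted below puts /usr/lib64/openmpi/bin on the PATH, and Open MPI's mpirun does not use mpd daemons, so the variable would most likely have no effect there.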

My users primarily use OpenMPI, so I do not have much recent experience
with this issue.  IIRC, it only impacted other MPI jobs run by the same
user on the same node, so it is a bit different from the symptoms as you
describe them (impacting all MPI jobs run by the same user on ANY node),
but as there is some similarity in the symptoms I thought I would mention
it anyway.

On Fri, Jun 29, 2018 at 7:24 AM, John Hearns <hearnsj at googlemail.com> wrote:

> I have got this all wrong. Paddy Doyle has got it right.
>
> However are you SURE that mpirun is not creating tasks on the other
> machines?
> I would look at the compute nodes while the job is running and do
> ps -eaf --forest
>
> Also using mpirun to run a single core gives me the heebie-jeebies...
>
> https://en.wikipedia.org/wiki/Heebie-jeebies_(idiom)
>
>
>
>
> On 29 June 2018 at 13:16, Matteo Guglielmi <Matteo.Guglielmi at dalco.ch>
> wrote:
>
>> You are right but I'm actually supporting the system administrator of
>> that cluster, I'll mention this to him.
>>
>> Besides that,
>>
>> the user runs this for loop to submit the jobs:
>>
>>
>> # submit.sh #
>>
>> typeset -i i=1
>> typeset -i j=12500  #number of frames goes to each core = number of frames (1000000)/40 (cores) =
>> typeset -i k=1
>>
>> while [ $i -le 36 ]  #the number of frames
>> do
>>
>> sbatch run-5o$i.sh $i $j $k
>>
>> i=$i+1 # number of frames goes to each node (5*200 = 1000)
>> done
>>
>> where each run-5oXX.sh jobfile looks like this:
>>
>>
>> #!/bin/bash
>>
>> #SBATCH --job-name=charmm-test
>> #SBATCH --nodes=1
>> #SBATCH --ntasks=1
>> #SBATCH --cpus-per-task=1
>>
>> export PATH=/usr/lib64/openmpi/bin/:$PATH
>> export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH
>>
>> mpirun -np 1 /opt/cluster/programs/charmm/c42b2/exec/gnu_M/charmm < newphcnl99a0.inp > newphcnl99a0.out
>>
>>
>>
>>
>> so they are all independent mpiruns...  if one of them is killed, why
>> would all others go down as well?
>>
>>
>> That would make sense if a single mpirun is running 36 tasks... but the
>> user is not doing this.
>>
>> ________________________________
>> From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
>> John Hearns <hearnsj at googlemail.com>
>> Sent: Friday, June 29, 2018 12:52:41 PM
>> To: Slurm User Community List
>> Subject: Re: [slurm-users] All user's jobs killed at the same time on all
>> nodes
>>
>> Matteo, a stupid question but if these are single CPU jobs why is mpirun
>> being used?
>>
>> Is your user using these 36 jobs to construct a parallel job to run
>> charmm?
>> If the mpirun is killed, yes all the other processes which are started by
>> it on the other compute nodes will be killed.
>>
>> I suspect your user is trying to do something "smart". You should give
>> that person an example of how to reserve 36 cores and submit a charmm job.
>>
>>
>> On 29 June 2018 at 12:13, Matteo Guglielmi <Matteo.Guglielmi at dalco.ch>
>> wrote:
>> Dear community,
>>
>> I have a user who usually submits 36 (identical) jobs at a time using a
>> simple for loop, thus jobs are sbatched all at the same time.
>>
>> Each job requests a single core and all jobs are independent from one
>> another (read different input files and write to different output files).
>>
>> Jobs are then usually started during the next couple of hours, somewhat
>> at random times.
>>
>> What happens then is that after a certain amount of time (maybe from 2 to
>> 12 hours) ALL jobs belonging to this particular user are killed by slurm
>> on all nodes at exactly the same time.
>>
>> One example:
>>
>> ### master: /var/log/slurmctld.log ###
>>
>> [2018-06-28T18:43:06.871] _slurm_rpc_submit_batch_job: JobId=718560
>> InitPrio=4294185624 usec=255
>> ...
>> [2018-06-28T19:29:04.671] backfill: Started JobID=718560 in partition on
>> node38
>> ...
>> [2018-06-28T23:37:53.471] _slurm_rpc_kill_job: REQUEST_KILL_JOB job
>> 718560 uid 1007
>> [2018-06-28T23:37:53.472] _job_signal: 9 of running JobID=718560
>> State=0x8004 NodeCnt=1 successful 0x8004
>>
>> ### node38: /var/log/slurmd.log ###
>>
>> [2018-06-28T19:29:05.410] _run_prolog: prolog with lock for job 718560
>> ran for 0 seconds
>> [2018-06-28T19:29:05.410] Launching batch job 718560 for UID 1007
>> [2018-06-28T19:29:05.427] [718560.batch] Munge cryptographic signature
>> plugin loaded
>> [2018-06-28T19:29:05.431] [718560.batch] debug level = 2
>> [2018-06-28T19:29:05.431] [718560.batch] starting 1 tasks
>> [2018-06-28T19:29:05.431] [718560.batch] task 0 (69791) started
>> 2018-06-28T19:29:05
>> [2018-06-28T19:29:05.440] [718560.batch] Can't propagate RLIMIT_NOFILE of
>> 65536 from submit host: Operation not permitted
>> ...
>> [2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69794
>> (charmm)
>> [2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69792
>> (mpirun)
>> [2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69791
>> (slurm_script)
>> [2018-06-28T23:37:53.480] [718560.batch] Sent signal 18 to
>> 718560.429496729
>> [2018-06-28T23:37:53.485] [718560.batch] error: *** JOB 718560 ON node38
>> CANCELLED AT 2018-06-28T23:37:53 ***
>> [2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69794
>> (charmm)
>> [2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69792
>> (mpirun)
>> [2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69791
>> (slurm_script)
>> [2018-06-28T23:37:53.488] [718560.batch] Sent signal 15 to
>> 718560.4294967294
>> [2018-06-28T23:37:53.492] [718560.batch] task 0 (69791) exited. Killed by
>> signal 15.
>> [2018-06-28T23:37:53.512] [718560.batch] job 718560 completed with
>> slurm_rc = 0, job_rc = 15
>> [2018-06-28T23:37:53.512] [718560.batch] sending
>> REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
>> [2018-06-28T23:37:53.516] [718560.batch] done with job
>>
>> The slurm cluster has a minimal configuration:
>>
>> ClusterName=cluster
>> ControlMachine=master
>> SchedulerType=sched/backfill
>> SelectType=select/cons_res
>> SelectTypeParameters=CR_Core
>> FastSchedule=1
>> SlurmUser=slurm
>> SlurmdUser=root
>> SlurmctldPort=6817
>> SlurmdPort=6818
>> AuthType=auth/munge
>> StateSaveLocation=/var/spool/slurm/
>> SlurmdSpoolDir=/var/spool/slurm/
>> SwitchType=switch/none
>> MpiDefault=none
>> SlurmctldPidFile=/var/run/slurmctld.pid
>> SlurmdPidFile=/var/run/slurmd.pid
>> Proctracktype=proctrack/linuxproc
>> ReturnToService=2
>> PropagatePrioProcess=0
>> PropagateResourceLimitsExcept=MEMLOCK
>> TaskPlugin=task/cgroup
>> SlurmctldTimeout=300
>> SlurmdTimeout=300
>> InactiveLimit=0
>> MinJobAge=300
>> KillWait=30
>> Waittime=0
>> SlurmctldDebug=4
>> SlurmctldLogFile=/var/log/slurmctld.log
>> SlurmdDebug=4
>> SlurmdLogFile=/var/log/slurmd.log
>> JobCompType=jobcomp/none
>> JobAcctGatherType=jobacct_gather/cgroup
>> AccountingStorageType=accounting_storage/slurmdbd
>> AccountingStorageHost=master
>> AccountingStorageLoc=all
>> NodeName=node[01-45] Sockets=2 CoresPerSocket=10 State=UNKNOWN
>> PartitionName=partition Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>>
>> Thank you for your help.
>>
>>
>>
>>
>

-- 
Tom Payerle
DIT-ACIGS/Mid-Atlantic Crossroads        payerle at umd.edu
5825 University Research Park               (301) 405-6135
University of Maryland
College Park, MD 20740-3831