[slurm-users] All user's jobs killed at the same time on all nodes

John Hearns hearnsj at googlemail.com
Mon Jul 2 04:37:13 MDT 2018


A great detective story!

> June15 but there is no trace of it anywhere on the disk.

Do you have the process ID (pid) of watchdog.sh?
You could look in /proc/<pid>/cmdline and see what that shows.
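
For example, assuming the PID is 1695 as in your ps listing below, something
along these lines should show what is still recoverable from the kernel, even
though the script itself is gone from the disk:

  tr '\0' ' ' < /proc/1695/cmdline; echo    # full command line (arguments are NUL-separated)
  ls -l /proc/1695/cwd /proc/1695/exe       # working directory and interpreter
  ls -l /proc/1695/fd                       # open fds; one may still point to the (deleted) script

If the shell still holds the script file open, you may even be able to recover
its contents by cat-ing the corresponding entry under /proc/1695/fd/.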





On 2 July 2018 at 11:37, Matteo Guglielmi <Matteo.Guglielmi at dalco.ch> wrote:

> Unbelievable... and I got it by chance.
>
> Jobs were killed (again) at 21:04, and in the user's list of running
> processes there was a 'sleep 50000' command (13 hours + 53
> minutes + 20 seconds) which was fired up at exactly the same
> time.
>
> The watchdog.sh script (from which the sleep command is fired)
> was started on June 15, but there is no trace of it anywhere on the
> disk.
>
> What's in that script I don't know, but it kills all the user's jobs
> almost twice a day... and I've waited for it to do it again this
> morning at 10:57... and sure enough, all jobs disappeared and
> a new sleep 50000 command was fired.
>
> Thank you all anyway!
>
> -rw-rw-r-- 1 moha moha     117 Jul  1 21:04 slurm-764719.out
> -rw-rw-r-- 1 moha moha     117 Jul  1 21:04 slurm-764720.out
> -rw-rw-r-- 1 moha moha     117 Jul  1 21:04 slurm-764721.out
> -rw-rw-r-- 1 moha moha     117 Jul  1 21:04 slurm-764722.out
> -rw-rw-r-- 1 moha moha     117 Jul  1 21:04 slurm-764723.out
> -rw-rw-r-- 1 moha moha     117 Jul  1 21:04 slurm-764724.out
> -rw-rw-r-- 1 moha moha     117 Jul  1 21:04 slurm-764725.out
> -rw-rw-r-- 1 moha moha     117 Jul  1 21:04 slurm-764726.out
>
>
> [moha@master ~]$ ps aux | grep moha
> moha       1695  0.0  0.0 113128  1416 ?        S    Jun15   0:00 sh watchdog.sh
> moha      76720  0.0  0.0 150844  2696 ?        S    Jun28   0:00 sshd: moha@pts/10
> moha      76724  0.0  0.0 116692  3532 pts/10   Ss+  Jun28   0:00 -bash
> moha     149663  0.0  0.0 150400  2240 ?        S    Jun28   0:00 sshd: moha@pts/0
> moha     149664  0.0  0.0 116692  3536 pts/0    Ss+  Jun28   0:00 -bash
> moha     156670  0.0  0.0 150400  2236 ?        S    Jun28   0:00 sshd: moha@pts/5
> moha     156671  0.0  0.0 116692  3604 pts/5    Ss+  Jun28   0:00 -bash
> moha     164364  0.0  0.0 107904   608 ?        S    21:04   0:00 sleep 50000    <<<<<<<<<<=========== !!!!
> moha     190871  0.0  0.0 116684  3472 pts/4    S    21:46   0:00 -bash
> moha     194080  0.0  0.0 151060  1820 pts/4    R+   21:52   0:00 ps aux
> moha     194081  0.0  0.0 112664   972 pts/4    S+   21:52   0:00 grep --color=auto moha
>
>
> ________________________________
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
> Thomas M. Payerle <payerle at umd.edu>
> Sent: Friday, June 29, 2018 7:34:09 PM
> To: Slurm User Community List
> Subject: Re: [slurm-users] All user's jobs killed at the same time on all
> nodes
>
> A couple of comments/possible suggestions.
>
> First, it looks to me that all the jobs are run from the same directory
> with the same input/output files.  Or am I missing something?
>
> Also, what MPI library is being used?
>
> I would suggest verifying whether any of the jobs in question are terminating
> normally, i.e., whether the mysterious issue causing all the user's
> jobs to terminate is triggered by the completion of one of the jobs.
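>
> For example, since accounting is going to slurmdbd (per the config below),
> something like the following (with the username and dates adjusted) should
> show whether one of the user's jobs finished normally right before the mass
> kill:
>
>   sacct -u <user> -S 2018-06-28 -E 2018-06-29 \
>         -o JobID,JobName,State,ExitCode,Start,End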
>
> I recall having an issue years ago with MPICH MPI libraries when having
> multiple MPI jobs from the same user running on the same node.  IIRC, when
> one job terminated (usually successfully), it would call mpdallexit, which
> would happily kill all the mpds for that user on that node, making the
> other MPI jobs that user had on that node quite unhappy.  The solution was
> to set the environment variable MPD_CON_EXT to a unique value for each of
> the jobs.  See e.g.
> https://lists.mcs.anl.gov/pipermail/mpich-discuss/2008-May/003605.html
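>
> A minimal sketch of that workaround, assuming it is added near the top of
> each job script (any value that is unique per job works; the Slurm job id is
> one convenient choice):
>
>   # give each job its own mpd console socket so that one job's mpdallexit
>   # cannot tear down the mpds belonging to the user's other jobs
>   export MPD_CON_EXT=${SLURM_JOB_ID}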
>
> My users primarily use OpenMPI, so I do not have much recent experience
> with this issue.  IIRC, this issue only impacted other MPI jobs run by
> the same user on the same node, so it is a bit different from the symptoms
> as you describe them (impacting all MPI jobs run by the same user on ANY
> node), but as there is some similarity in the symptoms I thought I would
> mention it anyway.
>
>
> On Fri, Jun 29, 2018 at 7:24 AM, John Hearns <hearnsj at googlemail.com> wrote:
> I have got this all wrong. Paddy Doyle has got it right.
>
> However, are you SURE that mpirun is not creating tasks on the other
> machines?
> I would look at the compute nodes while the job is running and do
> ps -eaf --forest
>
> Also, using mpirun to run a single core gives me the heebie-jeebies...
>
> https://en.wikipedia.org/wiki/Heebie-jeebies_(idiom)
>
>
>
>
> On 29 June 2018 at 13:16, Matteo Guglielmi <Matteo.Guglielmi at dalco.ch> wrote:
> You are right, but I'm actually supporting the system administrator of that
> cluster; I'll mention this to him.
>
> Besides that,
>
> the user runs this for loop to submit the jobs:
>
>
> # submit.sh #
>
> typeset -i i=1
> typeset -i j=12500  # number of frames that go to each core = number of frames (1000000) / 40 (cores)
> typeset -i k=1
>
> while [ $i -le 36 ]  # the number of frames
> do
>
>   sbatch run-5o$i.sh $i $j $k
>
>   i=$i+1  # number of frames that go to each node (5*200 = 1000)
> done
>
> where each run-5oXX.sh jobfile looks like this:
>
>
> #!/bin/bash
>
> #SBATCH --job-name=charmm-test
> #SBATCH --nodes=1
> #SBATCH --ntasks=1
> #SBATCH --cpus-per-task=1
>
> export PATH=/usr/lib64/openmpi/bin/:$PATH
> export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH
>
> mpirun -np 1 /opt/cluster/programs/charmm/c42b2/exec/gnu_M/charmm < newphcnl99a0.inp > newphcnl99a0.out
>
>
>
>
> So they are all independent mpiruns... if one of them is killed, why
> would all the others go down as well?
>
>
> That would make sense if a single mpirun were running 36 tasks... but the
> user is not doing that.
>
> ________________________________
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
> John Hearns <hearnsj at googlemail.com>
> Sent: Friday, June 29, 2018 12:52:41 PM
> To: Slurm User Community List
> Subject: Re: [slurm-users] All user's jobs killed at the same time on all
> nodes
>
> Matteo, a stupid question, but if these are single-CPU jobs why is mpirun
> being used?
>
> Is your user using these 36 jobs to construct a parallel job to run charmm?
> If the mpirun is killed, then yes, all the other processes it started on
> the other compute nodes will be killed as well.
>
> I suspect your user is trying to do something "smart".  You should give
> that person an example of how to reserve 36 cores and submit a charmm job.
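>
> A rough sketch of what such a job script could look like (the charmm path,
> input and output names here are placeholders, and whether charmm should
> actually be run with 36 MPI ranks depends on how the input is set up):
>
>   #!/bin/bash
>   #SBATCH --job-name=charmm
>   #SBATCH --ntasks=36
>   #SBATCH --cpus-per-task=1
>
>   export PATH=/usr/lib64/openmpi/bin/:$PATH
>   export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH
>
>   # one mpirun for all 36 ranks instead of 36 separate single-rank jobs
>   mpirun -np 36 /opt/cluster/programs/charmm/c42b2/exec/gnu_M/charmm < input.inp > output.out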
>
>
> On 29 June 2018 at 12:13, Matteo Guglielmi <Matteo.Guglielmi at dalco.ch> wrote:
> Dear community,
>
> I have a user who usually submits 36 (identical) jobs at a time using a
> simple for loop,
> thus all jobs are sbatched at the same time.
>
> Each job requests a single core and all jobs are independent of one
> another (they read
> different input files and write to different output files).
>
> Jobs are then usually started during the next couple of hours, somewhat at
> random
> times.
>
> What happens then is that after a certain amount of time (maybe from 2 to
> 12 hours)
> ALL jobs belonging to this particular user are killed by slurm on all
> nodes at exactly the
> same time.
>
> One example:
>
> ### master: /var/log/slurmctld.log ###
>
> [2018-06-28T18:43:06.871] _slurm_rpc_submit_batch_job: JobId=718560
> InitPrio=4294185624 usec=255
> ...
> [2018-06-28T19:29:04.671] backfill: Started JobID=718560 in partition on
> node38
> ...
> [2018-06-28T23:37:53.471] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 718560
> uid 1007
> [2018-06-28T23:37:53.472] _job_signal: 9 of running JobID=718560
> State=0x8004 NodeCnt=1 successful 0x8004
>
> ### node38: /var/log/slurmd.log ###
>
> [2018-06-28T19:29:05.410] _run_prolog: prolog with lock for job 718560 ran
> for 0 seconds
> [2018-06-28T19:29:05.410] Launching batch job 718560 for UID 1007
> [2018-06-28T19:29:05.427] [718560.batch] Munge cryptographic signature
> plugin loaded
> [2018-06-28T19:29:05.431] [718560.batch] debug level = 2
> [2018-06-28T19:29:05.431] [718560.batch] starting 1 tasks
> [2018-06-28T19:29:05.431] [718560.batch] task 0 (69791) started
> 2018-06-28T19:29:05
> [2018-06-28T19:29:05.440] [718560.batch] Can't propagate RLIMIT_NOFILE of
> 65536 from submit host: Operation not permitted
> ...
> [2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69794
> (charmm)
> [2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69792
> (mpirun)
> [2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69791
> (slurm_script)
> [2018-06-28T23:37:53.480] [718560.batch] Sent signal 18 to 718560.429496729
> [2018-06-28T23:37:53.485] [718560.batch] error: *** JOB 718560 ON node38
> CANCELLED AT 2018-06-28T23:37:53 ***
> [2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69794
> (charmm)
> [2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69792
> (mpirun)
> [2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69791
> (slurm_script)
> [2018-06-28T23:37:53.488] [718560.batch] Sent signal 15 to
> 718560.4294967294
> [2018-06-28T23:37:53.492] [718560.batch] task 0 (69791) exited. Killed by
> signal 15.
> [2018-06-28T23:37:53.512] [718560.batch] job 718560 completed with
> slurm_rc = 0, job_rc = 15
> [2018-06-28T23:37:53.512] [718560.batch] sending
> REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
> [2018-06-28T23:37:53.516] [718560.batch] done with job
>
> The slurm cluster has a minimal configuration:
>
> ClusterName=cluster
> ControlMachine=master
> SchedulerType=sched/backfill
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core
> FastSchedule=1
> SlurmUser=slurm
> SlurmdUser=root
> SlurmctldPort=6817
> SlurmdPort=6818
> AuthType=auth/munge
> StateSaveLocation=/var/spool/slurm/
> SlurmdSpoolDir=/var/spool/slurm/
> SwitchType=switch/none
> MpiDefault=none
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmdPidFile=/var/run/slurmd.pid
> ProctrackType=proctrack/linuxproc
> ReturnToService=2
> PropagatePrioProcess=0
> PropagateResourceLimitsExcept=MEMLOCK
> TaskPlugin=task/cgroup
> SlurmctldTimeout=300
> SlurmdTimeout=300
> InactiveLimit=0
> MinJobAge=300
> KillWait=30
> Waittime=0
> SlurmctldDebug=4
> SlurmctldLogFile=/var/log/slurmctld.log
> SlurmdDebug=4
> SlurmdLogFile=/var/log/slurmd.log
> JobCompType=jobcomp/none
> JobAcctGatherType=jobacct_gather/cgroup
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStorageHost=master
> AccountingStorageLoc=all
> NodeName=node[01-45] Sockets=2 CoresPerSocket=10 State=UNKNOWN
> PartitionName=partition Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>
> Thank you for your help.
>
>
>
>
>
>
>
> --
> Tom Payerle
> DIT-ACIGS/Mid-Atlantic Crossroads        payerle at umd.edu
> 5825 University Research Park               (301) 405-6135
> University of Maryland
> College Park, MD 20740-3831
>
>