[slurm-users] srun fails for certain partitions, but works with sbatch

Li, Yee Ting ytl at slac.stanford.edu
Thu Feb 6 21:31:43 UTC 2020


Hi,

I have a node that is registered in two separate partitions, "ML" and "shared". I'm running Slurm 19.05.5-1.

Everything in batch works well. I have users submitting into the "shared" partition with the QoS "scavenger", which is preemptible by "normal" QoS submissions. The default QoS is set up to be "scavenger" on "shared" and "normal" elsewhere, and users can only submit to the "shared" partition with the scavenger QoS.
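
For reference, this is roughly how the setup is wired (a simplified sketch, not a verbatim copy of my config; the node list, PreemptMode and AllowQos values below are illustrative):

# slurm.conf (sketch)
PreemptType=preempt/qos
PreemptMode=REQUEUE
PartitionName=ml     Nodes=ml-gpu01 ...
PartitionName=shared Nodes=ml-gpu01 AllowQos=scavenger ...

# accounting side: let the "normal" QoS preempt "scavenger"
$ sacctmgr add qos scavenger
$ sacctmgr modify qos normal set Preempt=scavenger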

However, users are seeing srun stall, and it only seems to happen for srun/salloc in the "shared" partition, not in the "ML" partition, i.e.:

$ srun -vv -A shared -p shared -w ml-gpu01 --pty /bin/bash
srun: defined options
srun: -------------------- --------------------
srun: account             : shared
srun: nodelist            : ml-gpu01
srun: partition           : shared
srun: pty                 : set
srun: verbose             : 2
srun: -------------------- --------------------
srun: end of defined options
srun: debug:  propagating RLIMIT_CPU=18446744073709551615
srun: debug:  propagating RLIMIT_FSIZE=18446744073709551615
srun: debug:  propagating RLIMIT_DATA=18446744073709551615
srun: debug:  propagating RLIMIT_STACK=8388608
srun: debug:  propagating RLIMIT_CORE=0
srun: debug:  propagating RLIMIT_RSS=18446744073709551615
srun: debug:  propagating RLIMIT_NPROC=4096
srun: debug:  propagating RLIMIT_NOFILE=1024
srun: debug:  propagating RLIMIT_AS=18446744073709551615
srun: debug:  propagating SLURM_PRIO_PROCESS=0
srun: debug:  propagating UMASK=0022
srun: debug:  Entering slurm_allocation_msg_thr_create()
srun: debug:  _is_port_ok: bind() failed port 61704 sock 6 Address already in use
srun: debug:  port from net_stream_listen is 61705
srun: debug:  Entering _msg_thr_internal
srun: debug:  _is_port_ok: bind() failed port 61704 sock 9 Address already in use
srun: debug:  _is_port_ok: bind() failed port 61705 sock 9 Address already in use
srun: debug:  Munge authentication plugin loaded
srun: job 1432 queued and waiting for resources
<stalled>

$ scontrol show jobid 1432
JobId=1432 JobName=bash
   UserId=ytl(7017) GroupId=sf(1051) MCS_label=N/A
   Priority=9311 Nice=0 Account=shared QOS=scavenger
   JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=1:0
   RunTime=00:00:10 TimeLimit=04:00:00 TimeMin=N/A
   SubmitTime=2020-02-05T19:48:08 EligibleTime=2020-02-05T19:48:08
   AccrueTime=2020-02-05T19:48:08
   StartTime=2020-02-05T19:48:09 EndTime=2020-02-05T19:48:19 Deadline=N/A
   PreemptEligibleTime=2020-02-05T19:48:09 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-02-05T19:48:09
   Partition=shared AllocNode:Sid=ocio-gpu01:28326
   ReqNodeList=ml-gpu01 ExcNodeList=(null)
   NodeList=ml-gpu01
   BatchHost=ml-gpu01
   NumNodes=1 NumCPUs=1 NumTasks=0 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=2000M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/bash
   WorkDir=/gpfs/slac/cryo/fs1/u/ytl
   Power=
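
If it helps, I can also pull the accounting record for job 1432; a query along these lines (all standard sacct format fields) should show the same job from the database side:

$ sacct -j 1432 --format=JobID,JobName,State,ExitCode,Elapsed,NodeList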
   
However, if I submit into the ML partition (or indeed any other "normal" QoS partition), it works fine:

$ srun -vv -A ml -p ml -w ml-gpu01 --pty /bin/bash
srun: defined options
srun: -------------------- --------------------
srun: account             : ml
srun: nodelist            : ml-gpu01
srun: partition           : ml
srun: pty                 : set
srun: verbose             : 2
srun: -------------------- --------------------
srun: end of defined options
srun: debug:  propagating RLIMIT_CPU=18446744073709551615
srun: debug:  propagating RLIMIT_FSIZE=18446744073709551615
srun: debug:  propagating RLIMIT_DATA=18446744073709551615
srun: debug:  propagating RLIMIT_STACK=8388608
srun: debug:  propagating RLIMIT_CORE=0
srun: debug:  propagating RLIMIT_RSS=18446744073709551615
srun: debug:  propagating RLIMIT_NPROC=4096
srun: debug:  propagating RLIMIT_NOFILE=1024
srun: debug:  propagating RLIMIT_AS=18446744073709551615
srun: debug:  propagating SLURM_PRIO_PROCESS=0
srun: debug:  propagating UMASK=0022
srun: debug:  Entering slurm_allocation_msg_thr_create()
srun: debug:  _is_port_ok: bind() failed port 61323 sock 6 Address already in use
srun: debug:  port from net_stream_listen is 61324
srun: debug:  Entering _msg_thr_internal
srun: debug:  _is_port_ok: bind() failed port 61323 sock 9 Address already in use
srun: debug:  _is_port_ok: bind() failed port 61324 sock 9 Address already in use
srun: debug:  Munge authentication plugin loaded
srun: jobid 1434: nodes(1):`ml-gpu01', cpu counts: 1(x1)
srun: debug:  requesting job 1434, user 7017, nodes 1 including (ml-gpu01)
srun: debug:  cpus 1, tasks 1, name bash, relative 65534
srun: debug:  _is_port_ok: bind() failed port 61323 sock 9 Address already in use
srun: debug:  _is_port_ok: bind() failed port 61324 sock 9 Address already in use
srun: debug:  _is_port_ok: bind() failed port 61323 sock 10 Address already in use
srun: debug:  _is_port_ok: bind() failed port 61324 sock 10 Address already in use
srun: debug:  _is_port_ok: bind() failed port 61325 sock 10 Address already in use
srun: CpuBindType=(null type)
srun: debug:  Entering slurm_step_launch
srun: debug:  mpi type = (null)
srun: debug:  Using mpi/none
srun: debug:  Entering _msg_thr_create()
srun: debug:  _is_port_ok: bind() failed port 61323 sock 15 Address already in use
srun: debug:  _is_port_ok: bind() failed port 61324 sock 15 Address already in use
srun: debug:  _is_port_ok: bind() failed port 61325 sock 15 Address already in use
srun: debug:  _is_port_ok: bind() failed port 61326 sock 15 Address already in use
srun: debug:  _is_port_ok: bind() failed port 61323 sock 18 Address already in use
srun: debug:  _is_port_ok: bind() failed port 61324 sock 18 Address already in use
srun: debug:  _is_port_ok: bind() failed port 61325 sock 18 Address already in use
srun: debug:  _is_port_ok: bind() failed port 61326 sock 18 Address already in use
srun: debug:  _is_port_ok: bind() failed port 61327 sock 18 Address already in use
srun: debug:  initialized stdio listening socket, port 61328
srun: debug:  Started IO server thread (140594085246720)
srun: debug:  Entering _launch_tasks
srun: launching 1434.0 on host ml-gpu01, 1 tasks: 0
srun: route default plugin loaded
srun: debug:  launch returned msg_rc=0 err=0 type=8001
srun: Node ml-gpu01, 1 tasks started

$ scontrol show jobid 1434
JobId=1434 JobName=bash
   UserId=ytl(7017) GroupId=sf(1051) MCS_label=N/A
   Priority=17988 Nice=0 Account=ml QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:12 TimeLimit=04:00:00 TimeMin=N/A
   SubmitTime=2020-02-05T19:53:50 EligibleTime=2020-02-05T19:53:50
   AccrueTime=Unknown
   StartTime=2020-02-05T19:53:50 EndTime=2020-02-05T23:53:50 Deadline=N/A
   PreemptEligibleTime=2020-02-05T19:53:50 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-02-05T19:53:50
   Partition=ml AllocNode:Sid=ocio-gpu01:28326
   ReqNodeList=ml-gpu01 ExcNodeList=(null)
   NodeList=ml-gpu01
   BatchHost=ml-gpu01
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=4000M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/bash
   WorkDir=/gpfs/slac/cryo/fs1/u/ytl
   Power=

I do not have any limits set on any of the QoSes yet.
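
(For what it's worth, this is the kind of query I use to check the QoS settings; all of the format fields are standard sacctmgr ones:)

$ sacctmgr show qos format=Name,Priority,Preempt,PreemptMode,MaxWall,MaxTRESPU,MaxJobsPU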

I'm also pretty sure that there is no firewall between the node where srun runs and the node the job lands on (ml-gpu01).
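
Since the stall happens right after "queued and waiting for resources", my guess is that some callback between srun, slurmctld and slurmd is not getting through, so a couple of sanity checks along these lines seem worth running (ports shown are the Slurm defaults; SrunPortRange may or may not be set on a given site):

$ scontrol show config | grep -i -E 'SrunPortRange|SlurmdPort|SlurmctldPort'
$ nc -zv ml-gpu01 6818     # slurmd's default port; adjust if SlurmdPort differs

# and, while a stalled srun is still sitting there, from ml-gpu01 back to the
# submit host against the port srun reported (61705 in the trace above):
$ nc -zv ocio-gpu01 61705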

Any ideas?

Cheers,

