[slurm-users] sbatch tasks stuck in queue when a job is hung

Brian Andrus toomuchit at gmail.com
Fri Aug 30 16:06:30 UTC 2019


After you restart slurmctld, run "scontrol reconfigure".
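
Something along these lines, assuming systemd-managed daemons and pdsh
access to the nodes (adjust to however Bright manages the services):

    systemctl restart slurmctld                      # on the head node
    pdsh -w node00[1-3] 'systemctl restart slurmd'   # on the compute nodes
    scontrol reconfigure                             # have all daemons re-read slurm.conf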

Brian Andrus

On 8/30/2019 6:57 AM, Robert Kudyba wrote:
> I had set RealMemory to a really high number because I misinterpreted
> the recommendation:
> NodeName=node[001-003] CoresPerSocket=12 RealMemory=196489092 Sockets=2 Gres=gpu:1
>
> But now I set it to:
> RealMemory=191000
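>
> (For what it's worth, running "slurmd -C" on a compute node prints the
> hardware values slurmd itself detects, including RealMemory, so a quick
> sanity check is something like:
>
> pdsh -w node00[1-3] 'slurmd -C | grep RealMemory'
>
> RealMemory in slurm.conf just needs to be at or below whatever each node
> reports there.)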
>
> I restarted slurmctld. And according to the Bright Cluster support team:
> "Unless it has been overridden in the image, the nodes will have a 
> symlink directly to the slurm.conf on the head node. This means that 
> any changes made to the file on the head node will automatically be 
> available to the compute nodes. All they would need in that case is to 
> have slurmd restarted"
>
> But now I see these errors:
>
> mcs: MCSParameters = (null). ondemand set.
> [2019-08-30T09:22:41.700] error: Node node001 appears to have a 
> different slurm.conf than the slurmctld.  This could cause issues with 
> communication and functionality.  Please review both files and make 
> sure they are the same.  If this is expected ignore, and set 
> DebugFlags=NO_CONF_HASH in your slurm.conf.
> [2019-08-30T09:22:41.700] error: Node node002 appears to have a 
> different slurm.conf than the slurmctld.  This could cause issues with 
> communication and functionality.  Please review both files and make 
> sure they are the same.  If this is expected ignore, and set 
> DebugFlags=NO_CONF_HASH in your slurm.conf.
> [2019-08-30T09:22:41.701] error: Node node003 appears to have a 
> different slurm.conf than the slurmctld.  This could cause issues with 
> communication and functionality.  Please review both files and make 
> sure they are the same.  If this is expected ignore, and set 
> DebugFlags=NO_CONF_HASH in your slurm.conf.
> [2019-08-30T09:23:16.347] update_node: node node001 state set to IDLE
> [2019-08-30T09:23:16.347] got (nil)
> [2019-08-30T09:23:16.766] 
> SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
> [2019-08-30T09:23:19.082] update_node: node node002 state set to IDLE
> [2019-08-30T09:23:19.082] got (nil)
> [2019-08-30T09:23:20.929] update_node: node node003 state set to IDLE
> [2019-08-30T09:23:20.929] got (nil)
> [2019-08-30T09:45:46.314] _slurm_rpc_submit_batch_job: JobId=449 
> InitPrio=4294901759 usec=355
> [2019-08-30T09:45:46.430] sched: Allocate JobID=449 
> NodeList=node[001-003] #CPUs=30 Partition=defq
> [2019-08-30T09:45:46.670] prolog_running_decr: Configuration for 
> JobID=449 is complete
> [2019-08-30T09:45:46.772] _job_complete: JobID=449 State=0x1 NodeCnt=3 
> WEXITSTATUS 127
> [2019-08-30T09:45:46.772] _job_complete: JobID=449 State=0x8005 
> NodeCnt=3 done
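>
> (As a side note, WEXITSTATUS 127 is the shell's "command not found"
> status, so once the nodes are accepting jobs it may be worth checking
> that mpirun and the hello binary actually resolve on the compute nodes,
> e.g. something like:
>
> srun -N3 bash -lc 'module add shared openmpi/gcc/64/1.10.7 slurm; type mpirun hello'
>
> using the same module line as in slurmhello.sh.)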
>
> Is DebugFlags=NO_CONF_HASH another option that needs to be set?
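>
> (One thing that might be worth checking first is whether the files
> really differ, e.g. by comparing checksums:
>
> md5sum /etc/slurm/slurm.conf
> pdsh -w node00[1-3] 'md5sum /etc/slurm/slurm.conf'
>
> If the Bright symlink is in place the hashes should match, and the
> "different slurm.conf" error should clear once slurmd has been restarted
> on each node.)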
>
> On Thu, Aug 29, 2019 at 3:27 PM Alex Chekholko <alex at calicolabs.com> wrote:
>
>     Sounds like maybe you didn't correctly roll out / update your
>     slurm.conf everywhere, as your RealMemory value is back to the
>     large, incorrect number.  You need to update your slurm.conf
>     everywhere and restart all the slurm daemons.
>
>     I recommend the "safe procedure" from here:
>     https://wiki.fysik.dtu.dk/niflheim/SLURM#add-and-remove-nodes
>     Your Bright manual may have a similar process for updating SLURM
>     config "the Bright way".
>
>     On Thu, Aug 29, 2019 at 12:20 PM Robert Kudyba <rkudyba at fordham.edu> wrote:
>
>         I thought I had taken care of this a while back, but it appears
>         the issue has returned. A very simple sbatch script, slurmhello.sh:
>          cat slurmhello.sh
>         #!/bin/sh
>         #SBATCH -o my.stdout
>         #SBATCH -N 3
>         #SBATCH --ntasks=16
>         module add shared openmpi/gcc/64/1.10.7 slurm
>         mpirun hello
>
>         sbatch slurmhello.sh
>         Submitted batch job 419
>
>         squeue
>                      JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
>                        419      defq slurmhel     root PD  0:00      3 (Resources)
>
>         In /etc/slurm/slurm.conf:
>         # Nodes
>         NodeName=node[001-003] CoresPerSocket=12 RealMemory=196489092 Sockets=2 Gres=gpu:1
>
>         Logs show:
>         [2019-08-29T14:24:40.025] error: _slurm_rpc_node_registration
>         node=node001: Invalid argument
>         [2019-08-29T14:24:40.025] error: Node node002 has low
>         real_memory size (191840 < 196489092)
>         [2019-08-29T14:24:40.025] error: _slurm_rpc_node_registration
>         node=node002: Invalid argument
>         [2019-08-29T14:24:40.026] error: Node node003 has low
>         real_memory size (191840 < 196489092)
>         [2019-08-29T14:24:40.026] error: _slurm_rpc_node_registration
>         node=node003: Invalid argument
>
>         scontrol show jobid -dd 419
>         JobId=419 JobName=slurmhello.sh
>            UserId=root(0) GroupId=root(0) MCS_label=N/A
>            Priority=4294901759 Nice=0 Account=root QOS=normal
>            JobState=PENDING Reason=Resources Dependency=(null)
>            Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>            DerivedExitCode=0:0
>            RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
>            SubmitTime=2019-08-28T09:54:22 EligibleTime=2019-08-28T09:54:22
>            StartTime=Unknown EndTime=Unknown Deadline=N/A
>            PreemptTime=None SuspendTime=None SecsPreSuspend=0
>            LastSchedEval=2019-08-28T09:57:22
>            Partition=defq AllocNode:Sid=ourcluster:194152
>            ReqNodeList=(null) ExcNodeList=(null)
>            NodeList=(null)
>            NumNodes=3-3 NumCPUs=16 NumTasks=16 CPUs/Task=1
>         ReqB:S:C:T=0:0:*:*
>            TRES=cpu=16,node=3
>            Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>            MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>            Features=(null) DelayBoot=00:00:00
>            Gres=(null) Reservation=(null)
>            OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
>            Command=/root/slurmhello.sh
>            WorkDir=/root
>            StdErr=/root/my.stdout
>            StdIn=/dev/null
>            StdOut=/root/my.stdout
>            Power=
>
>         scontrol show nodes node001
>         NodeName=node001 Arch=x86_64 CoresPerSocket=12
>            CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.06
>            AvailableFeatures=(null)
>            ActiveFeatures=(null)
>            Gres=gpu:1
>            NodeAddr=node001 NodeHostName=node001 Version=17.11
>            OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9
>         18:05:47 UTC 2018
>            RealMemory=196489092 AllocMem=0 FreeMem=99923 Sockets=2
>         Boards=1
>            State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1
>         Owner=N/A MCS_label=N/A
>            Partitions=defq
>            BootTime=2019-07-18T12:08:41
>         SlurmdStartTime=2019-07-18T12:09:44
>            CfgTRES=cpu=24,mem=196489092M,billing=24
>            AllocTRES=
>            CapWatts=n/a
>            CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>            ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>            Reason=Low RealMemory [slurm at 2019-07-18T10:17:24]
>
>         [root at ciscluster ~]# scontrol show nodes| grep -i mem
>            RealMemory=196489092 AllocMem=0 FreeMem=99923 Sockets=2
>         Boards=1
>            CfgTRES=cpu=24,mem=196489092M,billing=24
>            Reason=Low RealMemory [slurm at 2019-07-18T10:17:24]
>            RealMemory=196489092 AllocMem=0 FreeMem=180969 Sockets=2
>         Boards=1
>            CfgTRES=cpu=24,mem=196489092M,billing=24
>            Reason=Low RealMemory [slurm at 2019-07-18T10:17:24]
>            RealMemory=196489092 AllocMem=0 FreeMem=178999 Sockets=2
>         Boards=1
>            CfgTRES=cpu=24,mem=196489092M,billing=24
>            Reason=Low RealMemory [slurm at 2019-07-18T10:17:24]
>
>         sinfo -R
>         REASON               USER      TIMESTAMP  NODELIST
>         Low RealMemory       slurm     2019-07-18T10:17:24 node[001-003]
>
>         sinfo -N
>         NODELIST   NODES PARTITION STATE
>         node001        1     defq* drain
>         node002        1     defq* drain
>         node003        1     defq* drain
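>
>         (Once RealMemory is brought down to or below what the nodes
>         actually report, the drain usually has to be cleared by hand
>         with something like:
>
>         scontrol update NodeName=node[001-003] State=RESUME
>
>         after which sinfo should show the nodes as idle again.)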
>
>         pdsh -w node00[1-3]  "lscpu | grep -iE 'socket|core'"
>         node002: Thread(s) per core:    1
>         node002: Core(s) per socket:    12
>         node002: Socket(s):             2
>         node001: Thread(s) per core:    1
>         node001: Core(s) per socket:    12
>         node001: Socket(s):             2
>         node003: Thread(s) per core:    2
>         node003: Core(s) per socket:    12
>         node003: Socket(s):             2
>
>         scontrol show nodes| grep -i mem
>            RealMemory=196489092 AllocMem=0 FreeMem=100054 Sockets=2
>         Boards=1
>            CfgTRES=cpu=24,mem=196489092M,billing=24
>            Reason=Low RealMemory [slurm at 2019-07-18T10:17:24]
>            RealMemory=196489092 AllocMem=0 FreeMem=181101 Sockets=2
>         Boards=1
>            CfgTRES=cpu=24,mem=196489092M,billing=24
>            Reason=Low RealMemory [slurm at 2019-07-18T10:17:24]
>            RealMemory=196489092 AllocMem=0 FreeMem=179004 Sockets=2
>         Boards=1
>            CfgTRES=cpu=24,mem=196489092M,billing=24
>            Reason=Low RealMemory
>
>         Does anything look off?
>