[slurm-users] sbatch tasks stuck in queue when a job is hung
Robert Kudyba
rkudyba at fordham.edu
Fri Aug 30 13:57:09 UTC 2019
I had set RealMemory to a really high number because I misinterpreted the
recommendation:
NodeName=node[001-003] CoresPerSocket=12 RealMemory=196489092 Sockets=2 Gres=gpu:1
But now I set it to:
RealMemory=191000
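
For what it's worth, slurmd will report the value it actually detects; a
minimal sketch of checking that on every node, assuming the same pdsh setup
as in the quoted message below (the 191840 MB figure is simply what the nodes
reported in the earlier registration errors):

# Ask slurmd on each node what hardware it detects; RealMemory is in MB:
pdsh -w node00[1-3] "slurmd -C | grep RealMemory"
# Each node should print a single NodeName=... line, roughly:
# node001: NodeName=node001 CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=191840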
I restarted slurmctld. According to the Bright Cluster support team:
"Unless it has been overridden in the image, the nodes will have a symlink
directly to the slurm.conf on the head node. This means that any changes
made to the file on the head node will automatically be available to the
compute nodes. All they would need in that case is to have slurmd restarted"
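
A minimal sketch of pushing that out by hand, assuming systemd units and the
same pdsh setup as in the quoted message below (Bright may well have its own
wrapper for this):

# Have slurmctld re-read slurm.conf without a full restart:
scontrol reconfigure
# Restart slurmd on the compute nodes so they register with the new RealMemory:
pdsh -w node00[1-3] "systemctl restart slurmd"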
But now I see these errors:
mcs: MCSParameters = (null). ondemand set.
[2019-08-30T09:22:41.700] error: Node node001 appears to have a different
slurm.conf than the slurmctld. This could cause issues with communication
and functionality. Please review both files and make sure they are the
same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your
slurm.conf.
[2019-08-30T09:22:41.700] error: Node node002 appears to have a different
slurm.conf than the slurmctld. This could cause issues with communication
and functionality. Please review both files and make sure they are the
same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your
slurm.conf.
[2019-08-30T09:22:41.701] error: Node node003 appears to have a different
slurm.conf than the slurmctld. This could cause issues with communication
and functionality. Please review both files and make sure they are the
same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your
slurm.conf.
[2019-08-30T09:23:16.347] update_node: node node001 state set to IDLE
[2019-08-30T09:23:16.347] got (nil)
[2019-08-30T09:23:16.766] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2019-08-30T09:23:19.082] update_node: node node002 state set to IDLE
[2019-08-30T09:23:19.082] got (nil)
[2019-08-30T09:23:20.929] update_node: node node003 state set to IDLE
[2019-08-30T09:23:20.929] got (nil)
[2019-08-30T09:45:46.314] _slurm_rpc_submit_batch_job: JobId=449 InitPrio=4294901759 usec=355
[2019-08-30T09:45:46.430] sched: Allocate JobID=449 NodeList=node[001-003] #CPUs=30 Partition=defq
[2019-08-30T09:45:46.670] prolog_running_decr: Configuration for JobID=449 is complete
[2019-08-30T09:45:46.772] _job_complete: JobID=449 State=0x1 NodeCnt=3 WEXITSTATUS 127
[2019-08-30T09:45:46.772] _job_complete: JobID=449 State=0x8005 NodeCnt=3 done
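
As an aside, WEXITSTATUS 127 is the shell's conventional "command not found"
code, so JobId=449 looks like it failed because mpirun or hello did not
resolve on the compute nodes rather than because of the memory setting. A
hypothetical quick check, loading the same modules slurmhello.sh uses below:

# Open a login shell on each node and see whether the binaries resolve
# once the job's modules are loaded (module names taken from slurmhello.sh):
srun -N3 bash -lc 'module add shared openmpi/gcc/64/1.10.7 slurm; which mpirun hello'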
Is this another option that needs to be set?
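
Before setting DebugFlags=NO_CONF_HASH, it may be worth confirming whether the
files really differ on disk or whether a slurmd is simply still running with an
older copy of the config in memory; a quick sketch, assuming the
/etc/slurm/slurm.conf path from the quoted message below:

# Compare the config checksum on the head node against each compute node;
# with the symlink in place they should all be identical:
md5sum /etc/slurm/slurm.conf
pdsh -w node00[1-3] "md5sum /etc/slurm/slurm.conf"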
On Thu, Aug 29, 2019 at 3:27 PM Alex Chekholko <alex at calicolabs.com> wrote:
> Sounds like maybe you didn't correctly roll out / update your slurm.conf
> everywhere, as your RealMemory value is back to your large wrong number.
> You need to update your slurm.conf everywhere and restart all the slurm
> daemons.
>
> I recommend the "safe procedure" from here:
> https://wiki.fysik.dtu.dk/niflheim/SLURM#add-and-remove-nodes
> Your Bright manual may have a similar process for updating SLURM config
> "the Bright way".
>
> On Thu, Aug 29, 2019 at 12:20 PM Robert Kudyba <rkudyba at fordham.edu>
> wrote:
>
>> I thought I had taken care of this a while back, but it appears the issue
>> has returned. A very simple sbatch script, slurmhello.sh:
>> cat slurmhello.sh
>> #!/bin/sh
>> #SBATCH -o my.stdout
>> #SBATCH -N 3
>> #SBATCH --ntasks=16
>> module add shared openmpi/gcc/64/1.10.7 slurm
>> mpirun hello
>>
>> sbatch slurmhello.sh
>> Submitted batch job 419
>>
>> squeue
>> JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
>> 419 defq slurmhel root PD 0:00 3 (Resources)
>>
>> In /etc/slurm/slurm.conf:
>> # Nodes
>> NodeName=node[001-003] CoresPerSocket=12 RealMemory=196489092 Sockets=2 Gres=gpu:1
>>
>> Logs show:
>> [2019-08-29T14:24:40.025] error: _slurm_rpc_node_registration node=node001: Invalid argument
>> [2019-08-29T14:24:40.025] error: Node node002 has low real_memory size (191840 < 196489092)
>> [2019-08-29T14:24:40.025] error: _slurm_rpc_node_registration node=node002: Invalid argument
>> [2019-08-29T14:24:40.026] error: Node node003 has low real_memory size (191840 < 196489092)
>> [2019-08-29T14:24:40.026] error: _slurm_rpc_node_registration node=node003: Invalid argument
>>
>> scontrol show jobid -dd 419
>> JobId=419 JobName=slurmhello.sh
>> UserId=root(0) GroupId=root(0) MCS_label=N/A
>> Priority=4294901759 Nice=0 Account=root QOS=normal
>> JobState=PENDING Reason=Resources Dependency=(null)
>> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>> DerivedExitCode=0:0
>> RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
>> SubmitTime=2019-08-28T09:54:22 EligibleTime=2019-08-28T09:54:22
>> StartTime=Unknown EndTime=Unknown Deadline=N/A
>> PreemptTime=None SuspendTime=None SecsPreSuspend=0
>> LastSchedEval=2019-08-28T09:57:22
>> Partition=defq AllocNode:Sid=ourcluster:194152
>> ReqNodeList=(null) ExcNodeList=(null)
>> NodeList=(null)
>> NumNodes=3-3 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>> TRES=cpu=16,node=3
>> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>> MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>> Features=(null) DelayBoot=00:00:00
>> Gres=(null) Reservation=(null)
>> OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
>> Command=/root/slurmhello.sh
>> WorkDir=/root
>> StdErr=/root/my.stdout
>> StdIn=/dev/null
>> StdOut=/root/my.stdout
>> Power=
>>
>> scontrol show nodes node001
>> NodeName=node001 Arch=x86_64 CoresPerSocket=12
>> CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.06
>> AvailableFeatures=(null)
>> ActiveFeatures=(null)
>> Gres=gpu:1
>> NodeAddr=node001 NodeHostName=node001 Version=17.11
>> OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9 18:05:47 UTC 2018
>> RealMemory=196489092 AllocMem=0 FreeMem=99923 Sockets=2 Boards=1
>> State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
>> MCS_label=N/A
>> Partitions=defq
>> BootTime=2019-07-18T12:08:41 SlurmdStartTime=2019-07-18T12:09:44
>> CfgTRES=cpu=24,mem=196489092M,billing=24
>> AllocTRES=
>> CapWatts=n/a
>> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>> Reason=Low RealMemory [slurm at 2019-07-18T10:17:24]
>>
>> [root at ciscluster ~]# scontrol show nodes| grep -i mem
>> RealMemory=196489092 AllocMem=0 FreeMem=99923 Sockets=2 Boards=1
>> CfgTRES=cpu=24,mem=196489092M,billing=24
>> Reason=Low RealMemory [slurm at 2019-07-18T10:17:24]
>> RealMemory=196489092 AllocMem=0 FreeMem=180969 Sockets=2 Boards=1
>> CfgTRES=cpu=24,mem=196489092M,billing=24
>> Reason=Low RealMemory [slurm at 2019-07-18T10:17:24]
>> RealMemory=196489092 AllocMem=0 FreeMem=178999 Sockets=2 Boards=1
>> CfgTRES=cpu=24,mem=196489092M,billing=24
>> Reason=Low RealMemory [slurm at 2019-07-18T10:17:24]
>>
>> sinfo -R
>> REASON USER TIMESTAMP NODELIST
>> Low RealMemory slurm 2019-07-18T10:17:24 node[001-003]
>>
>> sinfo -N
>> NODELIST NODES PARTITION STATE
>> node001 1 defq* drain
>> node002 1 defq* drain
>> node003 1 defq* drain
>>
>> pdsh -w node00[1-3] "lscpu | grep -iE 'socket|core'"
>> node002: Thread(s) per core: 1
>> node002: Core(s) per socket: 12
>> node002: Socket(s): 2
>> node001: Thread(s) per core: 1
>> node001: Core(s) per socket: 12
>> node001: Socket(s): 2
>> node003: Thread(s) per core: 2
>> node003: Core(s) per socket: 12
>> node003: Socket(s): 2
>>
>> scontrol show nodes| grep -i mem
>> RealMemory=196489092 AllocMem=0 FreeMem=100054 Sockets=2 Boards=1
>> CfgTRES=cpu=24,mem=196489092M,billing=24
>> Reason=Low RealMemory [slurm at 2019-07-18T10:17:24]
>> RealMemory=196489092 AllocMem=0 FreeMem=181101 Sockets=2 Boards=1
>> CfgTRES=cpu=24,mem=196489092M,billing=24
>> Reason=Low RealMemory [slurm at 2019-07-18T10:17:24]
>> RealMemory=196489092 AllocMem=0 FreeMem=179004 Sockets=2 Boards=1
>> CfgTRES=cpu=24,mem=196489092M,billing=24
>> Reason=Low RealMemory
>>
>> Does anything look off?
>>
>