Hello
I am writing to report an issue with the slurmctld process on our RHEL 9 (Rocky Linux) system.
Twice in the past 5 days, slurmctld has aborted and the service has stopped. The error messages were "double free or corruption (fasttop)" on the first occasion and "double free or corruption (out)" on the second. These crashes have significantly disrupted our jobs, and we are concerned that they will recur.
We have tried troubleshooting the issue, but we have not been able to identify the root cause of the problem. We would appreciate any assistance or guidance you can provide to help us resolve this issue.
Please let us know if you need any additional information or if there are any specific steps we should take to diagnose the problem further.
Thank you for your attention to this matter.
Best regards,
_________________________
Jul 09 22:12:01 admin slurmctld[711010]: double free or corruption (fasttop)
Jul 09 22:12:01 admin systemd[1]: slurmctld.service: Main process exited, code=killed, status=6/ABRT
Jul 09 22:12:01 admin systemd[1]: slurmctld.service: Failed with result 'signal'.
Jul 09 22:12:01 admin systemd[1]: slurmctld.service: Consumed 11min 26.451s CPU time.
.....
Jul 14 10:15:01 admin slurmctld[1633720]: double free or corruption (out)
Jul 14 10:15:02 admin systemd[1]: slurmctld.service: Main process exited, code=killed, status=6/ABRT
Jul 14 10:15:02 admin systemd[1]: slurmctld.service: Failed with result 'signal'.
Jul 14 10:15:02 admin systemd[1]: slurmctld.service: Consumed 7min 27.596s CPU time.
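To check whether later crashes follow the same pattern, the journal could be filtered for these abort messages. The following is only a minimal sketch: the function name slurmctld_aborts is hypothetical, and it assumes the default journalctl "short" output format shown in the excerpts above.

```shell
# Hypothetical helper (not part of Slurm): list glibc heap-corruption aborts
# reported by slurmctld. Reads journal-style lines on stdin and prints
# "<timestamp> pid=<pid> <variant>" for each matching line.
slurmctld_aborts() {
    sed -n -E 's/^([A-Z][a-z]{2} +[0-9]+ [0-9:]{8}) .* slurmctld\[([0-9]+)\]: double free or corruption \(([^)]*)\).*$/\1 pid=\2 \3/p'
}

# Assumed usage on the controller host:
#   journalctl -u slurmctld --no-pager | slurmctld_aborts
```

Run against the two excerpts above, this would print one line per crash, which makes it easier to compare timestamps and the reported variant (fasttop vs. out).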
_________________________
slurmctld -V
slurm 22.05.9
________________________
cat /etc/slurm/slurm.conf | grep -v '#'

ClusterName=xxx
SlurmctldHost=admin
SlurmctldParameters=enable_configless
SlurmUser=slurm
AuthType=auth/munge
CryptoType=crypto/munge

SlurmctldPort=6817
StateSaveLocation=/var/spool/slurmctld
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmctldDebug=verbose
DebugFlags=NO_CONF_HASH

SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmdDebug=verbose

SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core,CR_LLN
DefMemPerCPU=1024
MaxMemPerCPU=4096
GresTypes=gpu

ProctrackType=proctrack/cgroup
JobAcctGatherType=jobacct_gather/cgroup
JobAcctGatherFrequency=15
JobCompType=jobcomp/none

TaskPlugin=task/cgroup
LaunchParameters=use_interactive_step

AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=admin
AccountingStoragePort=6819
AccountingStorageEnforce=associations
AccountingStorageTRES=gres/gpu

MailProg=/usr/bin/mailx
EnforcePartLimits=YES
MaxArraySize=200000
MaxJobCount=500000
MpiDefault=none
ReturnToService=2
SwitchType=switch/none
TmpFS=/tmpslurm/
UsePAM=1

InactiveLimit=0
KillWait=30
MessageTimeout=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0

PriorityType=priority/multifactor
PriorityFlags=FAIR_TREE,MAX_TRES
PriorityDecayHalfLife=1-0
PriorityWeightFairshare=10000

NodeName=xxx NodeHostname=xxx CPUs=4 Sockets=4 RealMemory=3500 TmpDisk=1 CoresPerSocket=1 ThreadsPerCore=1 State=DRAIN
NodeName=xxx NodeHostname=xxx CPUs=2 Sockets=2 RealMemory=1700 TmpDisk=1 CoresPerSocket=1 ThreadsPerCore=1 State=DRAIN
NodeName=xxx NodeHostname=xxx CPUs=4 Sockets=4 RealMemory=1700 TmpDisk=1 CoresPerSocket=1 ThreadsPerCore=1 State=DRAIN
NodeName=xxx NodeHostname=xxx CPUs=4 Sockets=4 RealMemory=3500 TmpDisk=1 CoresPerSocket=1 ThreadsPerCore=1 State=DRAIN

NodeName=r9nc-24-[1-12] NodeHostname=r9nc-24-[1-12] Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 CPUs=24 RealMemory=180000 State=UNKNOWN
NodeName=r9nc-48-[1-4] NodeHostname=r9nc-48-[1-4] Sockets=2 CoresPerSocket=24 ThreadsPerCore=1 CPUs=48 RealMemory=480000 State=UNKNOWN
NodeName=r9ng-1080-[1-7] NodeHostname=r9ng-1080-[1-7] Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 CPUs=20 RealMemory=180000 State=UNKNOWN Gres=gpu:1080ti:4
NodeName=r9ng-1080-8 NodeHostname=r9ng-1080-8 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 CPUs=20 RealMemory=176687 State=UNKNOWN Gres=gpu:1080ti:1

PartitionName=24CPUNodes Nodes=r9nc-24-[1-12] State=UP MaxTime=UNLIMITED OverSubscribe=NO MaxMemPerCPU=7500 DefMemPerCPU=7500 TRESBillingWeights="CPU=1.0,Mem=0.125G" Default=YES
PartitionName=48CPUNodes Nodes=r9nc-48-[1-4] State=UP MaxTime=UNLIMITED OverSubscribe=NO MaxMemPerCPU=10000 DefMemPerCPU=8000 TRESBillingWeights="CPU=1.0,Mem=0.125G"
PartitionName=GPUNodes Nodes=r9ng-1080-[1-7] State=UP MaxTime=UNLIMITED OverSubscribe=NO MaxMemPerCPU=9000 DefMemPerCPU=9000
PartitionName=GPUNodes1080-dev Nodes=r9ng-1080-8 State=UP MaxTime=UNLIMITED OverSubscribe=NO MaxMemPerCPU=9000 DefMemPerCPU=9000 Hidden=Yes
_________________________
sinfo
PARTITION        AVAIL  TIMELIMIT  NODES  STATE NODELIST
24CPUNodes*      up     infinite      12  idle  r9nc-24-[1-12]
48CPUNodes       up     infinite       2  idle  r9nc-48-[1-2]
GPUNodes         up     infinite       4  idle  r9ng-1080-[4-7]
GPUNodes1080-dev up     infinite       1  idle  r9ng-1080-8