[slurm-users] Sinfo or squeue stuck for some seconds

navin srivastava navin.altair at gmail.com
Sun Aug 29 14:53:21 UTC 2021


Dear slurm community users,

We are using Slurm version 20.02.x.

We see the messages below appearing many times in the slurmctld log, and
whenever they appear the sinfo/squeue output becomes slow.
There are no timeouts, since I have MessageTimeout set to 100.

Warning: Note very large processing time from load_part_uid_allow_list: usec=10800885 began=16:27:55.952
[2021-08-29T16:28:06.753] Warning: Note very large processing time from _slurmctld_background: usec=10801120 began=16:27:55.952
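For reference, a minimal sketch of how one might confirm that the client-side
slowness lines up with these warnings (it assumes sdiag is available and that
its output includes the server thread count and agent queue size, as in recent
Slurm releases):

    # Time the client commands while the warning is being logged.
    time sinfo
    time squeue

    # Check how busy slurmctld is; a high server thread count or a growing
    # agent queue while sinfo hangs suggests the controller is blocked.
    sdiag | grep -i -E 'server thread|agent queue'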

Is this a bug or a configuration issue? If anybody has faced a similar
issue, could you please shed some light on it?
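Judging by its name, load_part_uid_allow_list rebuilds the per-partition lists
of allowed UIDs, which usually means resolving group membership through NSS, so
slow LDAP/sssd lookups on the controller host could plausibly account for the
~10.8 second stall (usec=10800885). A rough check along those lines (the group
and user names below are placeholders, not taken from the attached config):

    # Run on the slurmctld host; lookups taking whole seconds here would be
    # enough to explain the processing-time warning above.
    time getent group some_group    # placeholder group name
    time getent passwd some_user    # placeholder user name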

Please find the slurm.conf attached.

Regards
Navin.
-------------- next part --------------
ClusterName=merckhpc
ControlMachine=Master
ControlAddr=localhost
AuthType=auth/munge
CredType=cred/munge
CacheGroups=1
ReturnToService=0
ProctrackType=proctrack/linuxproc
SlurmctldPort=6817
SlurmdPort=6818
SchedulerPort=7321

SlurmctldPidFile=/var/slurm/slurmctld.pid
SlurmdPidFile=/var/slurm/slurmd.%n.pid
SlurmdSpoolDir=/var/slurm/spool/slurmd.%n.spool
StateSaveLocation=/var/slurm/state
SlurmctldLogFile=/var/slurm/log/slurmctld.log
SlurmdLogFile=/var/slurm/log/slurmd.%n.log.%h
SlurmUser=hpcadmin
MpiDefault=none

SwitchType=switch/none
TaskPlugin=task/affinity
TaskPluginParam=Sched
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
KillWait=30
MinJobAge=3600


SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core

AccountingStorageEnforce=associations
AccountingStorageHost=localhost
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES


JobCompType=jobcomp/slurmdbd
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmdDebug=5
SlurmctldDebug=5
Waittime=0

Epilog=/etc/slurm/slurm.epilog.clean
GresTypes=gpu
MaxArraySize=10000
MaxJobCount=5000000
MessageTimeout=100


SchedulerParameters=enable_user_top,default_queue_depth=1000000
PriorityType=priority/multifactor
PriorityDecayHalfLife=2
PriorityUsageResetPeriod=DAILY
PriorityWeightFairshare=500000
PriorityFlags=FAIR_TREE


NodeName=node[35-40] NodeHostname=bng1x[1847-1852] NodeAddr=node[35-40] CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=386626
NodeName=node[17-26] NodeHostName=bng1x[1590-1599] NodeAddr=node[17-26] CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=257680  Feature=K2200 Gres=gpu:2
NodeName=node41 NodeHostName=bng1x1855 NodeAddr=node41 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=386643 Feature=V100S Gres=gpu:2
NodeName=node[32-33] NodeHostname=bng1x[1793-1794] NodeAddr=node[32-33] Sockets=2 CoresPerSocket=24 RealMemory=773690
NodeName=node[28-31] NodeHostname=bng1x[1737-1740] NodeAddr=node[28-31] Sockets=2 CoresPerSocket=28 RealMemory=257586
NodeName=node[27] NodeHostname=bng1x1600 NodeAddr=node27 Sockets=2 CoresPerSocket=18 RealMemory=515728 Feature=K40 Gres=gpu:2
NodeName=node[34] NodeHostname=bng1x1795 NodeAddr=node34 Sockets=2 CoresPerSocket=24 RealMemory=773682 Feature=RTX Gres=gpu:8
PartitionName=Normal  Nodes=node[28-33,35-40]  Default=Yes MaxTime=INFINITE State=UP Shared=YES OverSubscribe=NO 
PartitionName=testq  Nodes=node41  Default=NO MaxTime=INFINITE State=UP Shared=YES
PartitionName=smallgpu Nodes=node[34]  Default=NO MaxTime=INFINITE State=UP Shared=YES OverSubscribe=NO 
PartitionName=biggpu  Nodes=node[17-27]  Default=NO MaxTime=INFINITE State=UP Shared=YES OverSubscribe=NO 
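(Note that none of the partition lines above set AllowGroups, which, as far as
I understand, is what load_part_uid_allow_list resolves into the per-partition
UID lists. If group-based access control were in use, a partition would look
roughly like the hypothetical line below, with hpc_users as a placeholder
group name:

    PartitionName=Normal Nodes=node[28-33,35-40] Default=YES MaxTime=INFINITE State=UP AllowGroups=hpc_users
)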

