What are the contents of your /etc/slurm/job_submit.lua file? Did you reconfigure slurmctld? Check the log file with:
grep job_submit /var/log/slurm/slurmctld.log
What is your Slurm version?
You can read about job_submit plugins on this Wiki page: https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#job-submit-pl...
I hope this helps, Ole
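To verify that the running slurmctld actually picked up the new setting, a quick check on the controller node could look like the following (standard Slurm CLI commands; the log path here is the SlurmctldLogFile value from the posted slurm.conf, which differs from the default /var/log/slurm/ location):

```shell
scontrol show config | grep -i EnforcePartLimits   # value the running daemon is actually using
scontrol reconfigure                               # push slurm.conf changes to the running daemons
grep job_submit /var/log/slurmctld.log             # job_submit plugin messages, if any
sinfo --version                                    # report the Slurm version
```

If `scontrol show config` still reports EnforcePartLimits as NO, the daemon never read the edited file.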
On 3/20/24 09:49, Gestió Servidors via slurm-users wrote:
After adding “EnforcePartLimits=ALL” to slurm.conf and restarting the slurmctld daemon, jobs continue to be accepted… so I don’t understand what I’m doing wrong.
My slurm.conf is this:
ControlMachine=my_server
MailProg=/bin/mail
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
SlurmdUser=root
AuthType=auth/munge
StateSaveLocation=/var/log/slurm
SwitchType=switch/none
TaskPlugin=task/none,task/affinity,task/cgroup
TaskPluginParam=none
DebugFlags=NO_CONF_HASH,Backfill,BackfillMap,SelectType,Steps,TraceJobs
*JobSubmitPlugins=lua*
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
SchedulerParameters=max_script_size=20971520
*EnforcePartLimits=ALL*
CoreSpecPlugin=core_spec/none
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreFlags=job_comment
JobCompType=jobcomp/filetxt
JobCompLoc=/var/log/slurm/job_completions
ClusterName=my_cluster
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=5
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=5
SlurmdLogFile=/var/log/slurmd.log
AccountingStorageEnforce=limits
AccountingStorageHost=my_server
NodeName=clus[01-06] CPUs=12 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=128387 TmpDisk=81880 Feature=big-mem
NodeName=clus[07-12] CPUs=12 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=15491 TmpDisk=81880 Feature=small-mem
NodeName=clus-login CPUs=4 SocketsPerBoard=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=15886 TmpDisk=30705
*PartitionName=nodo.q Nodes=clus[01-12] Default=YES MaxTime=04:00:00 State=UP AllocNodes=clus-login,clus05 MaxCPUsPerNode=12*
KillOnBadExit=1
OverTimeLimit=30 # if the job runs more than 30 minutes past the maximum time (2 hours), it is cancelled
TCPTimeout=5
PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0
PriorityCalcPeriod=5
PriorityUsageResetPeriod=QUARTERLY
PriorityFavorSmall=NO
PriorityMaxAge=7-0
PriorityWeightAge=10000
PriorityWeightFairshare=1000000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=0
PropagateResourceLimitsExcept=MEMLOCK
And my test script is this:
#!/bin/bash
*#SBATCH --time=5-00:00:00*
srun /bin/hostname
date
sleep 50
date
Why is my job being submitted into the queue instead of being refused BEFORE being queued?
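For reference, the requested walltime in the test script is far over the partition limit, which is exactly the kind of violation EnforcePartLimits=ALL is meant to reject at submit time. A small sketch of the comparison, using the time specs from this thread (the conversion helper is illustrative shell, not Slurm code):

```shell
#!/bin/sh
# Convert a Slurm time spec ([days-]HH:MM:SS) to minutes and compare the
# job's request against the partition limit. Values from the thread:
# the script asks for 5-00:00:00; partition nodo.q has MaxTime=04:00:00.
to_minutes() {
    spec=$1
    days=0
    case $spec in
        *-*) days=${spec%%-*}; spec=${spec#*-} ;;
    esac
    IFS=: read -r h m s <<EOF
$spec
EOF
    echo $(( days * 1440 + h * 60 + m ))
}

requested=$(to_minutes 5-00:00:00)   # 7200 minutes
limit=$(to_minutes 04:00:00)         # 240 minutes
echo "requested=$requested limit=$limit"
```

With a working EnforcePartLimits=ALL, sbatch should refuse this script immediately instead of printing a job ID (the exact error text varies by Slurm version).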