[slurm-users] Virtual memory size requested by slurm
Sean Maxwell
stm at case.edu
Tue Jan 28 12:33:37 UTC 2020
Hi Mahmood,
If you want the virtual memory size to be unrestricted by slurm, set
VSizeFactor to 0 in slurm.conf, which according to the documentation
disables virtual memory limit enforcement.
https://slurm.schedmd.com/slurm.conf.html#OPT_VSizeFactor
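For example, a minimal sketch of the change, assuming slurm.conf is the
/etc/slurm/slurm.conf shown in your config dump:

    # /etc/slurm/slurm.conf
    # 0 disables virtual memory limit enforcement entirely
    VSizeFactor=0

After editing, "scontrol reconfigure" (or restarting slurmctld and the
slurmd daemons) should make the change take effect, and you can verify it
with "scontrol show config | grep VSizeFactor".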
-Sean
On Mon, Jan 27, 2020 at 11:47 PM Mahmood Naderan <mahmood.nt at gmail.com>
wrote:
> >This line is probably what is limiting you to around 40gb.
>
> >#SBATCH --mem=38GB
>
> Yes. If I change that value, the "ulimit -v" limit also changes. See below:
>
> [shams at hpc ~]$ cat slurm_blast.sh | grep mem
> #SBATCH --mem=50GB
> [shams at hpc ~]$ cat my_blast.log
> virtual memory (kbytes, -v) 57671680
> /var/spool/slurmd/job00306/slurm_script: line 13: ulimit: virtual memory:
> cannot modify limit: Operation not permitted
> virtual memory (kbytes, -v) 57671680
> Error memory mapping:/home/shams/ncbi-blast-2.9.0+/bin/nr.69.psq
> openedFilesCount=168 threadID=0
> Error: NCBI C++ Exception:
>
>
> However, the solution is not to change that parameter. There are two issues
> with that:
>
> 1) --mem refers to the physical memory that is requested by the job and is
> later reserved for the job by slurm. So, on a 64GB node, if a user requests
> --mem=50GB, no one else can actually run a job that needs 10GB of memory.
>
> 2) The virtual size of the program, according to top, is about 140GB. So, if
> I set --mem=140GB, the job gets stuck in the queue because the requested
> resources are invalid (the node has only 64GB of memory).
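> (For reference, the real memory slurm has configured for each node can be
> checked with, for example, "sinfo -N -l" or "scontrol show node <nodename>",
> where <nodename> is a placeholder for the actual node name.)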
>
>
> I really think there is a problem with slurm, but I cannot find the root of
> the problem. The slurm config parameters are:
>
> Configuration data as of 2020-01-28T08:04:55
> AccountingStorageBackupHost = (null)
> AccountingStorageEnforce = associations,limits,qos,safe,wckeys
> AccountingStorageHost = hpc
> AccountingStorageLoc = N/A
> AccountingStoragePort = 6819
> AccountingStorageTRES = cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu
> AccountingStorageType = accounting_storage/slurmdbd
> AccountingStorageUser = N/A
> AccountingStoreJobComment = Yes
> AcctGatherEnergyType = acct_gather_energy/none
> AcctGatherFilesystemType = acct_gather_filesystem/none
> AcctGatherInterconnectType = acct_gather_interconnect/none
> AcctGatherNodeFreq = 0 sec
> AcctGatherProfileType = acct_gather_profile/none
> AllowSpecResourcesUsage = 0
> AuthAltTypes = (null)
> AuthInfo = (null)
> AuthType = auth/munge
> BatchStartTimeout = 10 sec
> BOOT_TIME = 2020-01-27T09:53:58
> BurstBufferType = (null)
> CheckpointType = checkpoint/none
> CliFilterPlugins = (null)
> ClusterName = jupiter
> CommunicationParameters = (null)
> CompleteWait = 0 sec
> CoreSpecPlugin = core_spec/none
> CpuFreqDef = Unknown
> CpuFreqGovernors = Performance,OnDemand,UserSpace
> CredType = cred/munge
> DebugFlags = Backfill,BackfillMap,NO_CONF_HASH,Priority
> DefMemPerNode = UNLIMITED
> DisableRootJobs = No
> EioTimeout = 60
> EnforcePartLimits = NO
> Epilog = (null)
> EpilogMsgTime = 2000 usec
> EpilogSlurmctld = (null)
> ExtSensorsType = ext_sensors/none
> ExtSensorsFreq = 0 sec
> FairShareDampeningFactor = 5
> FastSchedule = 0
> FederationParameters = (null)
> FirstJobId = 1
> GetEnvTimeout = 2 sec
> GresTypes = gpu
> GpuFreqDef = high,memory=high
> GroupUpdateForce = 1
> GroupUpdateTime = 600 sec
> HASH_VAL = Match
> HealthCheckInterval = 0 sec
> HealthCheckNodeState = ANY
> HealthCheckProgram = (null)
> InactiveLimit = 30 sec
> JobAcctGatherFrequency = 30
> JobAcctGatherType = jobacct_gather/linux
> JobAcctGatherParams = (null)
> JobCheckpointDir = /var/spool/slurm.checkpoint
> JobCompHost = hpc
> JobCompLoc = /var/log/slurm_jobcomp.log
> JobCompPort = 0
> JobCompType = jobcomp/none
> JobCompUser = root
> JobContainerType = job_container/none
> JobCredentialPrivateKey = (null)
> JobCredentialPublicCertificate = (null)
> JobDefaults = (null)
> JobFileAppend = 0
> JobRequeue = 1
> JobSubmitPlugins = (null)
> KeepAliveTime = SYSTEM_DEFAULT
> KillOnBadExit = 0
> KillWait = 60 sec
> LaunchParameters = (null)
> LaunchType = launch/slurm
> Layouts =
> Licenses = (null)
> LicensesUsed = (null)
> LogTimeFormat = iso8601_ms
> MailDomain = (null)
> MailProg = /bin/mail
> MaxArraySize = 1001
> MaxJobCount = 10000
> MaxJobId = 67043328
> MaxMemPerNode = UNLIMITED
> MaxStepCount = 40000
> MaxTasksPerNode = 512
> MCSPlugin = mcs/none
> MCSParameters = (null)
> MessageTimeout = 10 sec
> MinJobAge = 300 sec
> MpiDefault = none
> MpiParams = (null)
> MsgAggregationParams = (null)
> NEXT_JOB_ID = 305
> NodeFeaturesPlugins = (null)
> OverTimeLimit = 0 min
> PluginDir = /usr/lib64/slurm
> PlugStackConfig = /etc/slurm/plugstack.conf
> PowerParameters = (null)
> PowerPlugin =
> PreemptMode = OFF
> PreemptType = preempt/none
> PreemptExemptTime = 00:00:00
> PriorityParameters = (null)
> PrioritySiteFactorParameters = (null)
> PrioritySiteFactorPlugin = (null)
> PriorityDecayHalfLife = 14-00:00:00
> PriorityCalcPeriod = 00:05:00
> PriorityFavorSmall = No
> PriorityFlags =
> PriorityMaxAge = 1-00:00:00
> PriorityUsageResetPeriod = NONE
> PriorityType = priority/multifactor
> PriorityWeightAge = 10
> PriorityWeightAssoc = 0
> PriorityWeightFairShare = 10000
> PriorityWeightJobSize = 100
> PriorityWeightPartition = 10000
> PriorityWeightQOS = 0
> PriorityWeightTRES = cpu=2000,mem=1,gres/gpu=400
> PrivateData = none
> ProctrackType = proctrack/linuxproc
> Prolog = (null)
> PrologEpilogTimeout = 65534
> PrologSlurmctld = (null)
> PrologFlags = (null)
> PropagatePrioProcess = 0
> PropagateResourceLimits = ALL
> PropagateResourceLimitsExcept = (null)
> RebootProgram = (null)
> ReconfigFlags = (null)
> RequeueExit = (null)
> RequeueExitHold = (null)
> ResumeFailProgram = (null)
> ResumeProgram = /etc/slurm/resumehost.sh
> ResumeRate = 4 nodes/min
> ResumeTimeout = 450 sec
> ResvEpilog = (null)
> ResvOverRun = 0 min
> ResvProlog = (null)
> ReturnToService = 2
> RoutePlugin = route/default
> SallocDefaultCommand = (null)
> SbcastParameters = (null)
> SchedulerParameters = (null)
> SchedulerTimeSlice = 30 sec
> SchedulerType = sched/backfill
> SelectType = select/cons_res
> SelectTypeParameters = CR_CORE_MEMORY
> SlurmUser = root(0)
> SlurmctldAddr = (null)
> SlurmctldDebug = info
> SlurmctldHost[0] = hpc(10.1.1.1)
> SlurmctldLogFile = /var/log/slurm/slurmctld.log
> SlurmctldPort = 6817
> SlurmctldSyslogDebug = unknown
> SlurmctldPrimaryOffProg = (null)
> SlurmctldPrimaryOnProg = (null)
> SlurmctldTimeout = 300 sec
> SlurmctldParameters = (null)
> SlurmdDebug = info
> SlurmdLogFile = /var/log/slurm/slurmd.log
> SlurmdParameters = (null)
> SlurmdPidFile = /var/run/slurmd.pid
> SlurmdPort = 6818
> SlurmdSpoolDir = /var/spool/slurmd
> SlurmdSyslogDebug = unknown
> SlurmdTimeout = 300 sec
> SlurmdUser = root(0)
> SlurmSchedLogFile = (null)
> SlurmSchedLogLevel = 0
> SlurmctldPidFile = /var/run/slurmctld.pid
> SlurmctldPlugstack = (null)
> SLURM_CONF = /etc/slurm/slurm.conf
> SLURM_VERSION = 19.05.2
> SrunEpilog = (null)
> SrunPortRange = 0-0
> SrunProlog = (null)
> StateSaveLocation = /var/spool/slurm.state
> SuspendExcNodes = (null)
> SuspendExcParts = (null)
> SuspendProgram = /etc/slurm/suspendhost.sh
> SuspendRate = 4 nodes/min
> SuspendTime = NONE
> SuspendTimeout = 45 sec
> SwitchType = switch/none
> TaskEpilog = (null)
> TaskPlugin = task/affinity
> TaskPluginParam = (null type)
> TaskProlog = (null)
> TCPTimeout = 2 sec
> TmpFS = /state/partition1
> TopologyParam = (null)
> TopologyPlugin = topology/none
> TrackWCKey = Yes
> TreeWidth = 50
> UsePam = 0
> UnkillableStepProgram = (null)
> UnkillableStepTimeout = 60 sec
> VSizeFactor = 110 percent
> WaitTime = 60 sec
> X11Parameters = (null)
>
>
> Regards,
> Mahmood