[slurm-users] Virtual memory size requested by slurm

Sean Maxwell stm at case.edu
Tue Jan 28 12:33:37 UTC 2020


Hi Mahmood,

If you want the virtual memory size to be unrestricted by slurm, set
VSizeFactor to 0 in slurm.conf, which, according to the documentation,
disables virtual memory limit enforcement.

https://slurm.schedmd.com/slurm.conf.html#OPT_VSizeFactor
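
For example (a minimal sketch, assuming the slurm.conf in question is the
/etc/slurm/slurm.conf shown in your config dump):

    # /etc/slurm/slurm.conf
    # 0 disables virtual memory limit enforcement entirely
    VSizeFactor=0

After changing it you will likely need to re-read the config (e.g.
"scontrol reconfigure") or restart the slurm daemons for the new value to
take effect.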

-Sean

On Mon, Jan 27, 2020 at 11:47 PM Mahmood Naderan <mahmood.nt at gmail.com>
wrote:

> >This line is probably what is limiting you to around 40GB.
>
> >#SBATCH --mem=38GB
>
> Yes. If I change that value, the "ulimit -v" also changes. See below
>
> [shams at hpc ~]$ cat slurm_blast.sh | grep mem
> #SBATCH --mem=50GB
> [shams at hpc ~]$ cat my_blast.log
> virtual memory          (kbytes, -v) 57671680
> /var/spool/slurmd/job00306/slurm_script: line 13: ulimit: virtual memory:
> cannot modify limit: Operation not permitted
> virtual memory          (kbytes, -v) 57671680
> Error memory mapping:/home/shams/ncbi-blast-2.9.0+/bin/nr.69.psq
> openedFilesCount=168 threadID=0
> Error: NCBI C++ Exception:
>
>
> However, the solution is not to change that parameter. There are two issues
> with that:
>
> 1) --mem refers to the physical memory requested by the job, which is then
> reserved for the job by slurm.
> So, on a 64GB node, if a user requests --mem=50GB, nobody else can
> run a job that needs 10GB of memory.
>
> 2) The virtual size of the program (according to top) is about 140GB.
> So, if I set --mem=140GB, the job gets stuck in the queue because the
> requested amount is invalid (the node has 64GB of memory).
>
>
> I really think there is a problem with slurm, but I cannot find the root of
> the problem. The slurm config parameters are:
>
> Configuration data as of 2020-01-28T08:04:55
> AccountingStorageBackupHost = (null)
> AccountingStorageEnforce = associations,limits,qos,safe,wckeys
> AccountingStorageHost   = hpc
> AccountingStorageLoc    = N/A
> AccountingStoragePort   = 6819
> AccountingStorageTRES   =
> cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu
> AccountingStorageType   = accounting_storage/slurmdbd
> AccountingStorageUser   = N/A
> AccountingStoreJobComment = Yes
> AcctGatherEnergyType    = acct_gather_energy/none
> AcctGatherFilesystemType = acct_gather_filesystem/none
> AcctGatherInterconnectType = acct_gather_interconnect/none
> AcctGatherNodeFreq      = 0 sec
> AcctGatherProfileType   = acct_gather_profile/none
> AllowSpecResourcesUsage = 0
> AuthAltTypes            = (null)
> AuthInfo                = (null)
> AuthType                = auth/munge
> BatchStartTimeout       = 10 sec
> BOOT_TIME               = 2020-01-27T09:53:58
> BurstBufferType         = (null)
> CheckpointType          = checkpoint/none
> CliFilterPlugins        = (null)
> ClusterName             = jupiter
> CommunicationParameters = (null)
> CompleteWait            = 0 sec
> CoreSpecPlugin          = core_spec/none
> CpuFreqDef              = Unknown
> CpuFreqGovernors        = Performance,OnDemand,UserSpace
> CredType                = cred/munge
> DebugFlags              = Backfill,BackfillMap,NO_CONF_HASH,Priority
> DefMemPerNode           = UNLIMITED
> DisableRootJobs         = No
> EioTimeout              = 60
> EnforcePartLimits       = NO
> Epilog                  = (null)
> EpilogMsgTime           = 2000 usec
> EpilogSlurmctld         = (null)
> ExtSensorsType          = ext_sensors/none
> ExtSensorsFreq          = 0 sec
> FairShareDampeningFactor = 5
> FastSchedule            = 0
> FederationParameters    = (null)
> FirstJobId              = 1
> GetEnvTimeout           = 2 sec
> GresTypes               = gpu
> GpuFreqDef              = high,memory=high
> GroupUpdateForce        = 1
> GroupUpdateTime         = 600 sec
> HASH_VAL                = Match
> HealthCheckInterval     = 0 sec
> HealthCheckNodeState    = ANY
> HealthCheckProgram      = (null)
> InactiveLimit           = 30 sec
> JobAcctGatherFrequency  = 30
> JobAcctGatherType       = jobacct_gather/linux
> JobAcctGatherParams     = (null)
> JobCheckpointDir        = /var/spool/slurm.checkpoint
> JobCompHost             = hpc
> JobCompLoc              = /var/log/slurm_jobcomp.log
> JobCompPort             = 0
> JobCompType             = jobcomp/none
> JobCompUser             = root
> JobContainerType        = job_container/none
> JobCredentialPrivateKey = (null)
> JobCredentialPublicCertificate = (null)
> JobDefaults             = (null)
> JobFileAppend           = 0
> JobRequeue              = 1
> JobSubmitPlugins        = (null)
> KeepAliveTime           = SYSTEM_DEFAULT
> KillOnBadExit           = 0
> KillWait                = 60 sec
> LaunchParameters        = (null)
> LaunchType              = launch/slurm
> Layouts                 =
> Licenses                = (null)
> LicensesUsed            = (null)
> LogTimeFormat           = iso8601_ms
> MailDomain              = (null)
> MailProg                = /bin/mail
> MaxArraySize            = 1001
> MaxJobCount             = 10000
> MaxJobId                = 67043328
> MaxMemPerNode           = UNLIMITED
> MaxStepCount            = 40000
> MaxTasksPerNode         = 512
> MCSPlugin               = mcs/none
> MCSParameters           = (null)
> MessageTimeout          = 10 sec
> MinJobAge               = 300 sec
> MpiDefault              = none
> MpiParams               = (null)
> MsgAggregationParams    = (null)
> NEXT_JOB_ID             = 305
> NodeFeaturesPlugins     = (null)
> OverTimeLimit           = 0 min
> PluginDir               = /usr/lib64/slurm
> PlugStackConfig         = /etc/slurm/plugstack.conf
> PowerParameters         = (null)
> PowerPlugin             =
> PreemptMode             = OFF
> PreemptType             = preempt/none
> PreemptExemptTime       = 00:00:00
> PriorityParameters      = (null)
> PrioritySiteFactorParameters = (null)
> PrioritySiteFactorPlugin = (null)
> PriorityDecayHalfLife   = 14-00:00:00
> PriorityCalcPeriod      = 00:05:00
> PriorityFavorSmall      = No
> PriorityFlags           =
> PriorityMaxAge          = 1-00:00:00
> PriorityUsageResetPeriod = NONE
> PriorityType            = priority/multifactor
> PriorityWeightAge       = 10
> PriorityWeightAssoc     = 0
> PriorityWeightFairShare = 10000
> PriorityWeightJobSize   = 100
> PriorityWeightPartition = 10000
> PriorityWeightQOS       = 0
> PriorityWeightTRES      = cpu=2000,mem=1,gres/gpu=400
> PrivateData             = none
> ProctrackType           = proctrack/linuxproc
> Prolog                  = (null)
> PrologEpilogTimeout     = 65534
> PrologSlurmctld         = (null)
> PrologFlags             = (null)
> PropagatePrioProcess    = 0
> PropagateResourceLimits = ALL
> PropagateResourceLimitsExcept = (null)
> RebootProgram           = (null)
> ReconfigFlags           = (null)
> RequeueExit             = (null)
> RequeueExitHold         = (null)
> ResumeFailProgram       = (null)
> ResumeProgram           = /etc/slurm/resumehost.sh
> ResumeRate              = 4 nodes/min
> ResumeTimeout           = 450 sec
> ResvEpilog              = (null)
> ResvOverRun             = 0 min
> ResvProlog              = (null)
> ReturnToService         = 2
> RoutePlugin             = route/default
> SallocDefaultCommand    = (null)
> SbcastParameters        = (null)
> SchedulerParameters     = (null)
> SchedulerTimeSlice      = 30 sec
> SchedulerType           = sched/backfill
> SelectType              = select/cons_res
> SelectTypeParameters    = CR_CORE_MEMORY
> SlurmUser               = root(0)
> SlurmctldAddr           = (null)
> SlurmctldDebug          = info
> SlurmctldHost[0]        = hpc(10.1.1.1)
> SlurmctldLogFile        = /var/log/slurm/slurmctld.log
> SlurmctldPort           = 6817
> SlurmctldSyslogDebug    = unknown
> SlurmctldPrimaryOffProg = (null)
> SlurmctldPrimaryOnProg  = (null)
> SlurmctldTimeout        = 300 sec
> SlurmctldParameters     = (null)
> SlurmdDebug             = info
> SlurmdLogFile           = /var/log/slurm/slurmd.log
> SlurmdParameters        = (null)
> SlurmdPidFile           = /var/run/slurmd.pid
> SlurmdPort              = 6818
> SlurmdSpoolDir          = /var/spool/slurmd
> SlurmdSyslogDebug       = unknown
> SlurmdTimeout           = 300 sec
> SlurmdUser              = root(0)
> SlurmSchedLogFile       = (null)
> SlurmSchedLogLevel      = 0
> SlurmctldPidFile        = /var/run/slurmctld.pid
> SlurmctldPlugstack      = (null)
> SLURM_CONF              = /etc/slurm/slurm.conf
> SLURM_VERSION           = 19.05.2
> SrunEpilog              = (null)
> SrunPortRange           = 0-0
> SrunProlog              = (null)
> StateSaveLocation       = /var/spool/slurm.state
> SuspendExcNodes         = (null)
> SuspendExcParts         = (null)
> SuspendProgram          = /etc/slurm/suspendhost.sh
> SuspendRate             = 4 nodes/min
> SuspendTime             = NONE
> SuspendTimeout          = 45 sec
> SwitchType              = switch/none
> TaskEpilog              = (null)
> TaskPlugin              = task/affinity
> TaskPluginParam         = (null type)
> TaskProlog              = (null)
> TCPTimeout              = 2 sec
> TmpFS                   = /state/partition1
> TopologyParam           = (null)
> TopologyPlugin          = topology/none
> TrackWCKey              = Yes
> TreeWidth               = 50
> UsePam                  = 0
> UnkillableStepProgram   = (null)
> UnkillableStepTimeout   = 60 sec
> VSizeFactor             = 110 percent
> WaitTime                = 60 sec
> X11Parameters           = (null)
>
>
> Regards,
> Mahmood
>
>
>
>
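
Looking at the numbers above: the 57671680 kB limit in your log is exactly
what the "VSizeFactor = 110 percent" in your config dump produces from
--mem=50GB. A quick sanity check, assuming slurm treats 50GB as
50*1024*1024 kB:

    --mem=50GB          -> 50 * 1024 * 1024 = 52428800 kB
    52428800 kB * 110%  -> 57671680 kB   (the "ulimit -v" value in the log)

Since the virtual memory cap is derived from --mem, a process with roughly
140GB of virtual size can never fit under it on a 64GB node, which is why
disabling the cap with VSizeFactor=0 is more practical than raising --mem.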