[slurm-users] Virtual memory size requested by slurm

Mahmood Naderan mahmood.nt at gmail.com
Tue Jan 28 04:45:34 UTC 2020


>This line is probably what is limiting you to around 40gb.

>#SBATCH --mem=38GB

Yes. If I change that value, the "ulimit -v" limit also changes. See below:

[shams@hpc ~]$ cat slurm_blast.sh | grep mem
#SBATCH --mem=50GB
[shams@hpc ~]$ cat my_blast.log
virtual memory          (kbytes, -v) 57671680
/var/spool/slurmd/job00306/slurm_script: line 13: ulimit: virtual memory: cannot modify limit: Operation not permitted
virtual memory          (kbytes, -v) 57671680
Error memory mapping:/home/shams/ncbi-blast-2.9.0+/bin/nr.69.psq
openedFilesCount=168 threadID=0
Error: NCBI C++ Exception:
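
If I read the numbers right, that limit is exactly 110% of the requested
--mem (in kB), which would match the VSizeFactor = 110 percent setting shown
in the config further down. A quick sanity check of the arithmetic:

$ echo $((50 * 1024 * 1024 * 110 / 100))
57671680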


However, the solution is not simply to change that parameter. There are two
issues with that:

1) --mem refers to the physical memory requested by the job, which Slurm then
reserves for it.
So, on a 64GB node, if a user requests --mem=50GB, in practice no one else can
run even a job that only needs 10GB of memory.

2) The virtual size of the program, according to top, is about 140GB.
So, if I set --mem=140GB, the job gets stuck in the queue because the requested
resources are invalid (the node has 64GB of memory). A quick check of both
points is sketched below.
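
To make the two points concrete, this is roughly the check I have in mind
(the node name below is just a placeholder, not one of our hostnames):

$ scontrol show node NODENAME | grep -Eo 'RealMemory=[0-9]+|AllocMem=[0-9]+'
# whatever --mem asks for is counted against RealMemory and reserved for the
# job whether it is actually used or not, and a --mem larger than RealMemory
# (e.g. --mem=140GB on a 64GB node) can never be scheduled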


I really think there is a problem with Slurm, but I cannot find the root of
the problem. The Slurm config parameters are:

Configuration data as of 2020-01-28T08:04:55
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits,qos,safe,wckeys
AccountingStorageHost   = hpc
AccountingStorageLoc    = N/A
AccountingStoragePort   = 6819
AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType    = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq      = 0 sec
AcctGatherProfileType   = acct_gather_profile/none
AllowSpecResourcesUsage = 0
AuthAltTypes            = (null)
AuthInfo                = (null)
AuthType                = auth/munge
BatchStartTimeout       = 10 sec
BOOT_TIME               = 2020-01-27T09:53:58
BurstBufferType         = (null)
CheckpointType          = checkpoint/none
CliFilterPlugins        = (null)
ClusterName             = jupiter
CommunicationParameters = (null)
CompleteWait            = 0 sec
CoreSpecPlugin          = core_spec/none
CpuFreqDef              = Unknown
CpuFreqGovernors        = Performance,OnDemand,UserSpace
CredType                = cred/munge
DebugFlags              = Backfill,BackfillMap,NO_CONF_HASH,Priority
DefMemPerNode           = UNLIMITED
DisableRootJobs         = No
EioTimeout              = 60
EnforcePartLimits       = NO
Epilog                  = (null)
EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
ExtSensorsType          = ext_sensors/none
ExtSensorsFreq          = 0 sec
FairShareDampeningFactor = 5
FastSchedule            = 0
FederationParameters    = (null)
FirstJobId              = 1
GetEnvTimeout           = 2 sec
GresTypes               = gpu
GpuFreqDef              = high,memory=high
GroupUpdateForce        = 1
GroupUpdateTime         = 600 sec
HASH_VAL                = Match
HealthCheckInterval     = 0 sec
HealthCheckNodeState    = ANY
HealthCheckProgram      = (null)
InactiveLimit           = 30 sec
JobAcctGatherFrequency  = 30
JobAcctGatherType       = jobacct_gather/linux
JobAcctGatherParams     = (null)
JobCheckpointDir        = /var/spool/slurm.checkpoint
JobCompHost             = hpc
JobCompLoc              = /var/log/slurm_jobcomp.log
JobCompPort             = 0
JobCompType             = jobcomp/none
JobCompUser             = root
JobContainerType        = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobDefaults             = (null)
JobFileAppend           = 0
JobRequeue              = 1
JobSubmitPlugins        = (null)
KeepAliveTime           = SYSTEM_DEFAULT
KillOnBadExit           = 0
KillWait                = 60 sec
LaunchParameters        = (null)
LaunchType              = launch/slurm
Layouts                 =
Licenses                = (null)
LicensesUsed            = (null)
LogTimeFormat           = iso8601_ms
MailDomain              = (null)
MailProg                = /bin/mail
MaxArraySize            = 1001
MaxJobCount             = 10000
MaxJobId                = 67043328
MaxMemPerNode           = UNLIMITED
MaxStepCount            = 40000
MaxTasksPerNode         = 512
MCSPlugin               = mcs/none
MCSParameters           = (null)
MessageTimeout          = 10 sec
MinJobAge               = 300 sec
MpiDefault              = none
MpiParams               = (null)
MsgAggregationParams    = (null)
NEXT_JOB_ID             = 305
NodeFeaturesPlugins     = (null)
OverTimeLimit           = 0 min
PluginDir               = /usr/lib64/slurm
PlugStackConfig         = /etc/slurm/plugstack.conf
PowerParameters         = (null)
PowerPlugin             =
PreemptMode             = OFF
PreemptType             = preempt/none
PreemptExemptTime       = 00:00:00
PriorityParameters      = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityDecayHalfLife   = 14-00:00:00
PriorityCalcPeriod      = 00:05:00
PriorityFavorSmall      = No
PriorityFlags           =
PriorityMaxAge          = 1-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType            = priority/multifactor
PriorityWeightAge       = 10
PriorityWeightAssoc     = 0
PriorityWeightFairShare = 10000
PriorityWeightJobSize   = 100
PriorityWeightPartition = 10000
PriorityWeightQOS       = 0
PriorityWeightTRES      = cpu=2000,mem=1,gres/gpu=400
PrivateData             = none
ProctrackType           = proctrack/linuxproc
Prolog                  = (null)
PrologEpilogTimeout     = 65534
PrologSlurmctld         = (null)
PrologFlags             = (null)
PropagatePrioProcess    = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram           = (null)
ReconfigFlags           = (null)
RequeueExit             = (null)
RequeueExitHold         = (null)
ResumeFailProgram       = (null)
ResumeProgram           = /etc/slurm/resumehost.sh
ResumeRate              = 4 nodes/min
ResumeTimeout           = 450 sec
ResvEpilog              = (null)
ResvOverRun             = 0 min
ResvProlog              = (null)
ReturnToService         = 2
RoutePlugin             = route/default
SallocDefaultCommand    = (null)
SbcastParameters        = (null)
SchedulerParameters     = (null)
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE_MEMORY
SlurmUser               = root(0)
SlurmctldAddr           = (null)
SlurmctldDebug          = info
SlurmctldHost[0]        = hpc(10.1.1.1)
SlurmctldLogFile        = /var/log/slurm/slurmctld.log
SlurmctldPort           = 6817
SlurmctldSyslogDebug    = unknown
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg  = (null)
SlurmctldTimeout        = 300 sec
SlurmctldParameters     = (null)
SlurmdDebug             = info
SlurmdLogFile           = /var/log/slurm/slurmd.log
SlurmdParameters        = (null)
SlurmdPidFile           = /var/run/slurmd.pid
SlurmdPort              = 6818
SlurmdSpoolDir          = /var/spool/slurmd
SlurmdSyslogDebug       = unknown
SlurmdTimeout           = 300 sec
SlurmdUser              = root(0)
SlurmSchedLogFile       = (null)
SlurmSchedLogLevel      = 0
SlurmctldPidFile        = /var/run/slurmctld.pid
SlurmctldPlugstack      = (null)
SLURM_CONF              = /etc/slurm/slurm.conf
SLURM_VERSION           = 19.05.2
SrunEpilog              = (null)
SrunPortRange           = 0-0
SrunProlog              = (null)
StateSaveLocation       = /var/spool/slurm.state
SuspendExcNodes         = (null)
SuspendExcParts         = (null)
SuspendProgram          = /etc/slurm/suspendhost.sh
SuspendRate             = 4 nodes/min
SuspendTime             = NONE
SuspendTimeout          = 45 sec
SwitchType              = switch/none
TaskEpilog              = (null)
TaskPlugin              = task/affinity
TaskPluginParam         = (null type)
TaskProlog              = (null)
TCPTimeout              = 2 sec
TmpFS                   = /state/partition1
TopologyParam           = (null)
TopologyPlugin          = topology/none
TrackWCKey              = Yes
TreeWidth               = 50
UsePam                  = 0
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 60 sec
VSizeFactor             = 110 percent
WaitTime                = 60 sec
X11Parameters           = (null)


Regards,
Mahmood