[slurm-users] Virtual memory size requested by slurm
Mahmood Naderan
mahmood.nt at gmail.com
Tue Jan 28 04:45:34 UTC 2020
>This line is probably what is limiting you to around 40gb.
>#SBATCH --mem=38GB
Yes. If I change that value, the "ulimit -v" limit also changes. See below:
[shams at hpc ~]$ cat slurm_blast.sh | grep mem
#SBATCH --mem=50GB
[shams at hpc ~]$ cat my_blast.log
virtual memory (kbytes, -v) 57671680
/var/spool/slurmd/job00306/slurm_script: line 13: ulimit: virtual memory: cannot modify limit: Operation not permitted
virtual memory (kbytes, -v) 57671680
Error memory mapping:/home/shams/ncbi-blast-2.9.0+/bin/nr.69.psq
openedFilesCount=168 threadID=0
Error: NCBI C++ Exception:
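For what it's worth, the limit reported above looks like --mem scaled by the
VSizeFactor shown in the config further below (110 percent). A quick arithmetic
check, assuming that relationship holds:

# 50GB requested = 50 * 1024 * 1024 KB = 52428800 KB
# scaled by VSizeFactor (110%): 52428800 * 110 / 100 = 57671680 KB
echo $(( 50 * 1024 * 1024 * 110 / 100 ))   # prints 57671680, matching the log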
However, the solution is not simply to raise that parameter. There are two
issues with that:
1) --mem refers to the physical memory requested by the job, which Slurm then
reserves for it. So, on a 64GB node, if a user requests --mem=50GB, effectively
no one else can run a job that needs 10GB of memory.
2) The virtual size of the program, according to top, is about 140GB. So, if I
set --mem=140GB, the job gets stuck in the queue because the request is invalid
(the node has only 64GB of memory). See the quick check after this list.
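As a quick sanity check (the node name "compute-0-0" below is only a
placeholder), the memory the scheduler believes the node has, and the config
knobs that affect memory limits, can be compared against the request:

# physical memory Slurm schedules against on this node
scontrol show node compute-0-0 | grep -i RealMemory
# config parameters that affect memory and virtual memory limits
scontrol show config | grep -E 'VSizeFactor|MaxMemPerNode|DefMemPerNode'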
I really think there is a problem with Slurm, but I cannot find the root of it.
The Slurm config parameters are:
Configuration data as of 2020-01-28T08:04:55
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits,qos,safe,wckeys
AccountingStorageHost = hpc
AccountingStorageLoc = N/A
AccountingStoragePort = 6819
AccountingStorageTRES = cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq = 0 sec
AcctGatherProfileType = acct_gather_profile/none
AllowSpecResourcesUsage = 0
AuthAltTypes = (null)
AuthInfo = (null)
AuthType = auth/munge
BatchStartTimeout = 10 sec
BOOT_TIME = 2020-01-27T09:53:58
BurstBufferType = (null)
CheckpointType = checkpoint/none
CliFilterPlugins = (null)
ClusterName = jupiter
CommunicationParameters = (null)
CompleteWait = 0 sec
CoreSpecPlugin = core_spec/none
CpuFreqDef = Unknown
CpuFreqGovernors = Performance,OnDemand,UserSpace
CredType = cred/munge
DebugFlags = Backfill,BackfillMap,NO_CONF_HASH,Priority
DefMemPerNode = UNLIMITED
DisableRootJobs = No
EioTimeout = 60
EnforcePartLimits = NO
Epilog = (null)
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
ExtSensorsType = ext_sensors/none
ExtSensorsFreq = 0 sec
FairShareDampeningFactor = 5
FastSchedule = 0
FederationParameters = (null)
FirstJobId = 1
GetEnvTimeout = 2 sec
GresTypes = gpu
GpuFreqDef = high,memory=high
GroupUpdateForce = 1
GroupUpdateTime = 600 sec
HASH_VAL = Match
HealthCheckInterval = 0 sec
HealthCheckNodeState = ANY
HealthCheckProgram = (null)
InactiveLimit = 30 sec
JobAcctGatherFrequency = 30
JobAcctGatherType = jobacct_gather/linux
JobAcctGatherParams = (null)
JobCheckpointDir = /var/spool/slurm.checkpoint
JobCompHost = hpc
JobCompLoc = /var/log/slurm_jobcomp.log
JobCompPort = 0
JobCompType = jobcomp/none
JobCompUser = root
JobContainerType = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobDefaults = (null)
JobFileAppend = 0
JobRequeue = 1
JobSubmitPlugins = (null)
KeepAliveTime = SYSTEM_DEFAULT
KillOnBadExit = 0
KillWait = 60 sec
LaunchParameters = (null)
LaunchType = launch/slurm
Layouts =
Licenses = (null)
LicensesUsed = (null)
LogTimeFormat = iso8601_ms
MailDomain = (null)
MailProg = /bin/mail
MaxArraySize = 1001
MaxJobCount = 10000
MaxJobId = 67043328
MaxMemPerNode = UNLIMITED
MaxStepCount = 40000
MaxTasksPerNode = 512
MCSPlugin = mcs/none
MCSParameters = (null)
MessageTimeout = 10 sec
MinJobAge = 300 sec
MpiDefault = none
MpiParams = (null)
MsgAggregationParams = (null)
NEXT_JOB_ID = 305
NodeFeaturesPlugins = (null)
OverTimeLimit = 0 min
PluginDir = /usr/lib64/slurm
PlugStackConfig = /etc/slurm/plugstack.conf
PowerParameters = (null)
PowerPlugin =
PreemptMode = OFF
PreemptType = preempt/none
PreemptExemptTime = 00:00:00
PriorityParameters = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityDecayHalfLife = 14-00:00:00
PriorityCalcPeriod = 00:05:00
PriorityFavorSmall = No
PriorityFlags =
PriorityMaxAge = 1-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType = priority/multifactor
PriorityWeightAge = 10
PriorityWeightAssoc = 0
PriorityWeightFairShare = 10000
PriorityWeightJobSize = 100
PriorityWeightPartition = 10000
PriorityWeightQOS = 0
PriorityWeightTRES = cpu=2000,mem=1,gres/gpu=400
PrivateData = none
ProctrackType = proctrack/linuxproc
Prolog = (null)
PrologEpilogTimeout = 65534
PrologSlurmctld = (null)
PrologFlags = (null)
PropagatePrioProcess = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram = (null)
ReconfigFlags = (null)
RequeueExit = (null)
RequeueExitHold = (null)
ResumeFailProgram = (null)
ResumeProgram = /etc/slurm/resumehost.sh
ResumeRate = 4 nodes/min
ResumeTimeout = 450 sec
ResvEpilog = (null)
ResvOverRun = 0 min
ResvProlog = (null)
ReturnToService = 2
RoutePlugin = route/default
SallocDefaultCommand = (null)
SbcastParameters = (null)
SchedulerParameters = (null)
SchedulerTimeSlice = 30 sec
SchedulerType = sched/backfill
SelectType = select/cons_res
SelectTypeParameters = CR_CORE_MEMORY
SlurmUser = root(0)
SlurmctldAddr = (null)
SlurmctldDebug = info
SlurmctldHost[0] = hpc(10.1.1.1)
SlurmctldLogFile = /var/log/slurm/slurmctld.log
SlurmctldPort = 6817
SlurmctldSyslogDebug = unknown
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg = (null)
SlurmctldTimeout = 300 sec
SlurmctldParameters = (null)
SlurmdDebug = info
SlurmdLogFile = /var/log/slurm/slurmd.log
SlurmdParameters = (null)
SlurmdPidFile = /var/run/slurmd.pid
SlurmdPort = 6818
SlurmdSpoolDir = /var/spool/slurmd
SlurmdSyslogDebug = unknown
SlurmdTimeout = 300 sec
SlurmdUser = root(0)
SlurmSchedLogFile = (null)
SlurmSchedLogLevel = 0
SlurmctldPidFile = /var/run/slurmctld.pid
SlurmctldPlugstack = (null)
SLURM_CONF = /etc/slurm/slurm.conf
SLURM_VERSION = 19.05.2
SrunEpilog = (null)
SrunPortRange = 0-0
SrunProlog = (null)
StateSaveLocation = /var/spool/slurm.state
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendProgram = /etc/slurm/suspendhost.sh
SuspendRate = 4 nodes/min
SuspendTime = NONE
SuspendTimeout = 45 sec
SwitchType = switch/none
TaskEpilog = (null)
TaskPlugin = task/affinity
TaskPluginParam = (null type)
TaskProlog = (null)
TCPTimeout = 2 sec
TmpFS = /state/partition1
TopologyParam = (null)
TopologyPlugin = topology/none
TrackWCKey = Yes
TreeWidth = 50
UsePam = 0
UnkillableStepProgram = (null)
UnkillableStepTimeout = 60 sec
VSizeFactor = 110 percent
WaitTime = 60 sec
X11Parameters = (null)
Regards,
Mahmood