Slurm User Group (SLUG) 2024 is set for September 12-13 at the
University of Oslo in Oslo, Norway.
Registration information and a high-level schedule can be found
here: https://slug24.splashthat.com/
The last day to register at the early-bird price is this Friday, May 31st.
Friday is also the deadline to submit a presentation abstract. We do
not intend to extend this deadline.
If you are interested in presenting your own usage, developments, site
report, tutorial, etc. about Slurm, please fill out the following
form: https://forms.gle/N7bFo5EzwuTuKkBN7
Notifications of final presentations accepted will go out by Friday, June 14th.
--
Victoria Hobson
SchedMD LLC
Vice President of Marketing
My organization needs to access historical job records for metric reporting and resource forecasting. slurmdbd is archiving only the job information, which should be sufficient for our numbers, but it is using the default archive script. In retrospect, this data should have been migrated to a secondary MariaDB instance, but that train has left the station.
The format of the archive files is not well documented. Does anyone have a
program (python/C/whatever) that will read a job_table_archive file and decode
it into a parsable structure?
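One workaround I am considering, rather than decoding the pack format
directly, is to load the archive back into a scratch slurmdbd instance and
export the records from there; the file name, dates and field list below are
only placeholders:

# load the archive into a (test) slurmdbd of a compatible version
sacctmgr archive load file=/path/to/job_table_archive_<dates>
# then export the restored records in a parsable form
sacct --allusers --starttime=2020-01-01 --endtime=2023-12-31 --parsable2 \
      --format=JobID,User,Account,Partition,Submit,Start,End,Elapsed,ReqCPUS,ReqMem,State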
Douglas O'Neal, Ph.D. (contractor)
Manager, HPC Systems Administration Group, ITOG
Frederick National Laboratory for Cancer Research
Leidos Biomedical Research, Inc.
Phone: 301-228-4656
Email: Douglas.O'Neal(a)nih.gov
---------- Forwarded message ---------
From: Hermann Schwärzler <hermann.schwaerzler(a)uibk.ac.at>
Date: Tue, May 28, 2024 at 4:10 PM
Subject: Re: [slurm-users] Re: Performance Discrepancy between Slurm
and Direct mpirun for VASP Jobs.
To: Hongyi Zhao <hongyi.zhao(a)gmail.com>
Hi Zhao,
On 5/28/24 03:08, Hongyi Zhao wrote:
[...]
>
> What's the complete content of cli_filter.lua and where should I put this file?
[...]
Below you will find the complete content of our cli_filter.lua.
It has to be put into the same directory as "slurm.conf".
--------------------------------- 8< ---------------------------------
-- see
https://github.com/SchedMD/slurm/blob/master/etc/cli_filter.lua.example
function slurm_cli_pre_submit(options, pack_offset)
    return slurm.SUCCESS
end

function slurm_cli_setup_defaults(options, early_pass)
    -- Make --hint=nomultithread the default behavior;
    -- if users specify another --hint=XX option
    -- it will override the setting done here
    options['hint'] = 'nomultithread'
    return slurm.SUCCESS
end

function slurm_cli_post_submit(offset, job_id, step_id)
    return slurm.SUCCESS
end
--------------------------------- >8 ---------------------------------
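Note that, if I remember correctly, slurm.conf also has to enable the Lua
cli_filter plugin for this file to be picked up, something along the lines of:

CliFilterPlugins=cli_filter/lua

(please check the CliFilterPlugins entry in the slurm.conf man page for the
exact value your version expects).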
Hopefully this helps...
Regards,
Hermann
--
Assoc. Prof. Hongsheng Zhao <hongyi.zhao(a)gmail.com>
Theory and Simulation of Materials
Hebei Vocational University of Technology and Engineering
No. 473, Quannan West Street, Xindu District, Xingtai, Hebei province
Dear Slurm Users,
I am experiencing a significant performance discrepancy when running
the same VASP job through the Slurm scheduler compared to running it
directly with mpirun. I am hoping for some insights or advice on how
to resolve this issue.
System Information:
Slurm Version: 21.08.5
OS: Ubuntu 22.04.4 LTS (Jammy)
Job Submission Script:
#!/usr/bin/env bash
#SBATCH -N 1
#SBATCH -D .
#SBATCH --output=%j.out
#SBATCH --error=%j.err
##SBATCH --time=2-00:00:00
#SBATCH --ntasks=36
#SBATCH --mem=64G
echo '#######################################################'
echo "date = $(date)"
echo "hostname = $(hostname -s)"
echo "pwd = $(pwd)"
echo "sbatch = $(which sbatch | xargs realpath -e)"
echo ""
echo "WORK_DIR = $WORK_DIR"
echo "SLURM_SUBMIT_DIR = $SLURM_SUBMIT_DIR"
echo "SLURM_JOB_NUM_NODES = $SLURM_JOB_NUM_NODES"
echo "SLURM_NTASKS = $SLURM_NTASKS"
echo "SLURM_NTASKS_PER_NODE = $SLURM_NTASKS_PER_NODE"
echo "SLURM_CPUS_PER_TASK = $SLURM_CPUS_PER_TASK"
echo "SLURM_JOBID = $SLURM_JOBID"
echo "SLURM_JOB_NODELIST = $SLURM_JOB_NODELIST"
echo "SLURM_NNODES = $SLURM_NNODES"
echo "SLURMTMPDIR = $SLURMTMPDIR"
echo '#######################################################'
echo ""
module purge > /dev/null 2>&1
module load vasp
ulimit -s unlimited
mpirun vasp_std
Performance Observation:
When running the job through Slurm:
werner@x13dai-t:~/Public/hpc/servers/benchmark/Cr72_3x3x3K_350eV_10DAV$
grep LOOP OUTCAR
LOOP: cpu time 14.4893: real time 14.5049
LOOP: cpu time 14.3538: real time 14.3621
LOOP: cpu time 14.3870: real time 14.3568
LOOP: cpu time 15.9722: real time 15.9018
LOOP: cpu time 16.4527: real time 16.4370
LOOP: cpu time 16.7918: real time 16.7781
LOOP: cpu time 16.9797: real time 16.9961
LOOP: cpu time 15.9762: real time 16.0124
LOOP: cpu time 16.8835: real time 16.9008
LOOP: cpu time 15.2828: real time 15.2921
LOOP+: cpu time 176.0917: real time 176.0755
When running the job directly with mpirun:
werner@x13dai-t:~/Public/hpc/servers/benchmark/Cr72_3x3x3K_350eV_10DAV$
mpirun -n 36 vasp_std
werner@x13dai-t:~/Public/hpc/servers/benchmark/Cr72_3x3x3K_350eV_10DAV$
grep LOOP OUTCAR
LOOP: cpu time 9.0072: real time 9.0074
LOOP: cpu time 9.0515: real time 9.0524
LOOP: cpu time 9.1896: real time 9.1907
LOOP: cpu time 10.1467: real time 10.1479
LOOP: cpu time 10.2691: real time 10.2705
LOOP: cpu time 10.4330: real time 10.4340
LOOP: cpu time 10.9049: real time 10.9055
LOOP: cpu time 9.9718: real time 9.9714
LOOP: cpu time 10.4511: real time 10.4470
LOOP: cpu time 9.4621: real time 9.4584
LOOP+: cpu time 110.0790: real time 110.0739
Could you provide any insights or suggestions on what might be causing
this performance issue? Are there any specific configurations or
settings in Slurm that I should check or adjust to align the
performance more closely with the direct mpirun execution?
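One variation I could try, if the gap comes from Slurm placing the 36 tasks on
hyperthreads instead of separate physical cores (assuming SMT is enabled on
this node), is to resubmit with multithreading disabled; this is only a sketch
of the relevant lines:

#SBATCH -N 1
#SBATCH --ntasks=36
#SBATCH --mem=64G
#SBATCH --hint=nomultithread   # bind tasks to physical cores only
module load vasp
ulimit -s unlimited
mpirun vasp_std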
Thank you for your time and assistance.
Best regards,
Zhao
--
Assoc. Prof. Hongsheng Zhao <hongyi.zhao(a)gmail.com>
Theory and Simulation of Materials
Hebei Vocational University of Technology and Engineering
No. 473, Quannan West Street, Xindu District, Xingtai, Hebei province
We have several nodes, most of which have different Linux distributions
(distro for short). The controller has a different distro as well. The only
thing the controller and all the nodes have in common is that all of them are
x86_64.
I can install Slurm using the package manager on all the machines, but this
will not work because the controller would have a different version of Slurm
than the nodes (21.08 vs 23.11).
If I build from source then I see two solutions:
- build a deb package
- build a custom package (./configure, make, make install)
Building a Debian package on the controller and then distributing the
binaries to the nodes won't work either, because those binaries will look for
the shared libraries they were built against, and those don't exist on the
nodes.
So the only solution I have is to build a static binary using a custom
package. Am I correct or is there another solution here?
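For reference, the custom-package route from the list above would look roughly
like this, run on each node type so the binaries link against that distro's
own libraries (the version and paths are only examples; whether the result can
instead be made fully static is exactly my question):

tar xjf slurm-23.11.7.tar.bz2
cd slurm-23.11.7
./configure --prefix=/opt/slurm/23.11 --sysconfdir=/etc/slurm
make -j"$(nproc)"
make install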
Hi,
We are trying out slurm having been running grid engine for a long while.
In grid engine, the cgroup peak memory and max_rss are captured at the end of a job and recorded. It logs the information from the cgroup hierarchy as well as doing a getrusage call, right at the end, on the parent pid of the whole job "container" before cleaning up.
With slurm it seems that the only way memory is recorded is by the acct gather
polling. I am trying to add something in an epilog script to get the
memory.peak, but it looks like the cgroup hierarchy has been destroyed by the
time the epilog is run.
Where in the code is the cgroup hierarchy cleaned up? Is there no way to hook
in so that the accounting is updated during the job cleanup process and peak
memory usage can be accurately logged?
I can reduce the polling interval from 30s to 5s, but I don't know whether
that adds much overhead, and in any case polling does not seem a sensible way
to get values that should simply be determined by an event right at the end.
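For context, the polling interval I am referring to is the acct gather
frequency in slurm.conf; absent a better event-driven hook, what I would
change is something like the following (the gather plugin and values are just
examples, not necessarily what we run):

JobAcctGatherType=jobacct_gather/cgroup
JobAcctGatherFrequency=task=5

and then read the recorded peaks back afterwards with e.g.

sacct -j <jobid> --format=JobID,MaxRSS,TRESUsageInMax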
Many thanks,
Emyr
We are pleased to announce the availability of Slurm version 23.11.7.
The 23.11.7 release fixes a few potential crashes in slurmctld when
using less common options on job submission, slurmrestd compatibility
with auth/slurm, and some additional minor and moderate severity bugs.
Slurm can be downloaded from https://www.schedmd.com/downloads.php .
-Marshall
> -- slurmrestd - Correct OpenAPI specification for
> 'GET /slurm/v0.0.40/jobs/state' having response as null.
> -- Allow running jobs on overlapping partitions if jobs don't specify -s.
> -- Fix segfault when requesting a shared gres along with an exclusive
> allocation.
> -- Fix regression in 23.02 where afternotok and afterok dependencies were
> rejected for federated jobs not running on the origin cluster of the
> submitting job.
> -- slurmctld - Disable job table locking while job state cache is active when
> replying to `squeue --only-job-state` or `GET /slurm/v0.0.40/jobs/state`.
> -- Fix sanity check when setting tres-per-task on the job allocation as well as
> the step.
> -- slurmrestd - Fix compatibility with auth/slurm.
> -- Fix issue where TRESRunMins gets off correct value if using
> QOS UsageFactor != 1.
> -- slurmrestd - Require `user` and `association_condition` fields to be
> populated for requests to 'POST /slurmdb/v0.0.40/users_association'.
> -- Avoid a slurmctld crash with extra_constraints enabled when a job requests
> certain invalid --extra values.
> -- `scancel --ctld` and `DELETE /slurm/v0.0.40/jobs` - Fix support for job
> array expressions (e.g. 1_[3-5]). Also fix signaling a single pending array
> task (e.g. 1_10), which previously signaled the whole array job instead.
> -- Fix a possible slurmctld segfault when at some point we failed to create an
> external launcher step.
> -- Allow the slurmctld to open a connection to the slurmdbd if the first
> attempt fails due to a protocol error.
> -- mpi/cray_shasta - Fix launch for non-het-steps within a hetjob.
> -- sacct - Fix "gpuutil" TRES usage output being incorrect when using --units.
> -- Fix a rare deadlock on slurmctld shutdown or reconfigure.
> -- Fix issue that only left one thread on each core available when "CPUs=" is
> configured to total thread count on multi-threaded hardware and no other
> topology info ("Sockets=", "CoresPerSocket", etc.) is configured.
> -- Fix the external launcher step not being allocated a VNI when requested.
> -- jobcomp/kafka - Fix payload length when producing and sending a message.
> -- scrun - Avoid a crash if RunTimeDelete is called before the container
> finishes.
> -- Save the slurmd's cred_state while reconfiguring to prevent the loss of job
> credentials.
Slurm User Group (SLUG) 2024 is set for September 12-13 at the
University of Oslo in Oslo, Norway.
Registration information and a high-level schedule can be found
here: https://slug24.splashthat.com/
The deadline to submit a presentation abstract is Friday, May 31st. We
do not intend to extend this deadline.
If you are interested in presenting your own usage, developments, site
report, tutorial, etc. about Slurm, please fill out the following
form: https://forms.gle/N7bFo5EzwuTuKkBN7
Notifications of final presentations accepted will go out by Friday, June 14th.
--
Victoria Hobson
SchedMD LLC
Vice President of Marketing
Hi,
At our site we have recently upgraded to Slurm 23.11.5 and are having trouble
with MPI jobs that call srun inside an sbatch'ed script.
The cgroup does not appear to be set up correctly for the srun (step_0).
As an example
$ cat /sys/fs/cgroup/cpuset/slurm/uid_11000..../job..../cpuset.cpus
0,2-3,68-69,96,98-99,164-165
$ cat /sys/fs/cgroup/cpuset/slurm/uid_11000..../job..../step_0/cpuset.cpus
0,2,68,96,98,164
The sbatch is allocated a range of CPUs in the cgroup. However, when step_0 is
run, only some of those CPUs are in the group.
I have noticed that it is always the range portion which is missing, i.e. for
2-5 only 2 is included; 3, 4 and 5 are missing.
This also only happens if there are multiple groups of CPUs in the allocation,
i.e. 1-12 alone would be fine, but 1-12,15-20 would result in only 1,15.
The sbatch also seems fine, with step_batch and step_extern being allocated correctly.
This causes numerous issues with MPI jobs as they end up overloading cpus.
We are running our nodes with threading enabled on the CPUs, and with cgroups and affinity plugins.
I have attached our slurm.conf to show our settings.
Our /etc/slurm/cgroup.conf is
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
We have turned on logging at the debug2 level, but I haven't yet found
anything useful. Happy to take suggestions on what to look for.
Is anyone able to provide any advice on where to go next to try and identify the issue?
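In case anyone wants to reproduce or compare, the check I have been making is
roughly the following (the job ID is a placeholder):

# what slurmctld allocated to the job (CPU_IDs per node)
scontrol -d show job <jobid>
# what a step task is actually allowed, launched from inside the batch script
srun bash -c 'grep Cpus_allowed_list /proc/self/status'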
Regards,
Ashley Wright