October 2024 - slurm-users - lists.schedmd.com

license server redundancy
by Yves Kondoszek 03 Oct '24

Hello.

I'm configuring a cluster that will run jobs that use FlexLM licenses. We have what I believe is a fairly standard license server configuration, though I haven't found anything about it in the documentation or in the mailing list archives: the same features are often provided by several servers. For example, the license token "simulation" may be available at 1717@server1, 1717@server2 and 4000@server3. FlexLM clients support this through a path-style environment variable, typically LM_LICENSE_FILE=1717@server1:1717@server2:4000@server3.

I have set up Slurm for remote dynamic licenses and update them live through "LastConsumed" from lmstat requests, so that Slurm is aware of users checking licenses out externally (outside Slurm's control). My problem is that the documentation clearly states: "When submitting jobs to remote licenses, the name and server must be used." So I have to choose a specific server:

$ sbatch -L simulation@server2 script.sh

instead of telling Slurm to use whatever license is available on any of the servers.

Does anyone have experience with redundant FlexLM licenses spread across several servers? Is there a way Slurm could support this?
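For illustration only, the server choice could be made by a small wrapper outside Slurm before calling sbatch. The sketch below is not a Slurm feature: the function names and the per-server (total, used) counts are hypothetical, and in practice the counts would have to come from lmstat or from the license records Slurm already tracks.

```python
from typing import Dict, Optional, Tuple


def pick_license_server(servers: Dict[str, Tuple[int, int]]) -> Optional[str]:
    """Return the first server with at least one free token, or None.

    `servers` maps a Slurm license server name to (total, used) token
    counts. The counts are hypothetical inputs for this sketch.
    """
    for server, (total, used) in servers.items():
        if total - used >= 1:
            return server
    return None


def license_argument(feature: str,
                     servers: Dict[str, Tuple[int, int]]) -> Optional[str]:
    """Build the value for `sbatch -L`, e.g. 'simulation@server2'."""
    server = pick_license_server(servers)
    if server is None:
        return None
    return f"{feature}@{server}"


# Hypothetical snapshot: server1 is fully checked out, server2 has room.
snapshot = {
    "server1": (10, 10),
    "server2": (10, 7),
    "server3": (5, 0),
}

print(license_argument("simulation", snapshot))
```

A wrapper like this only picks a server at submission time, so the snapshot may be stale by the time the job actually starts; it does not give Slurm a real "any of these servers" semantic.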

GPU Accounting
by Emyr James 03 Oct '24

We have a node with 8 H100 GPUs that are split into MIG instances. We are using cgroups, and this seems to work fine: users can run something like sbatch --gres="gpu:1g.10gb:1"... and the job starts on the node with the GPUs. CUDA visible devices and the PyTorch debug output show that the cgroup gives them only the GPU they asked for.

In the accounting database, however, the "gres_used" column in the job table is always empty. I'd expect to see "gpu:1g.10gb:1" for the job above. I have this set in slurm.conf:

AccountingStorageTRES=gres/gpu

How can I see which gres was requested with the job? At the moment I only see something like this in AllocTres:

billing=1,cpu=1,gres/gpu=1,mem=8G,node=1

and I can't find any way to see which specific MIG GPU was asked for.

This is related to the email from Richard Lefebvre dated 7th June 2023 entitled "Billing/accounting for MIGs is not working". As far as I can see, it got no replies.

We are running Slurm version 23.11.6.

Regards,
Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation
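To make the gap concrete, here is a minimal sketch that splits the AllocTres string quoted above into its entries and looks for a GPU entry that carries a type. The helper names are hypothetical, and the "name:type" form it searches for is an assumption about how a typed GPU would be recorded if it were tracked.

```python
from typing import Dict


def parse_tres(tres: str) -> Dict[str, str]:
    """Split a comma-separated 'name=value' TRES string into a dict."""
    entries = {}
    for item in tres.split(","):
        if not item:
            continue
        name, _, value = item.partition("=")
        entries[name] = value
    return entries


def typed_gpu_entries(entries: Dict[str, str]) -> Dict[str, str]:
    """Keep only GPU entries that name a type (assumed 'gres/gpu:TYPE')."""
    return {k: v for k, v in entries.items() if k.startswith("gres/gpu:")}


alloc = parse_tres("billing=1,cpu=1,gres/gpu=1,mem=8G,node=1")
print(alloc)
print(typed_gpu_entries(alloc))
```

For the string above, the typed lookup comes back empty: only the generic gres/gpu count is present, which is exactly the missing information described in this post.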

SLUG Presentations Now Online!
by Victoria Hobson 01 Oct '24

Available presentations from this year's SLUG event are now online. They can be found at https://www.schedmd.com/publications/

We thank all those who presented and attended for a great event!

--
Victoria Hobson
SchedMD LLC
Vice President of Marketing

Job Step State
by Emyr James 01 Oct '24

Dear all,

I am working on a script to take completed job accounting data from the Slurm accounting database and insert the equivalent data into a ClickHouse table for fast reporting. I can see that all the information is included in the cluster_job_table and cluster_job_step_table, which seem to be joined on job_db_inx.

To get the CPU usage, peak memory usage, etc., I can see that I need to parse the tres columns in the job steps. I couldn't find any column called MaxRSS in the database even though the sacct command prints it. I then found some data in tres_table and assume that sacct is using this. Please correct me if I'm wrong and sacct is getting its information from somewhere other than the accounting database.

For the state column I get this:

select state, count(*) as num from crg_step_table group by state order by num desc limit 10;
+-------+--------+
| state | num    |
+-------+--------+
|     3 | 590635 |
|     5 |  28345 |
|     4 |   4401 |
|    11 |    962 |
|     1 |      8 |
+-------+--------+

When I use sacct I see statuses such as COMPLETED, OUT_OF_MEMORY, etc., so there must be a mapping somewhere between these state ids and that text. Can someone provide that mapping or point me to where it's defined in the database or in the code?

Many thanks,
Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation
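As a starting point for such a script, here is a minimal sketch of a lookup from the numeric state to the text sacct prints. It assumes the values follow the job_states enum in Slurm's slurm.h (JOB_PENDING = 0 through JOB_OOM = 11) and that the low byte holds the base state (JOB_STATE_BASE); both should be checked against the headers of the Slurm version in use. The function name is hypothetical.

```python
# Assumed to mirror `enum job_states` in slurm.h; verify for your version.
JOB_STATE_BASE = 0x000000FF  # assumption: low byte holds the base state

JOB_STATES = {
    0: "PENDING",
    1: "RUNNING",
    2: "SUSPENDED",
    3: "COMPLETED",
    4: "CANCELLED",
    5: "FAILED",
    6: "TIMEOUT",
    7: "NODE_FAIL",
    8: "PREEMPTED",
    9: "BOOT_FAIL",
    10: "DEADLINE",
    11: "OUT_OF_MEMORY",
}


def state_name(state: int) -> str:
    """Translate a numeric state from the step table to sacct-style text."""
    base = state & JOB_STATE_BASE
    return JOB_STATES.get(base, f"UNKNOWN({base})")


# The counts from the query above, relabelled.
counts = [(3, 590635), (5, 28345), (4, 4401), (11, 962), (1, 8)]
for state, num in counts:
    print(f"{state_name(state):<14} {num}")
```

Under that assumption, the five values in the query result correspond to COMPLETED, FAILED, CANCELLED, OUT_OF_MEMORY and RUNNING.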

Hardcoded CGroups v2 Slice
by Khalid Al-Hawaj 01 Oct '24

Hello,

I am in the process of setting up SLURM for use in a profiling cluster. The purpose of SLURM is to allow users to submit jobs to be profiled. Latency is a very important factor in profiling the applications correctly. I was able to leverage cgroups v2 to isolate user.slice from the cores that would be used by SLURM jobs. The issue is that slurmstepd shares its resources with system.slice.

I was digging through the code, and I saw that the scope is created here: https://github.com/SchedMD/slurm/blob/master/src/plugins/cgroup/v2/cgroup_v…

I also noticed that the slice is hardcoded in the following line: https://github.com/SchedMD/slurm/blob/master/src/plugins/cgroup/v2/cgroup_v…

So my question is: why is the slice hardcoded? What was the reason behind that decision? I would have thought that the slice would be set through cgroup.conf instead.

I would like to move slurmstepd to a slice other than system.slice; by doing so, I could isolate cores better by making sure that service processes are kept off the cores used for SLURM jobs. I can certainly change the defined value in the code and recompile. Is there anything to consider before doing so?

Thanks,
Khalid
