Dear all,
I am working on a script that takes completed job accounting data from the Slurm accounting database and inserts the equivalent data into a ClickHouse table for fast reporting.
I can see that all the information is included in the cluster_job_table and cluster_job_step_table, which seem to be joined on job_db_inx.
To get the CPU usage, peak memory usage, etc., I can see that I need to parse the tres columns in the job steps. I couldn't find any column called MaxRSS in the database even though the sacct command prints this. I then found some data in tres_table and assume that sacct is using this. Please correct me if I'm wrong: is sacct getting its information from somewhere other than the accounting database?
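For reference, this is the query I have been using to decode the numeric ids that appear in those tres strings (the id/type/name columns are what I see in my own database, so please correct me if sacct resolves them differently):
select id, type, name from tres_table;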
For the state column I get this:
select state, count(*) as num from crg_step_table group by state order by num desc limit 10;
+-------+--------+
| state | num |
+-------+--------+
| 3 | 590635 |
| 5 | 28345 |
| 4 | 4401 |
| 11 | 962 |
| 1 | 8 |
+-------+--------+
When I use sacct I see states such as COMPLETED, OUT_OF_MEMORY, etc., so there must be a mapping somewhere between these state ids and that text. Can someone provide that mapping, or point me to where it's defined in the database or in the code?
Many thanks,
Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation
Hello,
I am in the process of setting up SLURM to be used in a profiling cluster.
The purpose of SLURM is to allow users to submit jobs to be profiled. Latency
is a very important aspect of profiling the applications correctly.
I was able to leverage cgroups v2 to isolate user.slice from the cores
that would be used by SLURM jobs. The issue is that slurmstepd shares the
resources with system.slice; I was digging through the code, and I saw that
the creation of the scope is here:
https://github.com/SchedMD/slurm/blob/master/src/plugins/cgroup/v2/cgroup_v…
And I noticed that the slice is hardcoded in the following line:
https://github.com/SchedMD/slurm/blob/master/src/plugins/cgroup/v2/cgroup_v…
So my question now is: why is the slice hardcoded? What was the reason behind
that decision? I would have thought the slice would be configurable through
cgroup.conf instead.
I would like to switch the slice for slurmstepd to a slice other than
system.slice; by doing so, I would be able to isolate cores better by
making sure that services' processes are isolated from the cores used for
SLURM jobs. I can definitely change the defined value in the code and
recompile. Is there anything to consider before doing so?
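For context, the user.slice isolation I mentioned above is done along these lines (just a sketch; the CPU range is a placeholder for the cores reserved for non-Slurm work):
systemctl set-property user.slice AllowedCPUs=0-7
Being able to do the equivalent for the slice that slurmstepd lands in, without recompiling, is essentially what I am after.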
Thanks,
Khalid
Awesome, thanks Victoria!
Cheers,
--
Kilian
On Thu, Sep 26, 2024 at 11:17 AM Victoria Hobson <victoria(a)schedmd.com>
wrote:
> Hi Kilian,
>
> We're getting these posted now and an email will go out when they are
> available!
>
> Thanks,
>
>
> Victoria Hobson
>
> *Vice President of Marketing *
>
> 909.609.8889
>
> www.schedmd.com
>
>
> On Mon, Sep 23, 2024 at 10:49 AM Kilian Cavalotti via slurm-users <
> slurm-users(a)lists.schedmd.com> wrote:
>
>> Hi SchedMD,
>>
>> I'm sure they will eventually, but do you know when the slides of the
>> SLUG'24 presentation will be available online at
>> https://slurm.schedmd.com/publications.html, like previous editions'?
>>
>> Thanks!
>> --
>> Kilian
>>
>> --
>> slurm-users mailing list -- slurm-users(a)lists.schedmd.com
>> To unsubscribe send an email to slurm-users-leave(a)lists.schedmd.com
>>
>
--
Kilian
Hi all,
We hit a snag when updating our clusters from Slurm 23.02 to 24.05. After updating the slurmdbd, our multi-cluster setup was broken until everything was updated to 24.05. We had not anticipated this.
SchedMD says that fixing it would be a very complex operation.
Hence this warning to everybody planning to update: make sure to quickly update everything once you've updated the slurmdbd daemon.
Reference: https://support.schedmd.com/show_bug.cgi?id=20931
Ward
Hi,
On our cluster we have some jobs that are queued even though there are available nodes to run on. The listed reason is "priority" but that doesn't really make sense to me. Slurm isn't picking another job to run on those nodes; it's just not running anything at all. We do have a quite heterogeneous cluster, but as far as I can tell the queued jobs aren't requesting anything that would preclude them from running on the idle nodes. They are array jobs, if that makes a difference.
Thanks for any help you all can provide.
Hello,
We are looking for a method to limit the TRES used by each user on a per-node basis. For example, we would like to limit the total memory allocation of jobs from a user to 200G per node.
There is MaxTRESPerNode (https://slurm.schedmd.com/sacctmgr.html#OPT_MaxTRESPerNode), but unfortunately this is a per-job limit, not a per-user one.
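(For reference, that per-job limit is the kind of thing set with something like "sacctmgr modify qos name=normal set MaxTRESPerNode=mem=200G", with the QOS name as a placeholder; it caps each individual job's per-node memory rather than the combined usage of all of a user's jobs on a node.)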
Ideally, we would like to apply this limit on partitions and/or QoS. Does anyone know if this is possible and how to achieve it?
Thank you,
Hi all,
I recently wrote a SLURM input plugin [0] for Telegraf [1].
I just wanted to let the community know so that you can use it if you'd
find that useful.
Maybe its existence can also be included in the documentation somewhere?
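If you want to give it a spin, a sample configuration stanza can be generated the usual Telegraf way (assuming your Telegraf build is recent enough to bundle the plugin), and the README in [0] should describe the available options:
telegraf --input-filter slurm config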
Anyway, thanks a ton for your time,
Pablo Collado Soto
References:
0: https://github.com/influxdata/telegraf/tree/master/plugins/inputs/slurm
1: https://www.influxdata.com/time-series-platform/telegraf/
+ -------------------------------------- +
| Never let your sense of morals prevent |
| you from doing what is right. |
| -- Salvor Hardin, "Foundation" |
+ -------------------------------------- +
Hello,
We have a new cluster and I'm trying to set up fairshare accounting, tracking CPU, MEM and GPU. It seems that billing for individual jobs is correct, but billing isn't being accumulated (TRESRunMins is always 0).
In my slurm.conf, I think the relevant lines are
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageTRES=gres/gpu
PriorityFlags=MAX_TRES
PartitionName=gpu Nodes=node[1-7] MaxCPUsPerNode=384 MaxTime=7-0:00:00 State=UP TRESBillingWeights="CPU=1.0,MEM=0.125G,GRES/gpu=9.6"
PartitionName=cpu Nodes=node[1-7] MaxCPUsPerNode=182 MaxTime=7-0:00:00 State=UP TRESBillingWeights="CPU=1.0,MEM=0.125G,GRES/gpu=9.6"
I currently have one recently finished job and one running job. sacct gives
$ sacct --format=JobID,JobName,ReqTRES%50,AllocTRES%50,TRESUsageInAve%50,TRESUsageInMax%50
JobID JobName ReqTRES AllocTRES TRESUsageInAve TRESUsageInMax
------------ ---------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- --------------------------------------------------
154 interacti+ billing=9,cpu=1,gres/gpu=1,mem=1G,node=1 billing=9,cpu=2,gres/gpu=1,mem=2G,node=1
154.interac+ interacti+ cpu=2,gres/gpu=1,mem=2G,node=1 cpu=00:00:00,energy=0,fs/disk=2480503,mem=3M,page+ cpu=00:00:00,energy=0,fs/disk=2480503,mem=3M,page+
155 interacti+ billing=9,cpu=1,gres/gpu=1,mem=1G,node=1 billing=9,cpu=2,gres/gpu=1,mem=2G,node=1
155.interac+ interacti+ cpu=2,gres/gpu=1,mem=2G,node=1
billing=9 seems correct to me, since I have 1 GPU allocated, which has the largest score: with MAX_TRES, the allocation of cpu=2, mem=2G, gres/gpu=1 gives max(2 × 1.0, 2 × 0.125, 1 × 9.6) = 9.6, stored as billing=9. However, sshare doesn't show anything in TRESRunMins:
sshare --format=Account,User,RawShares,FairShare,RawUsage,EffectvUsage,TRESRunMins%110
Account User RawShares FairShare RawUsage EffectvUsage TRESRunMins
-------------------- ---------- ---------- ---------- ----------- ------------- --------------------------------------------------------------------------------------------------------------
root 21589714 1.000000 cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0,gres/gpumem=0,gres/gpuutil=0
abrol_group 2000 0 0.000000 cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0,gres/gpumem=0,gres/gpuutil=0
luchko_group 2000 21589714 1.000000 cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0,gres/gpumem=0,gres/gpuutil=0
 luchko_group tluchko 1 0.333333 21589714 1.000000 cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0,gres/gpumem=0,gres/gpuutil=0
Why is TRESRunMins all 0 for tluchko while RawUsage is not? I have checked that slurmdbd is running.
Thank you,
Tyler
Hi SchedMD,
I'm sure they will eventually, but do you know when the slides of the
SLUG'24 presentation will be available online at
https://slurm.schedmd.com/publications.html, like previous editions'?
Thanks!
--
Kilian
Hi
I'm using dynamic nodes with "slurmd -Z" on Slurm 23.11.1.
Firstly, I find that when you do "scontrol show node" it shows the NodeAddr as an IP rather than the NodeName. Because I'm playing around with running this in containers on Docker Swarm, this IP can be wrong. I can force it with scontrol update, but after a while something updates it to something else again. Does anybody know whether this is done by slurmd, slurmctld, or something else?
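(The forcing I mentioned is roughly "scontrol update nodename=node01 nodeaddr=node01", with node01 as a placeholder, but as said it gets overwritten again after a while.)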
How can I stop this from happening?
How can I get the node to register with the hostname rather than the IP?
cheers,
Jakub
Hello,
Issue 1:
I am using Slurm version 24.05.1; my slurmd has a single node where I connect multiple GRES by enabling the oversubscribe feature.
I am able to get the advance reservation of GRES to work only when the GRES name is used (tres=gres/gpu:SYSTEM12), i.e. during the reservation period, if another user submits a job naming SYSTEM12, Slurm places that job in the queue:
user1@host$ srun --gres=gpu:SYSTEM12:1 hostname
srun: job 333 queued and waiting for resources
but when other users just submit a job without any system name, the job goes through on that GRES immediately even though it is reserved:
user1@host$ srun --gres=gpu:1 hostname
mylinux.wbi.com
Also, I can see GresUsed as busy using "scontrol show node -d", which means the job is running on the GRES/GPU and not just on the CPUs.
In the same way, job submission based on a Feature ("rev1" in my case) also goes through even though it is reserved for other users, in a multi-partition Slurm setup.
Snippet of slurm.conf:
NodeName=cluster01 NodeAddr=cluster Port=6002 CPUs=8 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 Feature="rev1" Gres=gpu:SYSTEM12:1 RealMemory=64171 State=IDLE
Issue 2:
During execution, Slurm prints some extra errors in the srun output:
user1@host$ srun --gres=gpu:1 hostname
srun: error: extract_net_cred: net_cred not provided
srun: error: Malformed RPC of type RESPONSE_NODE_ALIAS_ADDRS(3017)
received
srun: error:
slurm_unpack_received_msg: [[inv1715771615.nxdi.us-aus01.nxp.com]:41242]
Header lengths are longer than data received
mylinux.wbi.com
Regards,
MS
Dear slurm-user list,
I have a cloud node that is powered up and down on demand. Rarely it can
happen that Slurm's ResumeTimeout is reached and the node is therefore
powered down. We have set ReturnToService=2 in order to avoid the node
being marked down, because the instance behind that node is created on
demand, so after a failure nothing stops the system from starting the
node again, as it is a different instance.
I thought this would be enough, but apparently the node is still marked
with "NOT_RESPONDING", which leads to Slurm not trying to schedule on it.
After a while NOT_RESPONDING is removed, but I would like to remove it
directly from within my fail script if possible, so that the node can
return to service immediately and not be blocked by "NOT_RESPONDING".
Best regards,
Xaver
OS: CentOS 8.5
Slurm: 22.05
Recently upgraded to 22.05. Upgrade was successful, but after a while I started to see the following messages in the slurmdbd.log file:
error: We have more time than is possible (9344745+7524000+0)(16868745) > 12362400 for cluster CLUSTERNAME(3434) from 2024-09-18T13:00:00 - 2024-09-18T14:00:00 tres 1 (this may happen if oversubscription of resources is allowed without Gang)
We do have partitions with overlapping nodes, but we do not have "Suspend,Gang" set as the global PreemptMode; it is currently set to REQUEUE.
I have also checked sacct and there are no runaway jobs listed.
Oversubscription is not enabled on any of the queues either.
Do I need to modify my Slurm config to address this, or is it an error condition caused by the upgrade?
Thank you,
SS
Hello,
Is it possible to change a pending job from --exclusive to
--exclusive=user? I tried scontrol update jobid=... oversubscribe=user,
but it seems to only accept yes or no.
Gerhard
Hello
We have another batch of new users and some more batches of large array jobs with very short runtimes, due to errors in the jobs or just by design. While trying to deal with these issues (setting ArrayTaskThrottle and user education), I had a thought: it would be very nice to have a limit on how many jobs can start per minute for each user. If someone posted a 200000-task array job with 15-second tasks, the scheduler wouldn't launch more than 100 or 200 per minute and would be less likely to bog down, but if the tasks had longer runtimes (1 hour +) it would only take a few extra minutes to start using all the resources they are allowed, without adding much overall delay to the whole set of jobs.
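(For reference, the throttle I mention above is the per-array limit set at submission time, e.g. "sbatch --array=1-200000%200", or afterwards with "scontrol update jobid=<id> ArrayTaskThrottle=200"; it caps how many tasks run at once, not how fast new tasks are started, which is why it only partly helps here.)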
I thought about adding something to our CLI filter, but usually these jobs ask for a runtime of 3-4 hours even though they run for <30 seconds, so the submit options don't indicate the problem jobs ahead of time.
We currently limit our users to 80% of the available resources, which is more than enough for Slurm to bog down with fast-turnover jobs, but we have users who complain that they can't use the other 20% when the cluster is not busy, so putting in lower default restrictions is not currently an option.
Has this already been discussed and found not to be feasible for technical reasons? (I haven't found anything like this yet searching the archives.)
I think Slurm used to have a feature-request severity on their bug submission site. Is there a severity level they prefer for suggestions like this?
Thanks
Dear all SLUG attendees!
The information about which buildings/addresses the SLUG reception and
presentations will be held in is not very visible on
https://slug24.splashthat.com. There is a map there with all locations
(https://www.google.com/maps/d/u/0/edit?mid=1bcGaTiW0TNB5noQsjQ3ulctzKuqlGrQ…),
but I've gotten questions about it, so:
The reception on Wednesday will be held on the top floor of Oslo Science Park
(Forskningsparken). Address: Gaustadalléen 21. There will be someone
in the reception who can point you in the right direction.
The presentations will be held in auditorium 3 in Helga Engs Hus ("Helga
Eng's House"). Address: Sem Sælands vei 7. Lunch will be in the
canteen in the same building.
The closest subway station to both these buildings is Blindern Subway
Station (Blindern T-banestasjon).
Looking forward to seeing you there!
--
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
Hi,
This is a follow-up from
https://groups.google.com/g/slurm-users/c/JI3UkbCtj3U, but as I could not
find any progress, I am opening a new thread.
When setting IgnoreSystemd=yes in cgroup.conf, I get the following errors:
error: common_file_write_content: unable to open
'/sys/fs/cgroup/system.slice/cgroup.subtree_control' for writing: No such
file or directory
error: Cannot enable cpuset in
/sys/fs/cgroup/system.slice/cgroup.subtree_control: No such file or
directory
error: common_file_write_content: unable to open
'/sys/fs/cgroup/system.slice/cgroup.subtree_control' for writing: No such
file or directory
error: Cannot enable memory in
/sys/fs/cgroup/system.slice/cgroup.subtree_control: No such file or
directory
error: common_file_write_content: unable to open
'/sys/fs/cgroup/system.slice/cgroup.subtree_control' for writing: No such
file or directory
error: Cannot enable cpu in
/sys/fs/cgroup/system.slice/cgroup.subtree_control: No such file or
directory
error: Could not create scope directory
/sys/fs/cgroup/system.slice/slurmstepd.scope: No such file or directory
error: Couldn't load specified plugin name for cgroup/v2: Plugin init()
callback failed
error: cannot create cgroup context for cgroup/v2
error: Unable to initialize cgroup plugin
error: slurmd initialization failed
I wrote a patch that solves this issue:
--- cgroup_v2.c.orig 2024-09-02 13:18:21.376312875 +0200
+++ cgroup_v2.c 2024-09-02 13:22:00.516986953 +0200
@@ -43,6 +43,7 @@
#include <sys/inotify.h>
#include <poll.h>
#include <unistd.h>
+#include <libgen.h>
#include "slurm/slurm.h"
#include "slurm/slurm_errno.h"
@@ -743,11 +744,33 @@
return SLURM_SUCCESS;
}
+static int _mkdir(const char *path, mode_t mode)
+{
+ int rc;
+ char *dir, *pdir;
+
+ dir = strdup(path);
+ if (dir == NULL) {
+ return ENOMEM;
+ }
+ pdir = dirname(dir);
+ if (strcmp(pdir, path) != 0) {
+ rc = _mkdir(pdir, mode);
+ if (rc && (errno != EEXIST)) {
+ free(dir);
+ return rc;
+ }
+ }
+ rc = mkdir(path, mode);
+ free(dir);
+ return rc;
+}
+
static int _init_new_scope(char *scope_path)
{
int rc;
- rc = mkdir(scope_path, 0755);
+ rc = _mkdir(scope_path, 0755);
if (rc && (errno != EEXIST)) {
error("Could not create scope directory %s: %m", scope_path);
return SLURM_ERROR;
This patch concerns the file src/plugins/cgroup/v2/cgroup_v2.c. Am I
missing something?
Cheers,
Honoré.
Hi,
We have a number of machines in our compute cluster that have larger disks
available for local data. I would like to add them to the same partition as
the rest of the nodes but assign them a larger TmpDisk value which would
allow users to request a larger tmp and land on those machines.
The main hurdle is that (for reasons beyond my control) the larger local
disks are on a special mount point /largertmp whereas the rest of the
compute cluster uses the vanilla /tmp. I can't see an obvious way to make
this work as the TmpFs value appears to be global only and attempting to
set TmpDisk to a value larger than TmpFs for those nodes will put the
machine into an invalid state.
I couldn't see any similar support tickets or anything in the mail archive
but I wouldn't have thought it would be that unusual to do this.
Thanks in advance!
Jake
Hi,
With
$ salloc --version
slurm 23.11.10
and
$ grep LaunchParameters /etc/slurm/slurm.conf
LaunchParameters=use_interactive_step
the following
$ salloc --partition=interactive --ntasks=1 --time=00:03:00 --mem=1000 --qos=standard
salloc: Granted job allocation 18928869
salloc: Nodes c001 are ready for job
creates a job
$ squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
18928779 interacti interact loris R 1:05 1 c001
but causes the terminal to block.
From a second terminal I can log into the compute node:
$ ssh c001
[13:39:36] loris@c001 (1000) ~
Is that the expected behaviour or should salloc return a shell directly
on the compute node (like srun --pty /bin/bash -l used to do)?
Cheers,
Loris
--
Dr. Loris Bennett (Herr/Mr)
FUB-IT, Freie Universität Berlin
Is there a description of the “nodelist” syntax and semantics somewhere other than the source code? By “nodelist” I mean expressions like “name[000,099-100]” and how this one, for example, expands to “name000, name099, name100”.
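(The kind of expansion I mean is what "scontrol show hostnames" performs, e.g. running scontrol show hostnames "name[000,099-100]" prints name000, name099 and name100 one per line; I am looking for where the accepted syntax itself is documented.)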
--
Gary
Hello all,
I am trying to build a custom plugin to force some jobs to be pended.
In the official documentation, `ESLURM*` errors are only valid for `job_submit_lua`.
I tried to send `ESLURM_JOB_PENDING`, but it only rejects the job submission.
Does anyone know how to pend a job from a job_submit plugin?
Thanks.
Hello,
we found an issue with Slurm 24.05.1 and the MaxMemPerNode
setting. Slurm is installed on a single workstation, so the
number of nodes is just 1.
The relevant sections in slurm.conf read:
,----
| EnforcePartLimits=ALL
| PartitionName=short Nodes=..... State=UP Default=YES MaxTime=2-00:00:00 MaxCPUsPerNode=76 MaxMemPerNode=231000 OverSubscribe=FORCE:1
`----
Now, if I submit a job requesting 76 CPUs and each one needing 4000M
(for a total of 304000M), Slurm does indeed respect the MaxMemPerNode
setting and the job is not submitted in the following cases ("-N 1" is
not really necessary, as there is only one node):
,----
| $ sbatch -N 1 -n 1 -c 76 -p short --mem-per-cpu=4000M test.batch
| sbatch: error: Batch job submission failed: Memory required by task is not available
|
| $ sbatch -N 1 -n 76 -c 1 -p short --mem-per-cpu=4000M test.batch
| sbatch: error: Batch job submission failed: Memory required by task is not available
|
| $ sbatch -n 1 -c 76 -p short --mem-per-cpu=4000M test.batch
| sbatch: error: Batch job submission failed: Memory required by task is not available
`----
But with this submission Slurm is happy:
,----
| $ sbatch -n 76 -c 1 -p short --mem-per-cpu=4000M test.batch
| Submitted batch job 133982
`----
and the slurmjobcomp.log file does indeed tell me that the memory went
above MaxMemPerNode:
,----
| JobId=133982 UserId=......(10487) GroupId=domain users(2000) Name=test JobState=CANCELLED Partition=short TimeLimit=45 StartTime=2024-09-04T09:11:17 EndTime=2024-09-04T09:11:24 NodeList=...... NodeCnt=1 ProcCnt=76 WorkDir=/tmp/. ReservationName= Tres=cpu=76,mem=304000M,node=1,billing=76 Account=ddgroup QOS=domino WcKey= Cluster=...... SubmitTime=2024-09-04T09:11:17 EligibleTime=2024-09-04T09:11:17 DerivedExitCode=0:0 ExitCode=0:0
`----
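(Note that this last submission requests exactly the same total memory as the rejected ones: 76 tasks × 4000M = 304000M, well above MaxMemPerNode=231000M.)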
What is the best way to report issues like this to the Slurm developers?
I thought of adding it to https://support.schedmd.com/, but it is not
clear to me whether that page is only meant for Slurm users with a
support contract.
Cheers,
--
Ángel de Vicente
Research Software Engineer (Supercomputing and BigData)
Instituto de Astrofísica de Canarias (https://www.iac.es/en)