Hi all,
I'm trying to pull (and understand) some GPU usage metrics for historical purposes, and dug into sacct's TRES reporting a bit. We have AccountingStorageTRES=gres/gpu set in slurm.conf, so we do see gres/gpuutil and gres/gpumem numbers available, but I'm struggling to find Slurm-side documentation that describes the units of these values. Looking at the code for gpu_nvml.c, it seems the "nvmlDeviceGetProcessUtilization" function is being used and returns units in percentages, but I'm lost on the rest of the calculation.
Does anyone know if these units are percentages, and how they are calculated for the final job record, especially with respect to multi-GPU jobs with a bunch of processes/moving parts? For context, I've been looking at TRESUsageInTot and TRESUsageInAve so far. Also, we're currently running Slurm 23.02.6.
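For reference, a minimal sacct invocation that surfaces those per-step counters (the job ID is a placeholder; the field names are the ones listed by sacct --helpformat) looks like:
```
sacct -j <jobid> --format=JobID,AllocTRES%60,TRESUsageInAve%80,TRESUsageInTot%80
```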
Thanks in advance!
--
Jordan Robertson
Preferred pronouns: he/him/his
Technology Architect | Research Technology Services
DigITs, Technology Division
Memorial Sloan Kettering Cancer Center
929-687-1066
robertj8(a)mskcc.org
It's Friday and I'm either doing something silly or have a misconfig
somewhere; I can't figure out which.
When I run
sbatch --nodes=1 --cpus-per-task=1 --array=1-100 --output test_%A_%a.txt --wrap 'uname -n'
sbatch doesn't seem to be adhering to the --nodes param. When I look
at my output files, it's spreading them across more nodes. In the
simple case above it's 50/50, but if I throw a random sleep in,
it'll be more, and if I expand the array it'll use even more nodes.
I'm using cons_tres and have cr_core_memory,cr_one_core_per_task set.
Hi Sean,
I appear to be having the same issue that you are having with OCI
container jobs running forever / appearing to hang. I haven't figured
it out yet, but perhaps we can compare notes and determine what aspect
of configuration we both share.
Like you, I was following the examples in
https://slurm.schedmd.com/containers.html and originally encountered
the issue with an alpine container image running the `uptime` command,
but I have also confirmed the issue with other images including ubuntu
and with other processes. I always get the same results - the
container process runs to completion and exits, but then the slurm job
continues to run until it is cancelled or killed.
I have Slurm v23.11.6 and am using the nvidia-container-runtime; what
Slurm version and runtime are you using?
My oci.conf is:
```
$ cat /etc/slurm/oci.conf
EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
RunTimeQuery="nvidia-container-runtime --rootless=true --root=/run/user/%U/ state %n.%u.%j.%s.%t"
RunTimeKill="nvidia-container-runtime --rootless=true --root=/run/user/%U/ kill -a %n.%u.%j.%s.%t"
RunTimeDelete="nvidia-container-runtime --rootless=true --root=/run/user/%U/ delete --force %n.%u.%j.%s.%t"
RunTimeRun="nvidia-container-runtime --rootless=true --root=/run/user/%U/ run %n.%u.%j.%s.%t -b %b"
```
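As a debugging aid while comparing notes, the RunTimeQuery command above can also be run by hand against a wedged job; everything after `state` below is a placeholder standing in for the %n.%u.%j.%s.%t tokens:
```
# placeholder identifiers: node=node01, user=jdoe, job=12345, step=0, task=0
nvidia-container-runtime --rootless=true --root=/run/user/$(id -u)/ \
  state node01.jdoe.12345.0.0
```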
Hope that we can get to the bottom of this and resolve our issues with
OCI containers!
Josh.
---
Hello. I am new to this list and Slurm overall. I have a lot of
experience in computer operations, including Kubernetes, but I am
currently exploring Slurm in some depth.
I have set up a small cluster and, in general, have gotten things
working, but when I try to run a container job, it runs the command
but then appears to hang as if the job container is still running.
So, running the following works, but it never returns to the prompt
unless I use [Control-C].
$ srun --container /shared_fs/shared/oci_images/alpine uptime
19:21:47 up 20:43, 0 users, load average: 0.03, 0.25, 0.15
I'm unsure if something is misconfigured or if I'm misunderstanding
how this should work, but any help and/or pointers would be greatly
appreciated.
Thanks!
Sean
--
Dr. Joshua C. Randall
Principal Software Engineer
Altos Labs
email: jrandall(a)altoslabs.com
Hi everyone
I am trying to get slurmdbd to run on my local home server but I am really
struggling.
Note: I am a novice Slurm user.
My slurmdbd always times out even though all the details in the conf file
are correct.
My log looks like this
[2024-05-29T20:51:30.088] Accounting storage MYSQL plugin loaded
[2024-05-29T20:51:30.088] debug2: ArchiveDir = /tmp
[2024-05-29T20:51:30.088] debug2: ArchiveScript = (null)
[2024-05-29T20:51:30.088] debug2: AuthAltTypes = (null)
[2024-05-29T20:51:30.088] debug2: AuthInfo = (null)
[2024-05-29T20:51:30.088] debug2: AuthType = auth/munge
[2024-05-29T20:51:30.088] debug2: CommitDelay = 0
[2024-05-29T20:51:30.088] debug2: DbdAddr = localhost
[2024-05-29T20:51:30.088] debug2: DbdBackupHost = (null)
[2024-05-29T20:51:30.088] debug2: DbdHost = head-node
[2024-05-29T20:51:30.088] debug2: DbdPort = 7032
[2024-05-29T20:51:30.088] debug2: DebugFlags = (null)
[2024-05-29T20:51:30.088] debug2: DebugLevel = 6
[2024-05-29T20:51:30.088] debug2: DebugLevelSyslog = 10
[2024-05-29T20:51:30.088] debug2: DefaultQOS = (null)
[2024-05-29T20:51:30.088] debug2: LogFile = /var/log/slurmdbd.log
[2024-05-29T20:51:30.088] debug2: MessageTimeout = 100
[2024-05-29T20:51:30.088] debug2: Parameters = (null)
[2024-05-29T20:51:30.088] debug2: PidFile = /run/slurmdbd.pid
[2024-05-29T20:51:30.088] debug2: PluginDir =
/usr/lib/x86_64-linux-gnu/slurm-wlm
[2024-05-29T20:51:30.088] debug2: PrivateData = none
[2024-05-29T20:51:30.088] debug2: PurgeEventAfter = 1 months*
[2024-05-29T20:51:30.088] debug2: PurgeJobAfter = 12 months*
[2024-05-29T20:51:30.088] debug2: PurgeResvAfter = 1 months*
[2024-05-29T20:51:30.088] debug2: PurgeStepAfter = 1 months
[2024-05-29T20:51:30.088] debug2: PurgeSuspendAfter = 1 months
[2024-05-29T20:51:30.088] debug2: PurgeTXNAfter = 12 months
[2024-05-29T20:51:30.088] debug2: PurgeUsageAfter = 24 months
[2024-05-29T20:51:30.088] debug2: SlurmUser = root(0)
[2024-05-29T20:51:30.089] debug2: StorageBackupHost = (null)
[2024-05-29T20:51:30.089] debug2: StorageHost = localhost
[2024-05-29T20:51:30.089] debug2: StorageLoc = slurm_acct_db
[2024-05-29T20:51:30.089] debug2: StoragePort = 3306
[2024-05-29T20:51:30.089] debug2: StorageType = accounting_storage/mysql
[2024-05-29T20:51:30.089] debug2: StorageUser = slurm
[2024-05-29T20:51:30.089] debug2: TCPTimeout = 2
[2024-05-29T20:51:30.089] debug2: TrackWCKey = 0
[2024-05-29T20:51:30.089] debug2: TrackSlurmctldDown= 0
[2024-05-29T20:51:30.089] debug2: acct_storage_p_get_connection: request
new connection 1
[2024-05-29T20:51:30.089] debug2: Attempting to connect to localhost:3306
[2024-05-29T20:51:30.090] slurmdbd version 19.05.5 started
[2024-05-29T20:51:30.090] debug2: running rollup at Wed May 29 20:51:30
2024
[2024-05-29T20:51:30.091] debug2: Everything rolled up
[2024-05-29T20:51:49.673] Terminate signal (SIGINT or SIGTERM) received
[2024-05-29T20:51:49.673] debug: rpc_mgr shutting down
My config file looks like this:
ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes
ArchiveSteps=no
ArchiveSuspend=no
ArchiveTXN=no
ArchiveUsage=no
PurgeEventAfter=1month
PurgeJobAfter=12month
PurgeResvAfter=1month
PurgeStepAfter=1month
PurgeSuspendAfter=1month
PurgeTXNAfter=12month
PurgeUsageAfter=24month
# Authentication info
AuthType=auth/munge
# slurmDBD info
DbdAddr=localhost
DbdHost=head-node
DbdPort=7032
SlurmUser=root
MessageTimeout=100
DebugLevel=5
#DefaultQOS=normal,standby
LogFile=/var/log/slurmdbd.log
PidFile=/run/slurmdbd.pid
#PrivateData=accounts,users,usage,jobs
#TrackWCKey=yes
#
# Database info
StorageType=accounting_storage/mysql
StorageHost=localhost
StoragePort=3306
StoragePass=slurmdbpass
StorageUser=slurm
StorageLoc=slurm_acct_db
I used standard names and passwords to get started, and I will change them
later, but every time I try to start slurmdbd.service it crashes and I get
the log that I shared with you.
I use these versions
slurmdbd -V
slurm-wlm 19.05.5
mysql Ver 15.1 Distrib 10.3.39-MariaDB, for debian-linux-gnu (x86_64) using
readline 5.2
Everything else is working properly except that I cannot get slurmdbd to work,
and at this point I have exhausted all my possible trials :) looking for some
expert insights :)
Any idea what I am doing wrong here? Also, I didn't compile any Slurm
package; I used the binaries from the apt repos.
Any help will be appreciated
Cheers
Rad
Manually running it through sudo slurmdbd -D /path/to/conf is very quick on
my fresh install.
Trying to start slurmdbd through systemctl takes 3 minutes and then it
crashes and fails.
Is there an alternative to systemctl for starting slurmdbd in the
background?
But most importantly, I wanted to know why it takes so long through
systemctl. Maybe I can increase the timeout limit?
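If the eventual cause is genuinely slow startup rather than a crash, one option (sketched here with an arbitrary 10-minute value) is a systemd drop-in that raises the unit's start timeout:
```
# opens an editor and creates an override for the unit
sudo systemctl edit slurmdbd
# contents of the override:
[Service]
TimeoutStartSec=600
# then try the service again
sudo systemctl restart slurmdbd
```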
On Thu, May 30, 2024 at 11:54 PM Ryan Novosielski <novosirj(a)rutgers.edu>
wrote:
> It may take longer to start than systemd allows for. How long does it take
> to start from the command line? It’s common to need to run it manually for
> upgrades to complete.
>
> --
> #BlackLivesMatter
> ____
> || \\UTGERS, |---------------------------*O*---------------------------
> ||_// the State | Ryan Novosielski - novosirj(a)rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> || \\ of NJ | Office of Advanced Research Computing - MSB
> A555B, Newark
> `'
>
> On May 30, 2024, at 20:24, Radhouane Aniba via slurm-users <
> slurm-users(a)lists.schedmd.com> wrote:
>
> Ok I made some progress here.
>
> I removed and purged slurmdbd mysql mariadb etc .. and started from
> scratch.
> I added the recommended mysqld requirements
>
> Started slurmdbd manually : sudo slurmdbd -D /path/to/conf and everything
> worked well
>
> When I tried to start the service with sudo systemctl start slurmdbd.service,
> it didn't work.
>
> sudo systemctl status slurmdbd.service
> ● slurmdbd.service - Slurm DBD accounting daemon
> Loaded: loaded (/etc/systemd/system/slurmdbd.service; enabled; vendor
> preset: enabled)
> Active: failed (Result: timeout) since Fri 2024-05-31 00:21:30 UTC;
> 2min 5s ago
> Process: 6258 ExecStart=/usr/sbin/slurmdbd -D
> /etc/slurm-llnl/slurmdbd.conf (code=exited, status=0/SUCCESS)
>
> May 31 00:20:00 hannibal-hn systemd[1]: Starting Slurm DBD accounting
> daemon...
> May 31 00:21:30 hannibal-hn systemd[1]: slurmdbd.service: start operation
> timed out. Terminating.
> May 31 00:21:30 hannibal-hn systemd[1]: slurmdbd.service: Failed with
> result 'timeout'.
> May 31 00:21:30 hannibal-hn systemd[1]: Failed to start Slurm DBD
> accounting daemon.
>
> Even though it is the same command?!
>
> Any idea ?
>
>
> On Thu, May 30, 2024 at 5:02 PM Radhouane Aniba <aradwen(a)gmail.com> wrote:
>
>> Thank you Ahmet and Brian,
>>
>> Ahmet, which conf in particular is slurmdbd reading from? I parsed all
>> the cnf files for mysql and I cannot find the data it is displaying here:
>>
>> slurmdbd: debug2: Attempting to connect to localhost:3306
>> slurmdbd: debug2: innodb_buffer_pool_size: 134217728
>> slurmdbd: debug2: innodb_log_file_size: 50331648
>> slurmdbd: debug2: innodb_lock_wait_timeout: 50
>> slurmdbd: error: Database settings not recommended values:
>> innodb_buffer_pool_size innodb_lock_wait_timeout
>>
>>
>> sudo tree /etc/mysql/*
>> /etc/mysql/conf.d
>> ├── mysql.cnf
>> └── mysqldump.cnf
>> /etc/mysql/debian.cnf
>> /etc/mysql/debian-start
>> /etc/mysql/FROZEN
>> /etc/mysql/mariadb.cnf
>> /etc/mysql/mariadb.conf.d
>> ├── 50-client.cnf
>> ├── 50-mysql-clients.cnf
>> ├── 50-mysqld_safe.cnf
>> └── 50-server.cnf
>> /etc/mysql/my.cnf
>> /etc/mysql/my.cnf.fallback
>> /etc/mysql/mysql.cnf
>> /etc/mysql/mysql.conf.d
>> ├── mysql.cnf
>> └── mysqld.cnf
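The values slurmdbd is printing (a 128 MB buffer pool and a 50-second lock wait timeout) are MariaDB's built-in defaults, which is why they don't show up in any of the .cnf files above; they are server settings rather than anything in a Slurm file. A typical server-side drop-in looks roughly like the sketch below (the file name is hypothetical and the sizes are only illustrative; the Slurm accounting documentation lists the values slurmdbd checks for), followed by a restart of MariaDB:
```
# hypothetical file: /etc/mysql/mariadb.conf.d/60-slurmdbd-innodb.cnf
[mysqld]
innodb_buffer_pool_size=4096M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900
```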
>>
>> On Thu, May 30, 2024 at 12:21 PM Brian Andrus via slurm-users <
>> slurm-users(a)lists.schedmd.com> wrote:
>>
>>> That SIGTERM message means something is telling slurmdbd to quit.
>>>
>>> Check your cron jobs, maintenance scripts, etc. Slurmdbd is being told
>>> to shutdown. If you are running in the foreground, a ^C does that. If you
>>> run a kill or killall on it, you will get that same message.
>>>
>>> Brian Andrus
>>> On 5/30/2024 6:53 AM, Radhouane Aniba via slurm-users wrote:
>>>
>>> Yes, I can connect to my database using mysql --user=slurm
>>> --password=slurmdbpass slurm_acct_db, and after checking the firewall
>>> question, there is no firewall blocking mysql.
>>>
>>> Also, here is the output of slurmdbd -D -vvv (note I can only run this as
>>> sudo):
>>>
>>> sudo slurmdbd -D -vvv
>>> slurmdbd: debug: Log file re-opened
>>> slurmdbd: debug: Munge authentication plugin loaded
>>> slurmdbd: debug2: mysql_connect() called for db slurm_acct_db
>>> slurmdbd: debug2: Attempting to connect to localhost:3306
>>> slurmdbd: debug2: innodb_buffer_pool_size: 134217728
>>> slurmdbd: debug2: innodb_log_file_size: 50331648
>>> slurmdbd: debug2: innodb_lock_wait_timeout: 50
>>> slurmdbd: error: Database settings not recommended values:
>>> innodb_buffer_pool_size innodb_lock_wait_timeout
>>> slurmdbd: Accounting storage MYSQL plugin loaded
>>> slurmdbd: debug2: ArchiveDir = /tmp
>>> slurmdbd: debug2: ArchiveScript = (null)
>>> slurmdbd: debug2: AuthAltTypes = (null)
>>> slurmdbd: debug2: AuthInfo = (null)
>>> slurmdbd: debug2: AuthType = auth/munge
>>> slurmdbd: debug2: CommitDelay = 0
>>> slurmdbd: debug2: DbdAddr = localhost
>>> slurmdbd: debug2: DbdBackupHost = (null)
>>> slurmdbd: debug2: DbdHost = hannibal-hn
>>> slurmdbd: debug2: DbdPort = 7032
>>> slurmdbd: debug2: DebugFlags = (null)
>>> slurmdbd: debug2: DebugLevel = 6
>>> slurmdbd: debug2: DebugLevelSyslog = 10
>>> slurmdbd: debug2: DefaultQOS = (null)
>>> slurmdbd: debug2: LogFile = /var/log/slurmdbd.log
>>> slurmdbd: debug2: MessageTimeout = 100
>>> slurmdbd: debug2: Parameters = (null)
>>> slurmdbd: debug2: PidFile = /run/slurmdbd.pid
>>> slurmdbd: debug2: PluginDir = /usr/lib/x86_64-linux-gnu/slurm-wlm
>>> slurmdbd: debug2: PrivateData = none
>>> slurmdbd: debug2: PurgeEventAfter = 1 months*
>>> slurmdbd: debug2: PurgeJobAfter = 12 months*
>>> slurmdbd: debug2: PurgeResvAfter = 1 months*
>>> slurmdbd: debug2: PurgeStepAfter = 1 months
>>> slurmdbd: debug2: PurgeSuspendAfter = 1 months
>>> slurmdbd: debug2: PurgeTXNAfter = 12 months
>>> slurmdbd: debug2: PurgeUsageAfter = 24 months
>>> slurmdbd: debug2: SlurmUser = root(0)
>>> slurmdbd: debug2: StorageBackupHost = (null)
>>> slurmdbd: debug2: StorageHost = localhost
>>> slurmdbd: debug2: StorageLoc = slurm_acct_db
>>> slurmdbd: debug2: StoragePort = 3306
>>> slurmdbd: debug2: StorageType = accounting_storage/mysql
>>> slurmdbd: debug2: StorageUser = slurm
>>> slurmdbd: debug2: TCPTimeout = 2
>>> slurmdbd: debug2: TrackWCKey = 0
>>> slurmdbd: debug2: TrackSlurmctldDown= 0
>>> slurmdbd: debug2: acct_storage_p_get_connection: request new connection
>>> 1
>>> slurmdbd: debug2: Attempting to connect to localhost:3306
>>> slurmdbd: slurmdbd version 19.05.5 started
>>> slurmdbd: debug2: running rollup at Thu May 30 13:50:08 2024
>>> slurmdbd: debug2: Everything rolled up
>>>
>>>
>>> It goes like this for some time and then it crashes with this message
>>>
>>> slurmdbd: Terminate signal (SIGINT or SIGTERM) received
>>> slurmdbd: debug: rpc_mgr shutting down
>>>
>>>
>>> On Thu, May 30, 2024 at 8:18 AM mercan <ahmet.mercan(a)uhem.itu.edu.tr>
>>> wrote:
>>>
>>>> Did you try to connect to the database using the mysql command?
>>>>
>>>> mysql --user=slurm --password=slurmdbpass slurm_acct_db
>>>>
>>>> C. Ahmet Mercan
>>>>
>>>> On 30.05.2024 14:48, Radhouane Aniba via slurm-users wrote:
>>>>
>>>> Thank you Ahmet,
>>>> I don't have a firewall active.
>>>> And because slurmdbd cannot connect to the database, I am not able to
>>>> get it activated through systemctl. I will share the output of
>>>> slurmdbd -D -vvv shortly, but overall it keeps saying it is trying to
>>>> connect to the db, then retries a couple of times and crashes.
>>>>
>>>> R.
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, May 30, 2024 at 2:51 AM mercan <ahmet.mercan(a)uhem.itu.edu.tr>
>>>> wrote:
>>>>
>>>>> Hi;
>>>>>
>>>>> Did you check whether you can connect to the db with your conf parameters
>>>>> from the head-node:
>>>>>
>>>>> mysql --user=slurm --password=slurmdbpass slurm_acct_db
>>>>>
>>>>> Also, check and stop firewall and selinux, if they are running.
>>>>>
>>>>> Last, you can stop slurmdbd, then run it in a terminal with:
>>>>>
>>>>> slurmdbd -D -vvv
>>>>>
>>>>> Regards;
>>>>>
>>>>> C. Ahmet Mercan
>>>>>
>>>>> On 30.05.2024 00:05, Radhouane Aniba via slurm-users wrote:
>>>>>
>>>>> Hi everyone
>>>>> I am trying to get slurmdbd to run on my local home server but I am
>>>>> really struggling.
>>>>> Note : am a novice slurm user
>>>>> my slurmdbd always times out even though all the details in the conf
>>>>> file are correct
>>>>>
>>>>> My log looks like this
>>>>>
>>>>> [2024-05-29T20:51:30.088] Accounting storage MYSQL plugin loaded
>>>>> [2024-05-29T20:51:30.088] debug2: ArchiveDir = /tmp
>>>>> [2024-05-29T20:51:30.088] debug2: ArchiveScript = (null)
>>>>> [2024-05-29T20:51:30.088] debug2: AuthAltTypes = (null)
>>>>> [2024-05-29T20:51:30.088] debug2: AuthInfo = (null)
>>>>> [2024-05-29T20:51:30.088] debug2: AuthType = auth/munge
>>>>> [2024-05-29T20:51:30.088] debug2: CommitDelay = 0
>>>>> [2024-05-29T20:51:30.088] debug2: DbdAddr = localhost
>>>>> [2024-05-29T20:51:30.088] debug2: DbdBackupHost = (null)
>>>>> [2024-05-29T20:51:30.088] debug2: DbdHost = head-node
>>>>> [2024-05-29T20:51:30.088] debug2: DbdPort = 7032
>>>>> [2024-05-29T20:51:30.088] debug2: DebugFlags = (null)
>>>>> [2024-05-29T20:51:30.088] debug2: DebugLevel = 6
>>>>> [2024-05-29T20:51:30.088] debug2: DebugLevelSyslog = 10
>>>>> [2024-05-29T20:51:30.088] debug2: DefaultQOS = (null)
>>>>> [2024-05-29T20:51:30.088] debug2: LogFile = /var/log/slurmdbd.log
>>>>> [2024-05-29T20:51:30.088] debug2: MessageTimeout = 100
>>>>> [2024-05-29T20:51:30.088] debug2: Parameters = (null)
>>>>> [2024-05-29T20:51:30.088] debug2: PidFile = /run/slurmdbd.pid
>>>>> [2024-05-29T20:51:30.088] debug2: PluginDir =
>>>>> /usr/lib/x86_64-linux-gnu/slurm-wlm
>>>>> [2024-05-29T20:51:30.088] debug2: PrivateData = none
>>>>> [2024-05-29T20:51:30.088] debug2: PurgeEventAfter = 1 months*
>>>>> [2024-05-29T20:51:30.088] debug2: PurgeJobAfter = 12 months*
>>>>> [2024-05-29T20:51:30.088] debug2: PurgeResvAfter = 1 months*
>>>>> [2024-05-29T20:51:30.088] debug2: PurgeStepAfter = 1 months
>>>>> [2024-05-29T20:51:30.088] debug2: PurgeSuspendAfter = 1 months
>>>>> [2024-05-29T20:51:30.088] debug2: PurgeTXNAfter = 12 months
>>>>> [2024-05-29T20:51:30.088] debug2: PurgeUsageAfter = 24 months
>>>>> [2024-05-29T20:51:30.088] debug2: SlurmUser = root(0)
>>>>> [2024-05-29T20:51:30.089] debug2: StorageBackupHost = (null)
>>>>> [2024-05-29T20:51:30.089] debug2: StorageHost = localhost
>>>>> [2024-05-29T20:51:30.089] debug2: StorageLoc = slurm_acct_db
>>>>> [2024-05-29T20:51:30.089] debug2: StoragePort = 3306
>>>>> [2024-05-29T20:51:30.089] debug2: StorageType =
>>>>> accounting_storage/mysql
>>>>> [2024-05-29T20:51:30.089] debug2: StorageUser = slurm
>>>>> [2024-05-29T20:51:30.089] debug2: TCPTimeout = 2
>>>>> [2024-05-29T20:51:30.089] debug2: TrackWCKey = 0
>>>>> [2024-05-29T20:51:30.089] debug2: TrackSlurmctldDown= 0
>>>>> [2024-05-29T20:51:30.089] debug2: acct_storage_p_get_connection:
>>>>> request new connection 1
>>>>> [2024-05-29T20:51:30.089] debug2: Attempting to connect to
>>>>> localhost:3306
>>>>> [2024-05-29T20:51:30.090] slurmdbd version 19.05.5 started
>>>>> [2024-05-29T20:51:30.090] debug2: running rollup at Wed May 29
>>>>> 20:51:30 2024
>>>>> [2024-05-29T20:51:30.091] debug2: Everything rolled up
>>>>> [2024-05-29T20:51:49.673] Terminate signal (SIGINT or SIGTERM)
>>>>> received
>>>>> [2024-05-29T20:51:49.673] debug: rpc_mgr shutting down
>>>>>
>>>>>
>>>>>
>>>>> my config file looks like this
>>>>>
>>>>> ArchiveEvents=yes
>>>>> ArchiveJobs=yes
>>>>> ArchiveResvs=yes
>>>>> ArchiveSteps=no
>>>>> ArchiveSuspend=no
>>>>> ArchiveTXN=no
>>>>> ArchiveUsage=no
>>>>> PurgeEventAfter=1month
>>>>> PurgeJobAfter=12month
>>>>> PurgeResvAfter=1month
>>>>> PurgeStepAfter=1month
>>>>> PurgeSuspendAfter=1month
>>>>> PurgeTXNAfter=12month
>>>>> PurgeUsageAfter=24month
>>>>> # Authentication info
>>>>> AuthType=auth/munge
>>>>> # slurmDBD info
>>>>> DbdAddr=localhost
>>>>> DbdHost=head-node
>>>>> DbdPort=7032
>>>>> SlurmUser=root
>>>>> MessageTimeout=100
>>>>> DebugLevel=5
>>>>> #DefaultQOS=normal,standby
>>>>> LogFile=/var/log/slurmdbd.log
>>>>> PidFile=/run/slurmdbd.pid
>>>>> #PrivateData=accounts,users,usage,jobs
>>>>> #TrackWCKey=yes
>>>>> #
>>>>> # Database info
>>>>> StorageType=accounting_storage/mysql
>>>>> StorageHost=localhost
>>>>> StoragePort=3306
>>>>> StoragePass=slurmdbpass
>>>>> StorageUser=slurm
>>>>> StorageLoc=slurm_acct_db
>>>>> I used standard names and passwords to get started and I will change
>>>>> later
>>>>>
>>>>> but everytime I try to start slurmdbd.service it crashes and I have
>>>>> that log that I shared with you
>>>>>
>>>>> I use these versions
>>>>>
>>>>> slurmdbd -V
>>>>> slurm-wlm 19.05.5
>>>>> mysql Ver 15.1 Distrib 10.3.39-MariaDB, for debian-linux-gnu (x86_64)
>>>>> using readline 5.2
>>>>> Everything else Is working properly except I cannot get slurmdbd to
>>>>> work and at this point I exhausted all my possible trials :) looking for
>>>>> some expert insights :)
>>>>>
>>>>>
>>>>> Any idea what I am doing wrong here ? Also I didn't compile any slurm
>>>>> package. I used the binary from apt repos
>>>>>
>>>>> Any help will be appreciated
>>>>>
>>>>> Cheers
>>>>>
>>>>> Rad
>>>>>
>>>>> --
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>> --
>>> *Rad Aniba, PhD*
>>>
>>>
>>>
>>> --
>>> slurm-users mailing list -- slurm-users(a)lists.schedmd.com
>>> To unsubscribe send an email to slurm-users-leave(a)lists.schedmd.com
>>>
>>
>>
>> --
>> *Rad Aniba, PhD*
>>
>>
>
> --
> *Rad Aniba, PhD*
>
>
> --
> slurm-users mailing list -- slurm-users(a)lists.schedmd.com
> To unsubscribe send an email to slurm-users-leave(a)lists.schedmd.com
>
>
>
--
*Rad Aniba, PhD*
We are pleased to announce the availability of Slurm 24.05.0.
To highlight some new features in 24.05:
- Isolated Job Step management. Enabled on a job-by-job basis with the
--stepmgr option, or globally through SlurmctldParameters=enable_stepmgr.
- Federation - Allow for client command operation while SlurmDBD is
unavailable.
- New MaxTRESRunMinsPerAccount and MaxTRESRunMinsPerUser QOS limits.
- New USER_DELETE reservation flag.
- New Flags=rebootless option on Features for node_features/helpers,
which indicates the given feature can be enabled without rebooting the node.
- Cloud power management options: New "max_powered_nodes=<limit>" option
in SlurmctldParameters, and new SuspendExcNodes=<nodes>:<count> syntax
allowing for <count> nodes out of a given node list to be excluded (see the
sketch after this list).
- StdIn/StdOut/StdErr now stored in SlurmDBD accounting records for
batch jobs.
- New switch/nvidia_imex plugin for IMEX channel management on NVIDIA
systems.
- New RestrictedCoresPerGPU option at the Node level, designed to ensure
GPU workloads always have access to a certain number of CPUs even when
nodes are running non-GPU workloads concurrently.
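As a minimal slurm.conf sketch of the new cloud power management syntax from the list above (the node names, the count, and the limit are made-up values):
```
# keep 2 of the nodes in cloud[1-10] excluded from suspend, i.e. powered up
SuspendExcNodes=cloud[1-10]:2
# cap the number of powered-up nodes the controller will allow
SlurmctldParameters=max_powered_nodes=50
```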
The Slurm documentation has also been updated to the 24.05 release.
(Older versions can be found in the archive, linked from the main
documentation page.)
Slurm can be downloaded from https://www.schedmd.com/downloads.php .
--
Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support
Hi there, SLURM community,
I swear I've done this before, but now it's failing on a new cluster I'm
deploying. We have 6 compute nodes with 64 CPUs each (384 CPUs total). When I
run `srun -n 500 hostname`, the job gets queued since there aren't 500
available CPUs.
Wasn't there an option that allows for this to be run where the first 384
tasks execute, and then the remaining execute when resources free up?
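One hedged sketch of a way to get the rolling behaviour described, using a job array with a throttle instead of a single 500-task step (whether this is the option being remembered is only a guess):
```
# illustrative: 500 independent tasks, at most 384 running at any one time
sbatch --array=1-500%384 --output=hostname_%A_%a.txt --wrap 'hostname'
```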
Here's my conf:
# Slurm Cgroup Configs used on controllers and workers
slurm_cgroup_config:
  CgroupAutomount: yes
  ConstrainCores: yes
  ConstrainRAMSpace: yes
  ConstrainSwapSpace: yes
  ConstrainDevices: yes
# Slurm conf file settings
slurm_config:
  AccountingStorageType: "accounting_storage/slurmdbd"
  AccountingStorageEnforce: "limits"
  AuthAltTypes: "auth/jwt"
  ClusterName: "cluster"
  AccountingStorageHost: "{{ hostvars[groups['controller'][0]].ansible_hostname }}"
  DefMemPerCPU: 1024
  InactiveLimit: 120
  JobAcctGatherType: "jobacct_gather/cgroup"
  JobCompType: "jobcomp/none"
  MailProg: "/usr/bin/mail"
  MaxArraySize: 40000
  MaxJobCount: 100000
  MinJobAge: 3600
  ProctrackType: "proctrack/cgroup"
  ReturnToService: 2
  SelectType: "select/cons_tres"
  SelectTypeParameters: "CR_Core_Memory"
  SlurmctldTimeout: 30
  SlurmctldLogFile: "/var/log/slurm/slurmctld.log"
  SlurmdLogFile: "/var/log/slurm/slurmd.log"
  SlurmdSpoolDir: "/var/spool/slurm/d"
  SlurmUser: "{{ slurm_user.name }}"
  SrunPortRange: "60000-61000"
  StateSaveLocation: "/var/spool/slurm/ctld"
  TaskPlugin: "task/affinity,task/cgroup"
  UnkillableStepTimeout: 120
--
Thanks,
Daniel Healy
Hi All,
I'm managing a cluster with Slurm, consisting of 4 nodes. One of the
compute nodes appears to be experiencing issues. While the front node's
'squeue' command indicates that jobs are running, upon connecting to the
problematic node, I observe no active processes and GPUs are not being
utilized.
[sushil@ccbrc ~]$ sinfo -Nel
Wed May 29 12:00:08 2024
NODELIST NODES PARTITION       STATE CPUS  S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
gag          1     defq*       mixed   48 2:24:1 370000        0      1 (null)   none
gag          1   glycore       mixed   48 2:24:1 370000        0      1 (null)   none
glyco1       1     defq* completing*  128 2:64:1 500000        0      1 (null)   none
glyco1       1   glycore completing*  128 2:64:1 500000        0      1 (null)   none
glyco2       1     defq*       mixed  128 2:64:1 500000        0      1 (null)   none
glyco2       1   glycore       mixed  128 2:64:1 500000        0      1 (null)   none
mannose      1     defq*       mixed   24 2:12:1 180000        0      1 (null)   none
mannose      1   glycore       mixed   24 2:12:1 180000        0      1 (null)   none
On glyco1 (affected node!):
squeue # gets stuck
sudo systemctl restart slurmd # gets stuck
I tried the following to clear the jobs stuck in CG state, but any new job
appears to be stuck in a 'running' state without actually running.
scontrol update nodename=glyco1 state=down reason=cg
scontrol update nodename=glyco1 state=resume reason=cg
There is no I/O issue on that node, and all file systems are under 30%
used. Any advice on how to resolve this without rebooting the machine?
Best,
Sushil
Hello all,
I’m sorry if this has been asked and answered before, but I couldn’t find anything related.
Does anyone know whether a framework of sorts exists that allows changing certain Slurm configuration parameters when certain conditions in the batch system's state are detected, and of course reverts them once the state changes back?
(To be more concrete: we would like to raise or unset MaxJobsPU to run as many small jobs as possible and allocate all nodes as soon as a certain threshold of free nodes is available, plus some other scenarios.)
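For concreteness, the kind of change such a framework would have to apply and later revert (the QOS name and limit values are placeholders) might look like:
```
# raise / effectively clear the per-user job limit while many nodes are free
sacctmgr -i modify qos normal set MaxJobsPerUser=-1
# restore the original limit once the free-node threshold is no longer met
sacctmgr -i modify qos normal set MaxJobsPerUser=100
```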
Many thanks in advance.
Cheers,
-Frank
Max-Planck-Institut
für Sonnensystemforschung
Justus-von-Liebig-Weg 3
D-37077 Göttingen
Phone: [+49] 551 – 384 979 320
E-Mail: heckes(a)mps.mpg.de
Hello,
*Background:*
I am working on a small cluster that is managed by Base Command Manager
v10.0 using Slurm 23.02.7 with Ubuntu 22.04.2. I have a small testing
script that simply consumes memory and processors.
When I run my test script, it consumes more memory than allocated by Slurm
and, as expected, it gets killed by the OOM killer. In /var/log/slurmd, I see
entries like:
[2024-05-29T08:53:04.975] Launching batch job 65 for UID 1001
[2024-05-29T08:53:05.016] [65.batch] task/cgroup: _memcg_initialize: job:
alloc=10868MB mem.limit=10868MB memsw.limit=10868MB job_swappiness=1
[2024-05-29T08:53:05.016] [65.batch] task/cgroup: _memcg_initialize: step:
alloc=10868MB mem.limit=10868MB memsw.limit=10868MB job_swappiness=1
[2024-05-29T08:53:19.530] [65.batch] task/cgroup:
task_cgroup_memory_check_oom: StepId=65.batch hit memory+swap limit at
least once during execution. This may or may not result in some failure.
[2024-05-29T08:53:19.563] [65.batch] done with job
Inspecting with sacct, I see:
$ sacct -j 65 --format="jobid,jobname,state,exitcode"
JobID JobName State ExitCode
------------ ---------- ---------- --------
65 wrap FAILED 9:0
65.batch batch FAILED 9:0
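As a hedged aside (not part of the original report), the same sacct record can also show whether accounting captured the memory high-water mark against the request, which helps confirm the kill was memory-related:
```
sacct -j 65 --format=JobID,State,ExitCode,ReqMem,MaxRSS,MaxVMSize
```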
Based on my previous experience with a RHEL based Slurm cluster, I would
expect the state to be listed as OUT_OF_MEMORY and the exitcode to be
0:125.
*Question:*
1. How do I configure [slurm,cgroup].conf such that when Slurm kills a job
due to exceeding the allocated memory, sacct reports the state as
OUT_OF_MEMORY?
See below for my configuration:
*Configuration:*
/etc/default/grub :
~# grep -v "^#" /etc/default/grub
GRUB_DEFAULT=0
GRUB_TIMEOUT_STYLE=hidden
GRUB_TIMEOUT=10
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT=""
GRUB_CMDLINE_LINUX="biosdevname=0 cgroup_enable=memory swapaccount=1"
GRUB_GFXMODE="1024x768,800x600,auto"
GRUB_BACKGROUND="/boot/grub/bcm.png"
slurm.conf :
~# grep -v "^#" /cm/shared/apps/slurm/var/etc/bcm10-slurm/slurm.conf
SlurmUser=slurm
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
SlurmdSpoolDir=/cm/local/apps/slurm/var/spool
SwitchType=switch/none
MpiDefault=pmix
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/cgroup
ReturnToService=2
TaskPlugin=task/cgroup
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=300
SlurmdTimeout=300
Waittime=0
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd
JobAcctGatherType=jobacct_gather/cgroup
JobAcctGatherFrequency=30
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
SlurmctldHost=bcm10-h01
AccountingStorageHost=master
NodeName=bcm10-n[01,02] Procs=4 CoresPerSocket=4 RealMemory=15988
SocketsPerBoard=1 ThreadsPerCore=1 Boards=1 MemSpecLimit=5120
Feature=location=local
PartitionName="defq" Default=YES MinNodes=1 DefaultTime=UNLIMITED
MaxTime=UNLIMITED AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1
OverSubscribe=NO PreemptMode=OFF AllowAccounts=ALL AllowQos=ALL
Nodes=bcm10-n[01,02]
ClusterName=bcm10-slurm
SchedulerType=sched/backfill
StateSaveLocation=/cm/shared/apps/slurm/var/cm/statesave/bcm10-slurm
PrologFlags=Alloc
GresTypes=gpu
Prolog=/cm/local/apps/cmd/scripts/prolog
Epilog=/cm/local/apps/cmd/scripts/epilog
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
cgroup.conf
~# grep -v "^#" /cm/shared/apps/slurm/var/etc/bcm10-slurm/cgroup.conf
CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=no
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
ConstrainDevices=yes
AllowedRamSpace=100.00
AllowedSwapSpace=0.00
MemorySwappiness=1
MaxRAMPercent=100.00
MaxSwapPercent=100.00
MinRAMSpace=30
Best regards,
Lee