Hello,
Just to add some context here. We plan to use Slurm to develop a scheduling solution which interacts with a backend system.
Now, the backend system has pieces of hardware which require a specific host in the allocation to be the primary/master host on which the initial task is launched; this in turn is driven by the job's placement orientation on the hardware itself.
So, our primary task should launch on the asked-for primary host, while secondary/remote tasks would subsequently get started on other hosts.
Hope this adds some context as to why a specific host needs to be the starting host.
Regards,
Bhaskar.
On Thursday 31 October, 2024 at 12:04:37 am IST, Laura Hild <lsh(a)jlab.org> wrote:
I think if you tell the list why you care which of the Nodes is BatchHost, they may be able to provide you with a better solution.
________________________________________
From: Bhaskar Chakraborty via slurm-users <slurm-users(a)lists.schedmd.com>
Sent: Wednesday, 30 October 2024 12:35
To: slurm-users(a)schedmd.com
Subject: [slurm-users] Change primary alloc node
Hi,
Is there a way to change/control the primary node (i.e. where the initial task starts) as part of a job's allocation?
For example, if a job requires 6 CPUs and its allocation is distributed over 3 hosts h1, h2 & h3, I find that it always starts the task on one particular
node (say h1) irrespective of how many slots were available on the hosts.
Can we somehow make Slurm use h2 as the primary node?
Is there any C API inside the select plugin which can do this trick if we were to control it through the configured select plugin?
Thanks.
-Bhaskar.
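For reference, the node Slurm picked as the starting host for a job can be inspected as shown below; and, if memory serves, sbatch's --batch option can restrict which node of the allocation hosts the batch script, assuming the desired host carries a distinguishing feature in slurm.conf. The job id and the feature name "primary" are only illustrative; the sbatch man page is worth checking for the exact semantics:

    # Show which node of the allocation hosts the batch script (job id is illustrative)
    scontrol show job 12345 | grep -E 'BatchHost|NodeList'

    # Ask for the batch script to start on a node tagged with feature "primary"
    # (assumes the admin has set Features=primary on h2 in slurm.conf)
    sbatch --nodes=3 --ntasks=6 --batch=primary job.sh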
I have set AllowAccounts=sunlabc5hpc,root, but it doesn’t seem to work. User c010637 is not part of the sunlabc5hpc account but is still able to use the sunlabc5hpc partition. I have tried setting EnforcePartLimits to ALL, ANY, and NO, but none of these options resolved the issue.
[c010637@sl-login ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
cpu* up infinite 3 mix sl-c[0035,0042-0043]
cpu* up infinite 1 idle sl-c0036
gpu up infinite 3 idle sl-c[0045-0047]
sunlabc5hpc up infinite 1 idle sl-c0048
[c010637@sl-login ~]$ scontrol show partition sunlabc5hpc
PartitionName=sunlabc5hpc
AllowGroups=ALL AllowAccounts=sunlabc5hpc,root AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
Nodes=sl-c0048
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=256 TotalNodes=1 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
TRES=cpu=256,mem=515000M,node=1,billing=256,gres/gpu=8
[c010637@sl-login ~]$ sacctmgr list assoc format=cluster,user,account%20,qos user=$USER
Cluster User Account QOS
---------- ---------- -------------------- --------------------
snowhpc c010637 c010637_bank normal
[c010637@sl-login ~]$ sacctmgr list account sunlabc5hpc
Account Descr Org
---------- -------------------- --------------------
sunlabc5h+ sunlabc5hpc sunlabc5hpc
[c010637@sl-login ~]$ sacctmgr show assoc where Account=sunlabc5hpc format=User,Account
User Account
---------- ----------
sunlabc5h+
c010751 sunlabc5h+
snowdai sunlabc5h+
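For reference, a few checks that may narrow this down, assuming accounting enforcement is involved; these are only the usual first things to look at, not a confirmed fix:

    # Which account are the offending jobs actually submitted under?
    squeue -u c010637 -o "%.10i %.9P %.12a %.10u %.8T"

    # Is the controller enforcing associations / partition limits?
    scontrol show config | grep -E 'AccountingStorageEnforce|EnforcePartLimits'

    # Make sure slurmctld has re-read slurm.conf after the partition change
    scontrol reconfigure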
Thanks for all your help. So it seems we can skip the trouble of compiling SLURM over different mariadb versions.
Tianyang Zhang
SJTU Network Information Center
From: Sid Young <sid.young(a)gmail.com>
Sent: 30 October 2024 7:19
To: Andrej Sec <andrej.sec(a)savba.sk>
Cc: taleintervenor(a)sjtu.edu.cn; slurm-users(a)lists.schedmd.com
Subject: Re: [slurm-users] Re: Fwd: What is the safe upgrade path when upgrade from slurm21.08 and mariadb5.5?
I recently upgraded from 20.11 to 24.05.2, before moving the cluster from CentOS 7.9 to Oracle Linux 8.10.
The DB upgrade should be pretty simple: do a mysqldump first, then uninstall the old DB, change the repos and install the new DB version. It should recognise the DB files on disk and access them. Do another DB backup on the new DB version, then roll through the Slurm upgrades.
I picked the first and last version of each release, and systematically went through each node till it was done: first the Slurm controller node, then the compute nodes. To avoid job loss, drain the nodes, or you end up with a situation where slurmd can't talk to the running slurmstepd and the job(s) get lost (shows as a "Protocol Error").
Ole sent me a link to this guide which mostly worked.
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrade-slurm…
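A rough sketch of the backup step described above, assuming the default accounting database name slurm_acct_db and a DB user with sufficient privileges; adjust to your environment:

    # Stop slurmdbd so the dump is consistent
    systemctl stop slurmdbd

    # Dump the accounting database before touching the DB packages
    mysqldump --single-transaction --databases slurm_acct_db > slurm_acct_db-$(date +%F).sql

    # ... uninstall the old MariaDB, switch repos, install the new version ...

    # Confirm the new server still sees the data, take a second dump, restart slurmdbd
    mysql -e 'USE slurm_acct_db; SHOW TABLES;'
    systemctl start slurmdbd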
Sid Young
W: https://off-grid-engineering.com
On Tue, Oct 29, 2024 at 6:33 PM Andrej Sec via slurm-users <slurm-users(a)lists.schedmd.com> wrote:
Hi,
we are facing a similar task. We have a Slurm 22.05 / MariaDB 5.5.68 environment and want to upgrade to a newer version. According to the documentation, it’s recommended to upgrade from 22.05 to a maximum of 23.11 in one step. With the MariaDB upgrade, there’s a challenge between 10.1 and 10.2+ due to incompatible changes (https://mariadb.com/kb/en/changes-improvements-in-mariadb-10-2). This upgrade, as I understand from the documentation, requires at least slurm 22.05, where it is automatically handled by the slurmdbd service.
In the test lab, we performed the following tests:
a. Incremental upgrade - according to MariaDB recommendations:
1. Upgrade MariaDB 5.5.68 -> 10.1.48 -> 10.2.44
2. Start the Slurm suite 22.05, checking content after each MariaDB upgrade step. During the 10.1 -> 10.2 upgrade, the slurmdbd service automatically converted the database to the required format. We had enabled general.log in MariaDB, allowing detailed inspection of database changes during conversion.
3. Upgrade slurmdbd to version 23.11
4. Upgrade slurmctld to version 23.11
5. Upgrade slurmd to version 23.11
6. Check the database content and compare tests before and after the upgrade (we used various reports with scontrol, sreport, sacct, sacctmgr for verification).
b. Direct MariaDB upgrade from 5.5.68 to 10.2.44 using the same approach. According to the tests, this resulted in the same state as the incremental approach.
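For what it is worth, a minimal sketch of the kind of before/after comparison mentioned in step 6; the date range is only an example:

    # Association / QOS layout before the upgrade
    sacctmgr -n show assoc format=cluster,account,user,qos > assoc-before.txt

    # Historical usage and job records for a fixed window should be unchanged
    sreport cluster utilization start=2024-01-01 end=2024-07-01 > util-before.txt
    sacct -a -X -S 2024-01-01 -E 2024-07-01 -o jobid,user,account,state,elapsed > jobs-before.txt

    # Repeat after the upgrade and diff the outputs
    diff assoc-before.txt assoc-after.txt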
PS: If you proceed with the upgrade, I would appreciate it if you could let us know about any potential challenges you encountered.
Andrej Sec
nscc, Bratislava, Slovakia
_____
Od: "hermes via slurm-users" <slurm-users(a)lists.schedmd.com <mailto:slurm-users@lists.schedmd.com> >
Komu: slurm-users(a)lists.schedmd.com <mailto:slurm-users@lists.schedmd.com>
Odoslané: pondelok, 28. október 2024 8:48:19
Predmet: [slurm-users] =?eucgb2312_cn?q?=D7=AA=B7=A2=3A_What_is_the_safe_upgrade_path_when_upgrade_from_slurm21=2E08_and_mariadb5=2E5=3F?=
Hi everyone:
We are currently running production workloads on SLURM 21.08 and mariadb5.5.
For the upgrade, we need to keep all the user and job history data, and we see the official documentation says:
“When upgrading an existing accounting database to MariaDB 10.2.1 or later from an older version of MariaDB or any version of MySQL, ensure you are running slurmdbd 22.05.7 or later. These versions will gracefully handle changes to MariaDB default values that can cause problems for slurmdbd.”
So does this mean we have to first build SLURM >22.05 against mariadb5.5 and do the SLURM upgrade, then upgrade MariaDB to a newer version and rebuild the same version of SLURM against the new mariadb-devel?
And is it safe to jump directly from mariadb5.5 to the latest version? How can we check whether SLURM has correctly inherited the historical data?
Thanks,
Tianyang Zhang
SJTU Network Information Center
--
slurm-users mailing list -- slurm-users(a)lists.schedmd.com
To unsubscribe send an email to slurm-users-leave(a)lists.schedmd.com
Hi,
Is there an option in Slurm to launch a custom script at the time of job submission through sbatch or salloc? The script should run with the submitting user's permissions in the submit area.
The idea is that we need to query something which characterises our job's requirements, like CPU slots, memory, etc., from a central server, and we need read access to the user area prior to that.
In our use case the user doesn't necessarily know beforehand what kind of resources his job needs. (Hence the need for such a script which will contact the server with user area info.)
Based on that we can modify the job a little later. A post-submit script, if available, would inform us of the Slurm job id as well; it would get called just after the job has entered the system and prior to its scheduling.
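In case it helps, one way to approximate this with existing tools, as a sketch: submit the job held, run a site-specific query script as the submitting user in the submit directory, adjust the pending job, then release it. The server_query.sh helper and the values it returns are hypothetical:

    # Submit the job in a held state so it cannot be scheduled yet
    jobid=$(sbatch --hold --parsable job.sh)

    # Hypothetical user-side script that asks the central server for this job's needs
    read cpus mem < <(./server_query.sh "$jobid")

    # Adjust the pending job, then let the scheduler consider it
    scontrol update JobId=$jobid NumCPUs=$cpus MinMemoryNode=$mem
    scontrol release $jobid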
Thanks,
Bhaskar.
Sent from Yahoo Mail for iPad
We have a 'gpu' partition with 30 or so nodes, some with A100s, some with
H100s, and a few others.
It appears that when (for example) all of the A100 GPUs are in use, if
there are additional jobs requesting A100 GPUs pending, and those jobs have
the highest priority in the partition, then jobs submitted for H100s won't
run even if there are idle H100s. This is a small subset of our present
pending queue- the four bottom jobs should be running, but aren't. The top
pending job shows reason 'Resources' while the rest all show 'Priority'.
Any thoughts on why this might be happening?
JOBID    PRIORITY  TRES_ALLOC
8317749  501490    cpu=4,mem=80000M,node=1,billing=48,gres/gpu=1,gres/gpu:a100=1
8317750  501490    cpu=4,mem=80000M,node=1,billing=48,gres/gpu=1,gres/gpu:a100=1
8317745  501490    cpu=4,mem=80000M,node=1,billing=48,gres/gpu=1,gres/gpu:a100=1
8317746  501490    cpu=4,mem=80000M,node=1,billing=48,gres/gpu=1,gres/gpu:a100=1
8338679  500060    cpu=4,mem=64G,node=1,billing=144,gres/gpu=1,gres/gpu:h100=1
8338678  500060    cpu=4,mem=64G,node=1,billing=144,gres/gpu=1,gres/gpu:h100=1
8338677  500060    cpu=4,mem=64G,node=1,billing=144,gres/gpu=1,gres/gpu:h100=1
8338676  500060    cpu=4,mem=64G,node=1,billing=144,gres/gpu=1,gres/gpu:h100=1
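For reference, a couple of things that may be worth looking at, assuming the backfill scheduler is in play; the squeue format string is only an example:

    # Reasons, priorities and time limits of the pending GPU jobs
    squeue -t PD -p gpu -o "%.10i %.10Q %.12r %.11l"

    # Backfill settings; a small bf_window / bf_max_job_* value or missing
    # job time limits can keep lower-priority jobs from being backfilled
    scontrol show config | grep -E 'SchedulerType|SchedulerParameters'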
Thanks,
Kevin
--
Kevin Hildebrand
University of Maryland
Division of IT
I am unable to limit the number of jobs per user per partition. I
have searched the internet, the forums and the Slurm documentation.
I created a partition with a QOS having MaxJobsPU=1 and MaxJobsPA=1,
and created a user stephen with account=stephen and MaxJobs=1.
However, if I sbatch a test job (sleep 180) multiple times, they all
run concurrently. I am at a loss as to what else to do. Help would be much
appreciated.
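For reference, a few checks that may narrow this down, assuming the limit is meant to come from the partition QOS:

    # Limits actually stored on the QOS
    sacctmgr show qos format=name,maxjobspu,maxjobspa

    # Is that QOS actually attached to the partition?
    scontrol show partition | grep -iE 'PartitionName|QoS'

    # QOS and association limits are only honoured if enforcement is enabled
    scontrol show config | grep AccountingStorageEnforce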
Thank you
--
Stephen Connolly
JSI Data Systems Ltd
613-727-9353
stephen(a)jsidata.ca
Hello everyone,
I’ve recently encountered an issue where some nodes in our cluster enter
a drain state randomly, typically after completing long-running jobs.
Below is the output from the sinfo command showing the reason "Prolog error":
root@controller-node:~# sinfo -R
REASON        USER   TIMESTAMP            NODELIST
Prolog error  slurm  2024-09-24T21:18:05  node[24,31]
When checking the slurmd.log files on the nodes, I noticed the
following errors:
[2024-09-24T17:18:22.386] [217703.extern] error: _handle_add_extern_pid_internal: Job 217703 can't add pid 3311892 to jobacct_gather plugin in the extern_step. (repeated 90 times)
[2024-09-24T17:18:22.917] [217703.extern] error: _handle_add_extern_pid_internal: Job 217703 can't add pid 3313158 to jobacct_gather plugin in the extern_step.
...
[2024-09-24T21:17:45.162] launch task StepId=217703.0 request from UID:54059 GID:1600 HOST:<SLURMCTLD_IP> PORT:53514
[2024-09-24T21:18:05.166] error: Waiting for JobId=217703 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
[2024-09-24T21:18:05.166] error: slurm_send_node_msg: [(null)] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC_MSG) failed: Unexpected missing socket error
[2024-09-24T21:18:05.166] error: _rpc_launch_tasks: unable to send return code to address:port=<SLURMCTLD_IP>:53514 msg_type=6001: No such file or directory
If you know how to solve these errors, please let me know. I would
greatly appreciate any guidance or suggestions for further troubleshooting.
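For reference, a couple of first checks that may help, assuming the Prolog script itself is the trigger; the node list is taken from the sinfo output above:

    # Where the prolog is configured and how it is run
    scontrol show config | grep -iE '^Prolog'

    # Return the drained nodes to service once the cause is understood
    scontrol update NodeName=node[24,31] State=RESUME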
Thank you in advance for your assistance.
Best regards,
--
Télécom Paris <https://www.telecom-paris.fr>
*Nacereddine LADDAOUI*
Research and Development Engineer
19 place Marguerite Perey
CS 20031
91123 Palaiseau Cedex
Une école de l'IMT <https://www.imt.fr>
Has anyone else noticed, somewhere between versions 22.05.11 and 23.11.9, that fixed Features defined for a node in slurm.conf get lost, and that those features are instead now controlled only by a NodeFeaturesPlugin like node_features/knl_generic?