Dear Slurm-user list,
when node startup fails (cloud scheduling), the job scheduled on that node
should be requeued, according to the documentation for ResumeTimeout:
Maximum time permitted (in seconds) between when a node resume
request is issued and when the node is actually available for use.
Nodes which fail to respond in this time frame will be marked DOWN
and the jobs scheduled on the node requeued.
However, instead of being requeued, the job is killed:
[2024-11-18T10:41:52.003] node bibigrid-worker-wubqboa1z2kkgx0-0 not
resumed by ResumeTimeout(1200) - marking down and power_save
[2024-11-18T10:41:52.003] Killing JobId=1 on failed node
bibigrid-worker-wubqboa1z2kkgx0-0
[2024-11-18T10:41:52.046] update_node: node
bibigrid-worker-wubqboa1z2kkgx0-0 reason set to: FailedStartup
[2024-11-18T10:41:52.046] power down request repeating for node
bibigrid-worker-wubqboa1z2kkgx0-0
Our ResumeProgram does not change the state of the underlying workers. I
think we should set the nodes to DOWN explicitly if the startup fails,
given the documentation's guidance that if the
*ResumeProgram* is unable to restore a node to service with a
responding slurmd and an updated BootTime, it should set the node
state to DOWN, which will result in a requeue of any job associated
with the node - this will happen automatically if the node doesn't
register within ResumeTimeout
But in any case, as the log shows, the job should already be requeued
simply because ResumeTimeout was reached, and I am unsure why that is not
happening. The power down request is sent by our ResumeFailProgram.
We have SlurmctldParameters=idle_on_node_suspend enabled, but I assume
that should not affect resume.
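Coming back to the idea of setting the node DOWN explicitly: a minimal
sketch of what that could look like in a ResumeFailProgram, assuming the
script receives the failed node names as its first argument (Slurm passes
them as a hostlist expression); the script name and Reason string are only
illustrative:

#!/bin/bash
# Hypothetical ResumeFailProgram sketch: explicitly mark nodes that failed
# to start as DOWN so slurmctld requeues any jobs allocated to them.
# $1 holds the hostlist expression of the nodes that hit ResumeTimeout.
FAILED_NODES="$1"

# scontrol accepts a hostlist expression for NodeName.
scontrol update NodeName="${FAILED_NODES}" State=DOWN Reason="FailedStartup"

# A power down request could also be issued here, as our current
# ResumeFailProgram already does.
scontrol update NodeName="${FAILED_NODES}" State=POWER_DOWN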
My Slurm version is 23.11.5.
Best regards,
Xaver
# More context
## slurmctld log, from job submission to failure
[2024-11-18T10:21:45.490] sched: _slurm_rpc_allocate_resources JobId=1
NodeList=bibigrid-worker-wubqboa1z2kkgx0-0 usec=1221
[2024-11-18T10:21:45.499] debug: sackd_mgr_dump_state: saved state of 0
nodes
[2024-11-18T10:21:58.387] debug: sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:21:58.387] debug: sched/backfill: _attempt_backfill: no
jobs to backfill
[2024-11-18T10:22:20.003] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:23:20.003] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:23:20.009] debug: sackd_mgr_dump_state: saved state of 0
nodes
[2024-11-18T10:23:23.003] debug: Spawning ping agent for
bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:23:23.398] debug: sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:23:23.398] debug: sched/backfill: _attempt_backfill: no
jobs to backfill
[2024-11-18T10:23:53.398] debug: sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:23:53.398] debug: sched/backfill: _attempt_backfill: no
jobs to backfill
[2024-11-18T10:24:21.000] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:24:21.484] slurmscriptd: error: _run_script: JobId=0
resumeprog exit status 1:0
[2024-11-18T10:25:20.003] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:26:02.000] debug: Spawning ping agent for
bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:26:02.417] debug: sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:26:02.417] debug: sched/backfill: _attempt_backfill: no
jobs to backfill
[2024-11-18T10:26:20.007] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:26:32.417] debug: sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:26:32.417] debug: sched/backfill: _attempt_backfill: no
jobs to backfill
[2024-11-18T10:27:20.003] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:28:20.003] debug: Updating partition uid access list
[2024-11-18T10:28:20.003] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:28:20.008] debug: sackd_mgr_dump_state: saved state of 0
nodes
[2024-11-18T10:29:20.003] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:29:22.000] debug: Spawning ping agent for
bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:29:22.448] debug: sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:29:22.448] debug: sched/backfill: _attempt_backfill: no
jobs to backfill
[2024-11-18T10:30:20.007] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:31:20.003] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:32:21.000] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:32:42.000] debug: Spawning ping agent for
bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:32:42.478] debug: sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:32:42.478] debug: sched/backfill: _attempt_backfill: no
jobs to backfill
[2024-11-18T10:33:12.479] debug: sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:33:12.479] debug: sched/backfill: _attempt_backfill: no
jobs to backfill
[2024-11-18T10:33:20.003] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:33:20.010] debug: sackd_mgr_dump_state: saved state of 0
nodes
[2024-11-18T10:34:20.003] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:35:20.007] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:36:01.004] debug: Spawning ping agent for
bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:36:01.504] debug: sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:36:01.504] debug: sched/backfill: _attempt_backfill: no
jobs to backfill
[2024-11-18T10:36:20.003] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:36:31.505] debug: sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:36:31.505] debug: sched/backfill: _attempt_backfill: no
jobs to backfill
[2024-11-18T10:37:21.000] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:38:20.008] debug: Updating partition uid access list
[2024-11-18T10:38:20.008] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:38:20.017] debug: sackd_mgr_dump_state: saved state of 0
nodes
[2024-11-18T10:39:20.003] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:39:21.003] debug: Spawning ping agent for
bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:39:21.530] debug: sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:39:21.530] debug: sched/backfill: _attempt_backfill: no
jobs to backfill
[2024-11-18T10:39:51.531] debug: sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:39:51.531] debug: sched/backfill: _attempt_backfill: no
jobs to backfill
[2024-11-18T10:40:21.000] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:41:20.003] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:41:52.003] node bibigrid-worker-wubqboa1z2kkgx0-0 not
resumed by ResumeTimeout(1200) - marking down and power_save
[2024-11-18T10:41:52.003] Killing JobId=1 on failed node
bibigrid-worker-wubqboa1z2kkgx0-0
[2024-11-18T10:41:52.046] update_node: node
bibigrid-worker-wubqboa1z2kkgx0-0 reason set to: FailedStartup
[2024-11-18T10:41:52.046] power down request repeating for node
bibigrid-worker-wubqboa1z2kkgx0-0
[2024-11-18T10:41:52.047] debug: sackd_mgr_dump_state: saved state of 0
nodes
[2024-11-18T10:41:52.549] debug: sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:41:52.549] debug: sched/backfill: _attempt_backfill: no
jobs to backfill
[2024-11-18T10:41:52.736] _slurm_rpc_complete_job_allocation: JobId=1
error Job/step already completing or completed
[2024-11-18T10:41:53.000] debug: Spawning ping agent for
bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:41:53.000] debug: sched: Running job scheduler for
default depth.
[2024-11-18T10:41:53.014] update_node: node
bibigrid-worker-wubqboa1z2kkgx0-0 reason set to: FailedStartup
[2024-11-18T10:41:53.014] update_node: node
bibigrid-worker-wubqboa1z2kkgx0-0 state set to IDLE
I am doing a new install of Slurm 24.05.3. I have all the packages built
and installed on the head node and compute node with the same munge.key,
slurm.conf, and gres.conf files. I was able to run the munge and unmunge
commands to test munge successfully. Time is synced with chronyd. I can't
seem to find any useful errors in the logs. For some reason, when I run
sinfo, no nodes are listed; I just see the headers for each column. Has
anyone seen this, or does anyone know what a next troubleshooting step
would be? I'm new to this and not sure where to go from here. Thanks for
any and all help!
The odd output I am seeing
[username@headnode ~] sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
(Nothing is output showing status of partition or nodes)
slurm.conf:
ClusterName=slurmkvasir
SlurmctldHost=kadmin2
MpiDefault=none
ProctrackType=proctrack/cgroup
PrologFlags=contain
ReturnToService=2
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/cgroup
MinJobAge=600
SchedulerType=sched/backfill
SelectType=select/cons_tres
PriorityType=priority/multifactor
AccountingStorageHost=localhost
AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageTRES=gres/gpu,cpu,node
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmLogFile=/var/log/slurm/slurmd.log
nodeName=k[001-448]
PartitionName=default Nodes=k[001-448] Default=YES MaxTime=INFINITE State=up
slurmctld.log:
Error: Configured MailProg is invalid
Slurmctld version 24.05.3 started on cluster slurmkvasir
Accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 8617
Error: read_slurm_conf: default partition not set.
Recovered state of 448 nodes
Down nodes: k[002-448]
Recovered information about 0 jobs
Recovered state of 0 reservations
Read_slurm_conf: backup_controller not specified
Select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
Running as primary controller
slurmd.log:
Error: Node configuration differs from hardware: CPUs=1:40(hw) Boards=1:1(hw) SocketsPerBoard=1:2(hw) CoresPerSocket=1:20(hw) ThreadsPerCore=1:1(hw)
CPU frequency setting not configured for this node
Slurmd version 24.05.3 started
Slurmd started on Wed, 27 Nov 2024 06:51:03 -0700
CPUS=1 Boards=1 Cores=1 Threads=1 Memory=192030 TmpDisk=95201 uptime 166740 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
Error: _forward_thread: failed to k019 (10.142.0.119:6818): Connection timed out
(Above line repeated 20 or so times for different nodes.)
Thanks,
Kent Hanson
We are pleased to announce the availability of the Slurm 24.11 release.
To highlight some new features in 24.11:
- New gpu/nvidia plugin. This does not rely on any NVIDIA libraries, and
will
build by default on all systems. It supports basic GPU detection and
management, but cannot currently identify GPU-to-GPU links, or provide
usage data as these are not exposed by the kernel driver.
- Add autodetected GPUs to the output from "slurmd -C".
- Added new QOS-based reports to "sreport".
- Revamped network I/O with the "conmgr" thread-pool model.
- Added new "hostlist function" syntax for management commands and
configuration files.
- switch/hpe_slingshot - Added support for hardware collectives setup
through
the fabric manager. (Requires SlurmctldParameters=enable_stepmgr)
- Added SchedulerParameters=bf_allow_magnetic_slot configuration option to
allow backfill planning for magnetic reservations.
- Added new "scontrol listjobs" and "liststeps" commands to complement
"listpids", and provide --json/--yaml output for all three subcommands.
- Allow jobs to be submitted against multiple QOSes.
- Added new experimental "oracle" backfill scheduling support, which permits
jobs to be delayed if the oracle function determines the reduced
fragmentation of the network topology is sufficiently advantageous.
- Improved responsiveness of the controller when jobs are requeued by
replacing the "db_index" identifier with a slurmctld-generated unique
identifier. ("SLUID")
- New options to job_container/tmpfs to permit site-specific scripts to
modify the namespace before user steps are launched, and to ensure all
steps are completely captured within that namespace.
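As a short illustration of the new "scontrol listjobs"/"liststeps"/
"listpids" subcommands mentioned above (option placement follows the
general scontrol synopsis; the man page has the authoritative 24.11
usage):

# List jobs and steps tracked by slurmctld, with structured output:
scontrol --json listjobs
scontrol --yaml liststeps

# The existing listpids subcommand gains the same output options:
scontrol --json listpids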
The Slurm documentation has also been updated to the 24.11 release.
(Older versions can be found in the archive, linked from the main
documentation page.)
Slurm can be downloaded from https://www.schedmd.com/download-slurm/ .
- Tim
--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support
I don't know how many times I've read the docs; I keep thinking I understand it, but something is really wrong with prioritisation on our cluster, and we're struggling to understand why.
The setup:
1. We have a group who submit two types of work: production jobs and research jobs.
2. We have two sacctmgr accounts for this; let's call those 'prod' and 'research'.
3. We also have some dedicated hardware that they paid for which can be used only by users associated with the prod account.
Desired behaviour:
1. Usage of their dedicated hardware by production jobs should not hugely decrease the fairshare priority for research jobs in other partitions.
2. Usage of shared hardware should decrease their fairshare priority (whether by production or research jobs)
3. Memory should make a relatively small contribution to TRES usage (it's not normally the constrained resource)
Our approach:
Set TRESBillingWeights for cpu, memory and gres/GPU usage on shared partitions. Typically these are set to: CPU=1.0,Mem=0.25G,GRES/gpu=1.0
Set TRESBillingWeights to something small on the dedicated hardware partition, such as: CPU=0.25
Set PriorityWeightFairshare and PriorityWeightAge to values such that Fairshare dominates when jobs are young, and Age takes over if they've been pending a long time
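A sketch of how the approach above might be expressed in slurm.conf
(partition names, node lists, and the exact values are placeholders, not
our real configuration):

# Shared partition: bill CPU and GPU equally, memory at a reduced rate.
PartitionName=shared Nodes=gpu[001-010] TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=1.0"
# Dedicated prod-only partition: keep billing small so usage here barely
# moves fairshare for jobs elsewhere.
PartitionName=prod_only Nodes=prod[001-004] AllowAccounts=prod TRESBillingWeights="CPU=0.25"
# Priority weights: fairshare dominates for young jobs; age takes over
# for long-pending jobs.
PriorityWeightFairshare=100000
PriorityWeightAge=10000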
The observed behaviour:
1. production association jobs have a high priority; this is working well
2. research jobs are still getting heavily penalised in fairshare, and we don't understand why; they seem to have enormous RawUsage, largely coming from memory:
Here's what I see from sshare (sensitive details removed, obviously):
sshare -l -A prod,research -a -o Account,RawUsage,EffectvUsage,FairShare,LevelFS,TRESRunMins%80 | grep -v cpu=0
Account RawUsage EffectvUsage FairShare LevelFS TRESRunMins
-------------------- ----------- ------------- ---------- ---------- --------------------------------------------------------------------------------
prod 1587283 0.884373 0.226149 cpu=81371,mem=669457237,energy=0,node=20610,billing=100833,fs/disk=0,vmem=0,pag+
prod 1082008 0.681681 0.963786 0.366740 cpu=81281,mem=669273429,energy=0,node=20520,billing=100833,fs/disk=0,vmem=0,pag+
prod 505090 0.318202 0.964027 0.785664 cpu=90,mem=184320,energy=0,node=90,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=+
research 1043560787 0.380577 0.121648 cpu=17181098808,mem=35196566339054,energy=0,node=4295361360,billing=25773481938+
research 146841 0.000141 0.005311 124.679238 cpu=824,mem=3375923,energy=0,node=824,billing=824,fs/disk=0,vmem=0,pages=0,gres+
research 17530141 0.016798 0.001449 1.044377 cpu=254484,mem=3379938816,energy=0,node=161907,billing=893592,fs/disk=0,vmem=0,+
research 167597 0.000161 0.005070 109.238498 cpu=7275,mem=223516160,energy=0,node=7275,billing=50931,fs/disk=0,vmem=0,pages=+
research 12712481 0.012182 0.001931 1.440166 cpu=186327,mem=95399526,energy=0,node=23290,billing=232909,fs/disk=0,vmem=0,pag+
research 11521011 0.011040 0.002173 1.589104 cpu=8167,mem=267626086,energy=0,node=8167,billing=65338,fs/disk=0,vmem=0,pages=+
research 9719735 0.009314 0.002414 1.883599 cpu=15020,mem=69214617,energy=0,node=1877,billing=3755,fs/disk=0,vmem=0,pages=0+
research 25004766 0.023961 0.001207 0.732184 cpu=590778,mem=6464600473,energy=0,node=98910,billing=2266887,fs/disk=0,vmem=0,+
research 68938740 0.066061 0.000724 0.265570 cpu=159332,mem=963064985,energy=0,node=89957,billing=192706,fs/disk=0,vmem=0,pa+
research 7359413 0.007052 0.002656 2.487710 cpu=81401,mem=583487624,energy=0,node=20350,billing=20350,fs/disk=0,vmem=0,page+
research 718714430 0.688714 0.000241 0.025473 cpu=20616,mem=337774728,energy=0,node=5154,billing=92772,fs/disk=0,vmem=0,pages+
research 1016606 0.000974 0.003863 18.009010 cpu=17179774580,mem=35184178340113,energy=0,node=4294943645,billing=25769661870+
Firstly, why are the mem TRES numbers so enormous?
Secondly, what's going on with the last user, where the rawusage is tiny, but the TRESRunMins is ridiculously big? That could be messing up the whole thing.
Thanks in advance for any advice (either explaining what I've misunderstood, or suggesting "there's a better way to achieve what you want").
Tim
--
Tim Cutts
Scientific Computing Platform Lead
AstraZeneca
We have an 8-GPU server in which one GPU has gone into an error state that
will require a reboot to clear. I have jobs on the server running on good
GPUs that will take another 3 days to complete. In the meantime, I would
like short jobs to run on the good free GPUs till I reboot.
I set a reservation on the whole node for the time window in which I plan
to reboot, with
scontrol create reservation reservationName=rtx-01_reboot users=root
starttime=2024-11-25T06:00:00 duration=720 Nodes=rtx-01 flags=maint,ignore_jobs
But I would like to set a reservation on just the bad GPU (gpu_id=7) from
now till 2024-11-25T06:00:00 so no job runs that will use it.
Is that possible?
---------------------------------------------------------------
Paul Raines http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street Charlestown, MA 02129 USA
Hi,
I am using an old Slurm version (20.11.8) and we had to reboot our cluster
today for maintenance. I suspended all the jobs on it with the command
scontrol suspend list_job_ids, and all the jobs paused and were suspended.
However, when I tried to resume them after the reboot, scontrol resume did
not work (the reason column showed "(JobHeldAdmin)"). I was able to
release them with scontrol release and the jobs started running again.
However, the time Slurm recorded for them was reset (the Time column shows
0:00 for all the jobs), though the jobs seem to have resumed from the
point they had reached before being suspended.
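For reference, a sketch of one way to suspend everything before a reboot
and resume it afterwards (the job selection below is illustrative, not the
exact commands I ran):

# Before the reboot: suspend every currently running job.
squeue -h -t RUNNING -o %i | xargs -r -n1 scontrol suspend

# After the reboot: try to resume the jobs that are still suspended.
squeue -h -t SUSPENDED -o %i | xargs -r -n1 scontrol resume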
1- Did I follow the right procedure to suspend, reboot and resume/release?
2- In this case, is the wall time for all the jobs reset, so that anyone
with Slurm admin rights could make their jobs run longer than the wall
time limit by suspending and resuming them?
Best,
*Fritz Ratnasamy*
Data Scientist
Information Technology
Hi,
I compiled and installed Slurm 24.05 on Ubuntu 22.04 following this
tutorial: https://www.schedmd.com/slurm/installation-tutorial/
The systemd service files are from the deb packages that result from this build.
Do I have to worry that slurmctld and slurmd don't write PID files
although SlurmctldPidFile and SlurmdPidFile are defined in slurm.conf?
The paths for the PID files exist and are writable, and the logs don't show any errors.
slurmdbd does write a PID file as defined in slurmdbd.conf.
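A quick way to cross-check the PID file paths the running daemons were
configured with, and whether anything actually gets written there (the
example paths below are placeholders; use whatever the first command
reports):

# Show the PID file settings the daemons are using:
scontrol show config | grep -i pidfile

# Then check whether files exist at those locations, e.g.:
ls -l /run/slurmctld.pid /run/slurmd.pid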
thx
Matthias