We have a 'gpu' partition with 30 or so nodes, some with A100s, some with
H100s, and a few others.
It appears that when (for example) all of the A100 GPUs are in use, if
there are additional jobs requesting A100 GPUs pending, and those jobs have
the highest priority in the partition, then jobs submitted for H100s won't
run even if there are idle H100s. This is a small subset of our present
pending queue - the four bottom jobs should be running, but aren't. The top
pending job shows reason 'Resources' while the rest all show 'Priority'.
Any thoughts on why this might be happening?
JOBID    PRIORITY  TRES_ALLOC
8317749  501490    cpu=4,mem=80000M,node=1,billing=48,gres/gpu=1,gres/gpu:a100=1
8317750  501490    cpu=4,mem=80000M,node=1,billing=48,gres/gpu=1,gres/gpu:a100=1
8317745  501490    cpu=4,mem=80000M,node=1,billing=48,gres/gpu=1,gres/gpu:a100=1
8317746  501490    cpu=4,mem=80000M,node=1,billing=48,gres/gpu=1,gres/gpu:a100=1
8338679  500060    cpu=4,mem=64G,node=1,billing=144,gres/gpu=1,gres/gpu:h100=1
8338678  500060    cpu=4,mem=64G,node=1,billing=144,gres/gpu=1,gres/gpu:h100=1
8338677  500060    cpu=4,mem=64G,node=1,billing=144,gres/gpu=1,gres/gpu:h100=1
8338676  500060    cpu=4,mem=64G,node=1,billing=144,gres/gpu=1,gres/gpu:h100=1
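For context, the scheduler settings that seem relevant, and the pending jobs'
priorities and reasons, can be inspected with commands along these lines
(illustrative; I haven't included our SchedulerParameters output here):

    # Scheduler type and backfill parameters currently in effect
    scontrol show config | grep -E 'SchedulerType|SchedulerParameters'

    # Pending jobs in the gpu partition with priority, reason, and requested TRES
    squeue -p gpu -t PD -O JobID,Priority,Reason,tres-alloc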
Thanks,
Kevin
--
Kevin Hildebrand
University of Maryland
Division of IT
I am unable to limit the number of jobs per user per partition. I
have searched the internet, the forums, and the Slurm documentation.
I created a partition with a QOS having MaxJobsPU=1 and MaxJobsPA=1,
and created a user stephen with account=stephen and MaxJobs=1.
However, if I sbatch a test job (sleep 180) multiple times, they all
run concurrently. I am at a loss as to what else to do. Any help would be
greatly appreciated.
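For reference, a rough sketch of the setup I am describing (the QOS, account,
and partition names are illustrative; note that QOS and association limits are
only enforced when AccountingStorageEnforce includes "limits" in slurm.conf):

    # Create a QOS that limits each user (and the account) to one running job
    sacctmgr add qos limit1
    sacctmgr modify qos limit1 set MaxJobsPU=1 MaxJobsPA=1

    # Create the account and user, limited to one running job
    sacctmgr add account stephen
    sacctmgr add user stephen account=stephen
    sacctmgr modify user stephen set MaxJobs=1

    # slurm.conf: attach the QOS to the partition and enable limit enforcement
    #   PartitionName=test Nodes=... QOS=limit1 State=UP
    #   AccountingStorageEnforce=limits,qos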
Thank you
--
Stephen Connolly
JSI Data Systems Ltd
613-727-9353
stephen(a)jsidata.ca
Hello everyone,
I’ve recently encountered an issue where some nodes in our cluster enter
a drain state randomly, typically after completing long-running jobs.
Below is the output from the sinfo command showing the reason “Prolog error”:

    root@controller-node:~# sinfo -R
    REASON        USER   TIMESTAMP            NODELIST
    Prolog error  slurm  2024-09-24T21:18:05  node[24,31]
When checking the slurmd.log files on the nodes, I noticed the following errors:

    [2024-09-24T17:18:22.386] [217703.extern] error: _handle_add_extern_pid_internal: Job 217703 can't add pid 3311892 to jobacct_gather plugin in the extern_step.  (repeated 90 times)
    [2024-09-24T17:18:22.917] [217703.extern] error: _handle_add_extern_pid_internal: Job 217703 can't add pid 3313158 to jobacct_gather plugin in the extern_step.
    ...
    [2024-09-24T21:17:45.162] launch task StepId=217703.0 request from UID:54059 GID:1600 HOST:<SLURMCTLD_IP> PORT:53514
    [2024-09-24T21:18:05.166] error: Waiting for JobId=217703 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
    [2024-09-24T21:18:05.166] error: slurm_send_node_msg: [(null)] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC_MSG) failed: Unexpected missing socket error
    [2024-09-24T21:18:05.166] error: _rpc_launch_tasks: unable to send return code to address:port=<SLURMCTLD_IP>:53514 msg_type=6001: No such file or directory
If you know how to solve these errors, please let me know. I would
greatly appreciate any guidance or suggestions for further troubleshooting.
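(For completeness, a sketch of how the drained nodes can be returned to service
once the cause is understood, using standard scontrol usage:)

    # Clear the "Prolog error" drain state on the affected nodes
    scontrol update NodeName=node[24,31] State=RESUME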
Thank you in advance for your assistance.
Best regards,
--
Télécom Paris <https://www.telecom-paris.fr>
Nacereddine LADDAOUI
Research and Development Engineer
19 place Marguerite Perey
CS 20031
91123 Palaiseau Cedex
Has anyone else noticed that somewhere between versions 22.05.11 and 23.11.9, fixed Features defined for a node in slurm.conf are lost, and those features are instead controlled only by a NodeFeaturesPlugin such as node_features/knl_generic?
Slurm version 24.05.4 is now available and includes a fix for a recently
discovered security issue with the new stepmgr subsystem.
SchedMD customers were informed on October 9th and provided a patch on
request; this process is documented in our security policy. [1]
A mistake in authentication handling in stepmgr could permit an attacker
to execute processes under other users' jobs. This is limited to jobs
explicitly running with --stepmgr, or on systems that have globally
enabled stepmgr through "SlurmctldParameters=enable_stepmgr" in their
configuration. CVE-2024-48936.
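Sites unsure whether the global option is set can check with something like the
following (illustrative):

    # Look for enable_stepmgr in the controller's SlurmctldParameters
    scontrol show config | grep -i SlurmctldParameters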
Downloads are available at https://www.schedmd.com/downloads.php .
Release notes follow below.
- Tim
[1] https://www.schedmd.com/security-policy/
--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support
> * Changes in Slurm 24.05.4
> ==========================
> -- Fix generic int sort functions.
> -- Fix user look up using possible unrealized uid in the dbd.
> -- Fix FreeBSD compile issue with tls/none plugin.
> -- slurmrestd - Fix regressions that allowed slurmrestd to be run as SlurmUser
> when SlurmUser was not root.
> -- mpi/pmix fix race conditions with het jobs at step start/end which could
> make srun hang.
> -- Fix not showing some SelectTypeParameters in scontrol show config.
> -- Avoid assert when dumping certain removed fields in JSON/YAML.
> -- Improve how shards are scheduled with affinity in mind.
> -- Fix MaxJobsAccruePU not being respected when MaxJobsAccruePA is set
> in the same QOS.
> -- Prevent backfill from planning jobs that use overlapping resources for the
> same time slot if the job's time limit is less than bf_resolution.
> -- Fix memory leak when requesting typed gres and --[cpus|mem]-per-gpu.
> -- Prevent backfill from breaking out due to "system state changed" every 30
> seconds if reservations use REPLACE or REPLACE_DOWN flags.
> -- slurmrestd - Make sure that scheduler_unset parameter defaults to true even
> when the following flags are also set: show_duplicates, skip_steps,
> disable_truncate_usage_time, run_away_jobs, whole_hetjob,
> disable_whole_hetjob, disable_wait_for_result, usage_time_as_submit_time,
> show_batch_script, and/or show_job_environment. Additionally, always make
> sure show_duplicates and disable_truncate_usage_time default to true when
> the following flags are also set: scheduler_unset, scheduled_on_submit,
> scheduled_by_main, scheduled_by_backfill, and/or job_started. This affects
> the following endpoints:
> 'GET /slurmdb/v0.0.40/jobs'
> 'GET /slurmdb/v0.0.41/jobs'
> -- Ignore --json and --yaml options for scontrol show config to prevent mixing
> output types.
> -- Fix not considering nodes in reservations with Maintenance or Overlap flags
> when creating new reservations with nodecnt or when they replace down nodes.
> -- Fix suspending/resuming steps running under a 23.02 slurmstepd process.
> -- Fix options like sprio --me and squeue --me for users with a uid greater
> than 2147483647.
> -- fatal() if BlockSizes=0. This value is invalid and would otherwise cause the
> slurmctld to crash.
> -- sacctmgr - Fix issue where clearing out a preemption list using
> preempt='' would cause the given qos to no longer be preempt-able until set
> again.
> -- Fix stepmgr creating job steps concurrently.
> -- data_parser/v0.0.40 - Avoid dumping "Infinity" for NO_VAL tagged "number"
> fields.
> -- data_parser/v0.0.41 - Avoid dumping "Infinity" for NO_VAL tagged "number"
> fields.
> -- slurmctld - Fix a potential leak while updating a reservation.
> -- slurmctld - Fix state save with reservation flags when an update fails.
> -- Fix reservation update issues with parameters Accounts and Users, when
> using +/- signs.
> -- slurmrestd - Don't dump warning on empty wckeys in:
> 'GET /slurmdb/v0.0.40/config'
> 'GET /slurmdb/v0.0.41/config'
> -- Fix slurmd possibly leaving zombie processes on start up in configless when
> the initial attempt to fetch the config fails.
> -- Fix crash when trying to drain a non-existing node (possibly deleted
> before).
> -- slurmctld - fix segfault when calculating limit decay for jobs with an
> invalid association.
> -- Fix IPMI energy gathering with multiple sensors.
> -- data_parser/v0.0.39 - Remove xassert requiring errors and warnings to have a
> source string.
> -- slurmrestd - Prevent potential segfault when there is an error parsing an
> array field which could lead to a double xfree. This applies to several
> endpoints in data_parser v0.0.39, v0.0.40 and v0.0.41.
> -- scancel - Fix a regression from 23.11.6 where using both the --ctld and
> --sibling options would cancel the federated job on all clusters instead of
> only the cluster(s) specified by --sibling.
> -- accounting_storage/mysql - Fix bug when removing an association
> specified with an empty partition.
> -- Fix setting multiple partition state restore on a job correctly.
> -- Fix difference in behavior when swapping partition order in job submission.
> -- Fix security issue in stepmgr that could permit an attacker to execute
> processes under other users' jobs. CVE-2024-48936.
I have a SLURM configuration of 2 hosts with 6 + 4 CPUs.
I am submitting jobs with sbatch -n <CPU slots> <job script>.
However, I see that even when all 10 CPU slots are exhausted by running jobs, subsequent jobs are still allowed to run!
CPU slot availability is also shown as full for the 2 hosts, and no job is pending.
What could be the problem?
My Slurm.conf looks like (host names are changed to generic):
ClusterName=MyCluster
ControlMachine=host1
ControlAddr=<some address>
SlurmUser=slurmsa
#AuthType=auth/munge
StateSaveLocation=/var/spool/slurmd
SlurmdSpoolDir=/var/spool/slurmd
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=3
SlurmctldDebug=6
SlurmdLogFile=/var/log/slurm/slurmd.log
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=host1
#AccountingStoragePass=medslurmpass
#AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStorageUser=slurmsa
#TaskPlugin=task/cgroup
NodeName=host1 CPUs=6 SocketsPerBoard=3 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=host2 CPUs=4 ThreadsPerCore=1 State=UNKNOWN
PartitionName=debug Nodes=host1,host2 Default=YES MaxTime=INFINITE State=UP
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU
TaskPlugin=task/affinity
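For reference, the controller's view of the allocation can be cross-checked with
standard commands like these (illustrative):

    # Per-node CPU totals and allocations as slurmctld sees them
    scontrol show node host1 | grep -i cpu
    scontrol show node host2 | grep -i cpu

    # Running jobs and the number of CPUs actually allocated to each
    squeue -t R -o "%.10i %.9P %.8u %.4C %R"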
Thanks in advance for any help!
Regards,
Bhaskar.
Dear SLURM Users and Administrators,
I am interested in a way to customize the job submission exit statuses (mainly error codes) after the job has already been queued by the SLURM controller. We aim to provide more user-friendly messages and reminders in case of any errors or obstacles (also adjusted to our QoS/account system).
For example, in the case of exceeding the CPU minutes of a given QoS (or account), and after the (successful) job submission, we would like to notify the user that his job has been queued (as it should be) but won't start until the CPU-minute limits are increased (and that he should contact the administrators to apply for more resources). Similarly, if the user queued a job that cannot be launched immediately because of exceeding the MaxJobs limit (per user), we would like to also give him some additional message after the srun/sbatch submission. We want to provide such information immediately after the job submission, without the need for the user to check the status using `squeue`.
In the Job Launch Guide (https://slurm.schedmd.com/job_launch.html) the following steps are distinguished:
1. Call job_submit plugins to modify the request as appropriate
2. Validate that the options are valid for this user (e.g. valid partition name, valid limits, etc.)
3. Determine if this job is the highest priority runnable job, if so then really try to allocate resources for it now, otherwise only validate that it could run if no other jobs existed
4. Determine which nodes could be used for the job. If the feature specification uses an exclusive OR option, then multiple iterations of the selection process below will be required with disjoint sets of nodes
5. Call the select plugin to select the best resources for the request
6. The select plugin will consider network topology and the topology within a node (e.g. sockets, cores, and threads) to select the best resources for the job
7. If the job can not be initiated using available resources and preemption support is configured, the select plugin will also determine if the job can be initiated after preempting lower priority jobs. If so then initiate preemption as needed to start the job.
From my understanding, to achieve our goal one would need access to the source code or a plugin related to point 2 (and some part of point 3). Unfortunately, the job_submit (lua) plugin from point 1 (and the cli_filter plugin as well) cannot be used, because it only has access to information on the parameters of the submitted job and the SLURM partitions (but not the QoS/account usage and their limits).
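For illustration, the kind of immediate feedback we have in mind is what slurm.log_user() provides in a job_submit.lua plugin; the sketch below only shows the messaging mechanism, since the limit/usage check itself (the hypothetical over_limit flag) is exactly the part that is not available to the plugin:

    -- job_submit.lua (sketch): relay a custom message back to sbatch/srun at submission time
    function slurm_job_submit(job_desc, part_list, submit_uid)
        -- Hypothetical: this information is not exposed to the plugin today
        local over_limit = false

        if over_limit then
            slurm.log_user("Your job has been queued, but it will not start until " ..
                           "your CPU-minute limit is raised; please contact the admins.")
        end
        return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
        return slurm.SUCCESS
    end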
Is there any way to extend the customization of job submission to include such features?
Best regards,
Sebastian
--
dr inż. Sebastian Sitkiewicz
Politechnika Wrocławska
Wrocławskie Centrum Sieciowo-Superkomputerowe
Dział Usług Obliczeniowych
Wyb. Wyspiańskiego 27
50-370 Wrocław
www.wcss.pl
We are trying to design the charging and accounting system for our new institutional HPC facility and I'm having difficulty understanding exactly how we can use sacctmgr to achieve what we need.
Until now, our previous HPC facilities have all operated as free delivery and we have not needed to track costs by user/group/project. Account codes have been purely optional.
However, our new facility will be split into various resource types, with free
partitions and paid/priority/reserved partitions across those resource types.
All jobs will need to be submitted with an account code.
For users submitting to 'free' partitions we don't need to track resource units against a balance, but the submitted account code would still be used for reporting purposes (i.e. "free resources accounted for % of all use by this project in August-September").
When submitting to a 'paid' partition, the account code needs to be checked to ensure it has a positive balance (or a balance that will not go past some negative threshold).
Each of the 'paid' partitions may (will) have a different resource unit cost. A simple example:
- Submit to a generic CPU paid partition
-- 1 resource unit/token/credit/£/$ per allocated cpu, per hour of compute
- Submit to a high-speed, non-blocking CPU paid partition
-- 2 resource unit/token/credit/£/$ per allocated cpu, per hour of compute
- Submit to a GPU paid partition
-- 4 resource unit/token/credit/£/$ per allocated GPU card, per hour of compute
We need to have *one* pool of resource units/tokens/credits per account - let's say 1000 credits, and a group of users may well decide to spend all of their credits on the generic CPU partition, all on the GPU partition, or some mixture of the two.
So in the above examples, assuming one user (or group of users sharing the same account code) submit a 2 hour job to all three partitions, their one, single account code should be charged:
- 2 units for the generic CPU partition
- 4 units for the job on the low latency partition
- 8 units for the gpu partition.
- A total of 14 credits removed from their single account code
Is this feasible to achieve without having to allocate credits to each of the partitions for an account, or creating a QOS variant for each and every combination of account and partition?
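To make that concrete, the sort of configuration we have been imagining is a per-partition TRESBillingWeights plus a single billing budget on each account; this is only a sketch (partition names and weights are illustrative, and as far as I understand a hard budget also needs priority usage decay disabled, e.g. PriorityDecayHalfLife=0 or a NoDecay QOS):

    # slurm.conf (sketch): different billing weight per paid partition
    PartitionName=cpu_paid  Nodes=...  TRESBillingWeights="CPU=1.0"
    PartitionName=cpu_fast  Nodes=...  TRESBillingWeights="CPU=2.0"
    PartitionName=gpu_paid  Nodes=...  TRESBillingWeights="GRES/gpu=4.0"
    PartitionName=cpu_free  Nodes=...  TRESBillingWeights="CPU=0.0"
    AccountingStorageEnforce=limits,safe

    # sacctmgr (sketch): one shared pool of billing units per account code.
    # GrpTRESMins is in TRES-minutes, so 1000 "credit-hours" = 60000 billing-minutes.
    sacctmgr modify account proj001 set GrpTRESMins=billing=60000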
John Snowdon
Senior Research Infrastructure Engineer (HPC)
Research Software Engineering
Catalyst Building, Room 2.01
Newcastle University
3 Science Square
Newcastle Helix
Newcastle upon Tyne
NE4 5TG
https://rse.ncldata.dev/
Hey guys!
I'm looking to improve GPU monitoring on our cluster. I want to install
this https://github.com/NVIDIA/dcgm-exporter and saw in the README that
it can support tracking of job IDs:
https://github.com/NVIDIA/dcgm-exporter?tab=readme-ov-file#enabling-hpc-job…
However, I haven't been able to find any examples of how to do it, nor does
Slurm seem to expose this information by default.
Does anyone here do this? If so, do you have any examples I could try to
follow? If you have advice on best practices for monitoring GPUs, I'd be
happy to hear it!
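For what it's worth, the approach I am imagining is a Slurm Prolog script that
writes the job ID into per-GPU files in the directory dcgm-exporter watches for
its HPC job mapping; this is only a sketch based on my reading of the README,
and the directory path and file layout below are assumptions:

    #!/bin/bash
    # prolog.sh (sketch): record the Slurm job ID for each GPU allocated to this job.
    # JOB_MAP_DIR must match the directory dcgm-exporter is configured to watch
    # for HPC job mapping (path here is illustrative).
    JOB_MAP_DIR=/var/lib/dcgm-exporter/job-mapping

    # SLURM_JOB_GPUS lists the GPU IDs allocated to the job (set in the Prolog
    # environment when GRES GPUs are allocated)
    if [ -n "$SLURM_JOB_GPUS" ]; then
        for gpu in ${SLURM_JOB_GPUS//,/ }; do
            echo "$SLURM_JOB_ID" > "$JOB_MAP_DIR/$gpu"
        done
    fi

    # A matching Epilog would remove the same files when the job ends.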
Regards,
Sylvain Maret