Hello everyone,
I’ve recently encountered an issue where some nodes in our cluster enter
a drain state randomly, typically after completing long-running jobs.
Below is the output from the sinfo command showing the reason "Prolog error":
root@controller-node:~# sinfo -R
REASON          USER    TIMESTAMP            NODELIST
Prolog error    slurm   2024-09-24T21:18:05  node[24,31]
When checking the slurmd.log files on the nodes, I noticed the
following errors:
[2024-09-24T17:18:22.386] [217703.extern] error: _handle_add_extern_pid_internal: Job 217703 can't add pid 3311892 to jobacct_gather plugin in the extern_step. (repeated 90 times)
[2024-09-24T17:18:22.917] [217703.extern] error: _handle_add_extern_pid_internal: Job 217703 can't add pid 3313158 to jobacct_gather plugin in the extern_step.
...
[2024-09-24T21:17:45.162] launch task StepId=217703.0 request from UID:54059 GID:1600 HOST:<SLURMCTLD_IP> PORT:53514
[2024-09-24T21:18:05.166] error: Waiting for JobId=217703 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
[2024-09-24T21:18:05.166] error: slurm_send_node_msg: [(null)] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC_MSG) failed: Unexpected missing socket error
[2024-09-24T21:18:05.166] error: _rpc_launch_tasks: unable to send return code to address:port=<SLURMCTLD_IP>:53514 msg_type=6001: No such file or directory
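For completeness, this is how I have been inspecting the prolog-related settings and returning the drained nodes to service afterwards (the grep pattern is just mine, not anything authoritative):
scontrol show config | grep -Ei 'prolog|timeout'
scontrol update NodeName=node[24,31] State=RESUME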
If you know how to solve these errors, please let me know. I would
greatly appreciate any guidance or suggestions for further troubleshooting.
Thank you in advance for your assistance.
Best regards,
--
Télécom Paris <https://www.telecom-paris.fr>
*Nacereddine LADDAOUI*
Research and Development Engineer
19 place Marguerite Perey
CS 20031
91123 Palaiseau Cedex
Has anyone else noticed, somewhere between versions 22.05.11 and 23.11.9, losing the fixed Features defined for a node in slurm.conf, with those features instead now controlled only by a NodeFeaturesPlugin such as node_features/knl_generic?
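For context, this is the sort of static definition I mean (node names and feature list are made up for the example):
NodeFeaturesPlugins=node_features/knl_generic
NodeName=node[01-04] CPUs=64 Features=intel,avx512,bigmem
On 22.05.11 the static Features were still reported alongside the plugin-managed ones; after the upgrade they appear to be gone.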
Slurm version 24.05.4 is now available and includes a fix for a recently
discovered security issue with the new stepmgr subsystem.
SchedMD customers were informed on October 9th and provided a patch on
request; this process is documented in our security policy. [1]
A mistake in authentication handling in stepmgr could permit an attacker
to execute processes under other users' jobs. This is limited to jobs
explicitly running with --stepmgr, or on systems that have globally
enabled stepmgr through "SlurmctldParameters=enable_stepmgr" in their
configuration. CVE-2024-48936.
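A quick way to check whether a given system has stepmgr enabled globally, as a rough sketch assuming the usual configuration path:
grep -i 'enable_stepmgr' /etc/slurm/slurm.conf
scontrol show config | grep -i 'slurmctldparameters'
Jobs that were submitted with --stepmgr are affected regardless of that global setting.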
Downloads are available at https://www.schedmd.com/downloads.php .
Release notes follow below.
- Tim
[1] https://www.schedmd.com/security-policy/
--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support
> * Changes in Slurm 24.05.4
> ==========================
> -- Fix generic int sort functions.
> -- Fix user look up using possible unrealized uid in the dbd.
> -- Fix FreeBSD compile issue with tls/none plugin.
> -- slurmrestd - Fix regressions that allowed slurmrestd to be run as SlurmUser
> when SlurmUser was not root.
> -- mpi/pmix fix race conditions with het jobs at step start/end which could
> make srun hang.
> -- Fix not showing some SelectTypeParameters in scontrol show config.
> -- Avoid assert when dumping removed certain fields in JSON/YAML.
> -- Improve how shards are scheduled with affinity in mind.
> -- Fix MaxJobsAccruePU not being respected when MaxJobsAccruePA is set
> in the same QOS.
> -- Prevent backfill from planning jobs that use overlapping resources for the
> same time slot if the job's time limit is less than bf_resolution.
> -- Fix memory leak when requesting typed gres and --[cpus|mem]-per-gpu.
> -- Prevent backfill from breaking out due to "system state changed" every 30
> seconds if reservations use REPLACE or REPLACE_DOWN flags.
> -- slurmrestd - Make sure that scheduler_unset parameter defaults to true even
> when the following flags are also set: show_duplicates, skip_steps,
> disable_truncate_usage_time, run_away_jobs, whole_hetjob,
> disable_whole_hetjob, disable_wait_for_result, usage_time_as_submit_time,
> show_batch_script, and/or show_job_environment. Additionally, always make
> sure show_duplicates and disable_truncate_usage_time default to true when
> the following flags are also set: scheduler_unset, scheduled_on_submit,
> scheduled_by_main, scheduled_by_backfill, and/or job_started. This affects
> the following endpoints:
> 'GET /slurmdb/v0.0.40/jobs'
> 'GET /slurmdb/v0.0.41/jobs'
> -- Ignore --json and --yaml options for scontrol show config to prevent mixing
> output types.
> -- Fix not considering nodes in reservations with Maintenance or Overlap flags
> when creating new reservations with nodecnt or when they replace down nodes.
> -- Fix suspending/resuming steps running under a 23.02 slurmstepd process.
> -- Fix options like sprio --me and squeue --me for users with a uid greater
> than 2147483647.
> -- fatal() if BlockSizes=0. This value is invalid and would otherwise cause the
> slurmctld to crash.
> -- sacctmgr - Fix issue where clearing out a preemption list using
> preempt='' would cause the given qos to no longer be preempt-able until set
> again.
> -- Fix stepmgr creating job steps concurrently.
> -- data_parser/v0.0.40 - Avoid dumping "Infinity" for NO_VAL tagged "number"
> fields.
> -- data_parser/v0.0.41 - Avoid dumping "Infinity" for NO_VAL tagged "number"
> fields.
> -- slurmctld - Fix a potential leak while updating a reservation.
> -- slurmctld - Fix state save with reservation flags when an update fails.
> -- Fix reservation update issues with parameters Accounts and Users, when
> using +/- signs.
> -- slurmrestd - Don't dump warning on empty wckeys in:
> 'GET /slurmdb/v0.0.40/config'
> 'GET /slurmdb/v0.0.41/config'
> -- Fix slurmd possibly leaving zombie processes on start up in configless when
> the initial attempt to fetch the config fails.
> -- Fix crash when trying to drain a non-existing node (possibly deleted
> before).
> -- slurmctld - fix segfault when calculating limit decay for jobs with an
> invalid association.
> -- Fix IPMI energy gathering with multiple sensors.
> -- data_parser/v0.0.39 - Remove xassert requiring errors and warnings to have a
> source string.
> -- slurmrestd - Prevent potential segfault when there is an error parsing an
> array field which could lead to a double xfree. This applies to several
> endpoints in data_parser v0.0.39, v0.0.40 and v0.0.41.
> -- scancel - Fix a regression from 23.11.6 where using both the --ctld and
> --sibling options would cancel the federated job on all clusters instead of
> only the cluster(s) specified by --sibling.
> -- accounting_storage/mysql - Fix bug when removing an association
> specified with an empty partition.
> -- Fix setting multiple partition state restore on a job correctly.
> -- Fix difference in behavior when swapping partition order in job submission.
> -- Fix security issue in stepmgr that could permit an attacker to execute
> processes under other users' jobs. CVE-2024-48936.
I have a SLURM configuration of 2 hosts with 6 + 4 CPUs.
I am submitting jobs with sbatch -n <CPU slots> <job script>.
However, I see that even when I have exhausted all 10 CPU slots with running jobs, it still allows subsequent jobs to run!
The CPU slot availability is also shown as full for the 2 hosts. No job is left pending.
What could be the problem?
My Slurm.conf looks like (host names are changed to generic):
ClusterName=MyCluster
ControlMachine=host1
ControlAddr=<some address>
SlurmUser=slurmsa
#AuthType=auth/munge
StateSaveLocation=/var/spool/slurmd
SlurmdSpoolDir=/var/spool/slurmd
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=3
SlurmctldDebug=6
SlurmdLogFile=/var/log/slurm/slurmd.log
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=host1
#AccountingStoragePass=medslurmpass
#AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStorageUser=slurmsa
#TaskPlugin=task/cgroup
NodeName=host1 CPUs=6 SocketsPerBoard=3 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=host2 CPUs=4 ThreadsPerCore=1 State=UNKNOWN
PartitionName=debug Nodes=host1,host2 Default=YES MaxTime=INFINITE State=UP
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU
TaskPlugin=task/affinity
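For reference, this is how I am checking the allocation counts (format strings taken from the sinfo/squeue man pages):
sinfo -N -o "%N %C"                          # per-node CPUs as Allocated/Idle/Other/Total
squeue -t RUNNING -o "%.10i %.8u %.4C %R"    # running jobs with their CPU counts and nodes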
Thanks in advance for any help!
Regards,
Bhaskar.
Dear Slurm Users and Administrators,
I am interested in a way to customize the job submission exit statuses (mainly error codes) after the job has already been queued by the Slurm controller. We aim to provide more user-friendly messages and reminders in case of any errors or obstacles (also tailored to our QoS/account system).
For example, in the case of exceeding the CPU minutes of a given QoS (or account), and after a (successful) job submission, we would like to notify the user that their job has been queued (as it should be) but won't start until the CPU minutes limits are increased (and that they should contact the administrators to apply for more resources). Similarly, if the user queued a job that cannot be launched immediately because of exceeding the MaxJobs limit (per user), we would like to give them an additional message after the srun/sbatch submission. We want to provide such information immediately after the job submission, without the user needing to check the status with `squeue`.
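At the moment the user can only discover the reason after the fact, e.g. with something like the following (format string from the squeue man page; the reason names are from memory and may not be exact):
squeue -u $USER -o "%.12i %.9P %.8T %r"
which shows pending reasons such as AssocGrpCPUMinutesLimit or QOSMaxJobsPerUserLimit. That is exactly the information we would like to surface at submission time instead.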
In the Job Launch Guide (https://slurm.schedmd.com/job_launch.html) the following steps are distinguished:
1. Call job_submit plugins to modify the request as appropriate
2. Validate that the options are valid for this user (e.g. valid partition name, valid limits, etc.)
3. Determine if this job is the highest priority runnable job, if so then really try to allocate resources for it now, otherwise only validate that it could run if no other jobs existed
4. Determine which nodes could be used for the job. If the feature specification uses an exclusive OR option, then multiple iterations of the selection process below will be required with disjoint sets of nodes
5. Call the select plugin to select the best resources for the request
6. The select plugin will consider network topology and the topology within a node (e.g. sockets, cores, and threads) to select the best resources for the job
7. If the job can not be initiated using available resources and preemption support is configured, the select plugin will also determine if the job can be initiated after preempting lower priority jobs. If so then initiate preemption as needed to start the job.
From my understanding, to achieve our goal one would need access to the source code or a plugin hook related to point 2 (and part of point 3). Unfortunately, the job_submit (lua) plugin from point 1 (and the cli_filter plugin as well) cannot be used, because it only has access to the parameters of the submitted job and the Slurm partitions (but not to QoS/account usage and their limits).
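For what it is worth, the usage and limit data we would need does appear to be available on the controller side, e.g. via scontrol (syntax as I understand it from the man page, so please correct me if this is wrong):
scontrol show assoc_mgr users=$USER flags=assoc
scontrol show assoc_mgr users=$USER flags=qos
but I do not see a supported way to reach that data from the job_submit or cli_filter plugins at submission time.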
Is there any way to extend the customization of job submission to include such features?
Best regards,
Sebastian
--
dr inż. Sebastian Sitkiewicz
Politechnika Wrocławska
Wrocławskie Centrum Sieciowo-Superkomputerowe
Dział Usług Obliczeniowych
Wyb. Wyspiańskiego 27
50-370 Wrocław
www.wcss.pl
[View Less]
We are trying to design the charging and accounting system for our new institutional HPC facility and I'm having difficulty understanding exactly how we can use sacctmgr to achieve what we need.
Until now, our previous HPC facilities have all operated as free delivery and we have not needed to track costs by user/group/project. Account codes have been purely optional.
However, our new facility will be split into various resource types, with free partitions and paid/priority/reserved partitions across those resource types.
All jobs will need to be submitted with an account code.
For users submitting to 'free' partitions we don't need to track resource units against a balance, but the submitted account code would still be used for reporting purposes (i.e. "free resources accounted for % of all use by this project in August-September").
When submitting to a 'paid' partition, the account code needs to be checked to ensure it has a positive balance (or a balance that will not go past some negative threshold).
Each of the 'paid' partitions may (will) have a different resource unit cost. A simple example:
- Submit to a generic CPU paid partition
-- 1 resource unit/token/credit/£/$ per allocated cpu, per hour of compute
- Submit to a high-speed, non-blocking CPU paid partition
-- 2 resource unit/token/credit/£/$ per allocated cpu, per hour of compute
- Submit to a GPU paid partition
-- 4 resource unit/token/credit/£/$ per allocated GPU card, per hour of compute
We need to have *one* pool of resource units/tokens/credits per account - let's say 1000 credits, and a group of users may well decide to spend all of their credits on the generic CPU partition, all on the GPU partition, or some mixture of the two.
So in the above examples, assuming one user (or a group of users sharing the same account code) submits a 2-hour job to each of the three partitions (using 1 CPU, or 1 GPU on the GPU partition), their one, single account code should be charged:
- 2 units for the generic CPU partition
- 4 units for the job on the low latency partition
- 8 units for the gpu partition.
- A total of 14 credits removed from their single account code
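From what I have read so far, the per-partition rates themselves look like they could be expressed with TRESBillingWeights, something along these lines (partition names and node lists are placeholders):
PartitionName=cpu_standard Nodes=... TRESBillingWeights="CPU=1.0"
PartitionName=cpu_lowlat   Nodes=... TRESBillingWeights="CPU=2.0"
PartitionName=gpu_paid     Nodes=... TRESBillingWeights="GRES/gpu=4.0"
A per-account limit on the "billing" TRES (e.g. GrpTRESMins=billing=60000 for 1000 credit-hours) would then act as the single credit pool, but I am not sure whether that can be done cleanly per account. Hence the question below.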
Is this feasible to achieve without having to allocate credits to each of the partitions for an account, or creating a QOS variant for each and every combination of account and partition?
John Snowdon
Senior Research Infrastructure Engineer (HPC)
Research Software Engineering
Catalyst Building, Room 2.01
Newcastle University
3 Science Square
Newcastle Helix
Newcastle upon Tyne
NE4 5TG
https://rse.ncldata.dev/
Hey guys!
I'm looking to improve GPU monitoring on our cluster. I want to install
https://github.com/NVIDIA/dcgm-exporter and saw in the README that it can
support tracking of job IDs:
https://github.com/NVIDIA/dcgm-exporter?tab=readme-ov-file#enabling-hpc-job…
However, I haven't been able to find any examples of how to do it, nor does
Slurm seem to expose this information by default.
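The closest I have got to a plan is a prolog/epilog pair along these lines; the mapping directory and the one-file-per-GPU layout are only my reading of the dcgm-exporter README, so please treat it as a sketch:
#!/bin/bash
# prolog sketch: record the job id for each GPU allocated to the job
MAPDIR=/run/dcgm-job-map                 # assumed path; must match whatever dcgm-exporter is configured to read
mkdir -p "$MAPDIR"
for gpu in ${SLURM_JOB_GPUS//,/ }; do    # SLURM_JOB_GPUS should be set in the prolog environment for GPU jobs
    echo "$SLURM_JOB_ID" > "$MAPDIR/$gpu"
done
The matching epilog would just remove $MAPDIR/<gpu index> again when the job ends.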
Does anyone here do this? And if so, do you have any examples I could try to
follow? If you have advice on best practices for monitoring GPUs, I'd be happy
to hear it!
Regards,
Sylvain Maret
Hi Everyone,
I'm new to Slurm administration and looking for a bit of help!
Just added Accounting to an existing cluster, but job information is not being added to the accounting MariaDB. When I submit a test job it gets scheduled fine and it's visible with squeue, but I get nothing returned from sacct!
I have turned the logging up to debug5 in both the slurmctld and slurmdbd logs and can't see any errors. I believe all the comms are ok between slurmctld and slurmdbd, as when I enter the sacct command I can see the database is being queried but returning nothing, because nothing has been added to the tables. The cluster tables were created fine when I ran
#sacctmgr add cluster ny5ktt
$ sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
# tail -f slurmdbd.log
[2024-10-17T12:34:45.232] debug: REQUEST_PERSIST_INIT: CLUSTER:ny5ktt VERSION:9216 UID:10001 IP:10.202.233.117 CONN:10
[2024-10-17T12:34:45.232] debug2: accounting_storage/as_mysql: acct_storage_p_get_connection: acct_storage_p_get_connection: request new connection 1
[2024-10-17T12:34:45.233] debug2: Attempting to connect to localhost:3306
[2024-10-17T12:34:45.274] debug2: DBD_GET_JOBS_COND: called
[2024-10-17T12:34:45.317] debug2: DBD_FINI: CLOSE:1 COMMIT:0
[2024-10-17T12:34:45.317] debug4: accounting_storage/as_mysql: acct_storage_p_commit: got 0 commits
The MariaDB is running on its own node with slurmdbd and munged for authentication. I haven't set up any accounts, users, associations or enforcements yet. On my lab cluster, jobs were visible in the database without these being set up. I guess I must be missing something simple in the config that is stopping jobs from being reported to slurmdbd.
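One thing I was not sure how to verify is whether slurmctld has actually registered itself with slurmdbd. As far as I understand it, the ControlHost/ControlPort columns in the output of the command below should be populated once it has:
sacctmgr show cluster ny5ktt
If anyone can confirm that is the right check (or suggest a better one), that would already help.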
Master Node packages
# rpm -qa |grep slurm
slurm-slurmdbd-20.11.9-1.el8.x86_64
slurm-libs-20.11.9-1.el8.x86_64
slurm-20.11.9-1.el8.x86_64
slurm-slurmd-20.11.9-1.el8.x86_64
slurm-perlapi-20.11.9-1.el8.x86_64
slurm-doc-20.11.9-1.el8.x86_64
slurm-contribs-20.11.9-1.el8.x86_64
slurm-slurmctld-20.11.9-1.el8.x86_64
Database Node packages
# rpm -qa |grep slurm
slurm-slurmdbd-20.11.9-1.el8.x86_64
slurm-20.11.9-1.el8.x86_64
slurm-libs-20.11.9-1.el8.x86_64
slurm-devel-20.11.9-1.el8.x86_64
slurm.conf
#
# See the slurm.conf man page for more information.
#
ClusterName=ny5ktt
ControlMachine=ny5-pr-kttslurm-01
ControlAddr=10.202.233.71
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
MailProg=/bin/true
MaxJobCount=200000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/d
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurm/ctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
#MinJobAge=300
#MinJobAge=43200
# CHG0057915
MinJobAge=14400
# CHG0057915
#MaxJobCount=50000
#MaxJobCount=100000
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
DefMemPerCPU=3000
#FastSchedule=1
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_tres
#SelectTypeParameters=CR_Core
#SelectTypeParameters=CR_CPU
SelectTypeParameters=CR_CPU_Memory
# ECR CHG0056915 10/14/2023
MaxArraySize=5001
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageEnforce=limits
AccountingStorageHost=ny5-pr-kttslurmdb-01.ktt.schonfeld.com
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
#AccountingStorageType=accounting_storage/none
AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageUser=
AccountingStoreJobComment=YES
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=60
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
#SlurmdLogFile=
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
##using fqdn since the ctld domain is different. Can't use regex since it's not at the end
##save 17 and 18 as headnodes
#NodeName=ny5-dv-kttres-17 Sockets=1 CoresPerSocket=18 ThreadsPerCore=2 Feature=HyperThread RealMemory=102400
#NodeName=ny5-dv-kttres-18 Sockets=1 CoresPerSocket=14 ThreadsPerCore=2 Feature=HyperThread RealMemory=102400
NodeName=ny5-dv-kttres-19 Sockets=1 CoresPerSocket=12 ThreadsPerCore=2 Feature=HyperThread RealMemory=102400
NodeName=ny5-dv-kttres-[20-21] Sockets=1 CoresPerSocket=18 ThreadsPerCore=2 Feature=HyperThread RealMemory=102400
NodeName=ny5-dv-kttres-[01-16] Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 Feature=HyperThread RealMemory=233472
NodeName=ny5-dv-kttres-[22-35] Sockets=2 CoresPerSocket=32 ThreadsPerCore=2 Feature=HyperThread RealMemory=346884
PartitionName=ktt_slurm_light_1 Nodes=ny5-dv-kttres-[19-21] Default=NO MaxTime=INFINITE State=UP OverSubscribe=FORCE:2
PartitionName=ktt_slurm_medium_1 Nodes=ny5-dv-kttres-[01-08] Default=NO MaxTime=INFINITE State=UP OverSubscribe=FORCE:2
PartitionName=ktt_slurm_medium_2 Nodes=ny5-dv-kttres-[09-16] Default=NO MaxTime=INFINITE State=UP OverSubscribe=FORCE:2
PartitionName=ktt_slurm_medium_3 Nodes=ny5-dv-kttres-[22-28] Default=NO MaxTime=INFINITE State=UP OverSubscribe=FORCE:2
PartitionName=ktt_slurm_medium_4 Nodes=ny5-dv-kttres-[29-35] Default=NO MaxTime=INFINITE State=UP OverSubscribe=FORCE:2
PartitionName=ktt_slurm_large_1 Nodes=ny5-dv-kttres-[01-16] Default=YES MaxTime=INFINITE State=UP OverSubscribe=FORCE:2
PartitionName=ktt_slurm_large_2 Nodes=ny5-dv-kttres-[22-35] Default=NO MaxTime=INFINITE State=UP OverSubscribe=FORCE:2
Slurmdbd.conf
AuthType=auth/munge
DbdAddr=10.202.233.72
DbdHost=ny5-pr-kttslurmdb-01
DebugLevel=debug5
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/tmp/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=localhost
#StorageHost=10.234.132.57
StorageUser=slurm
SlurmUser=slurm
StoragePass=xxxxxxx
#StorageUser=slurm
#StorageLoc=slurm_acct_db
Database tables
MariaDB [slurm_acct_db]> show tables;
+--------------------------------+
| Tables_in_slurm_acct_db |
+--------------------------------+
| acct_coord_table |
| acct_table |
| clus_res_table |
| cluster_table |
| convert_version_table |
| federation_table |
| ny5ktt_assoc_table |
| ny5ktt_assoc_usage_day_table |
| ny5ktt_assoc_usage_hour_table |
| ny5ktt_assoc_usage_month_table |
| ny5ktt_event_table |
| ny5ktt_job_table |
| ny5ktt_last_ran_table |
| ny5ktt_resv_table |
| ny5ktt_step_table |
| ny5ktt_suspend_table |
| ny5ktt_usage_day_table |
| ny5ktt_usage_hour_table |
| ny5ktt_usage_month_table |
| ny5ktt_wckey_table |
| ny5ktt_wckey_usage_day_table |
| ny5ktt_wckey_usage_hour_table |
| ny5ktt_wckey_usage_month_table |
| qos_table |
| res_table |
| table_defs_table |
| tres_table |
| txn_table |
| user_table |
+--------------------------------+
Many Thanks
Adrian
Dear all,
we've set up SLURM 24.05.3 on our cluster and are experiencing an issue with interactive jobs. Before, we used 21.08 and pretty much the same settings, but without these issues. We've started with a fresh DB etc.
The behavior of interactive jobs is very erratic. Sometimes they start absolutely fine; at other times they die silently in the background while the user has to wait indefinitely. We have been unable to isolate particular users or nodes affected by this. On a given node, one user might be able to start an interactive job while another user at the same time isn't able to. The day after, the situation might be the other way around.
The exception is jobs that use a reservation. These start fine every time as far as we can tell. At the same time, the number of idle nodes does not seem to influence the behavior I described above.
Failed allocation on the front end:
[user1@login1 ~]$ salloc
salloc: Pending job allocation 5052052
salloc: job 5052052 queued and waiting for resources
The same job on the backend:
2024-10-14 11:41:57.680 slurmctld: _job_complete: JobId=5052052 done
2024-10-14 11:41:57.678 slurmctld: _job_complete: JobId=5052052 WEXITSTATUS 1
2024-10-14 11:41:57.678 slurmctld: Killing interactive JobId=5052052: Communication connection failure
2024-10-14 11:41:46.666 slurmctld: sched/backfill: _start_job: Started JobId=5052052 in devel on m02n01
2024-10-14 11:41:30.096 slurmctld: sched: _slurm_rpc_allocate_resources JobId=5052052 NodeList=(null) usec=6258
Raising the debug level has not brought additional information. We were hoping that one of you might be able to provide some insight into what the next steps in troubleshooting might be.
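For completeness, the only related settings I have thought to compare with our old 21.08 configuration so far are the interactive-step and srun port parameters (the grep pattern is just illustrative):
scontrol show config | grep -Ei 'launchparameters|interactivestepoptions|srunportrange'
If there are other knobs worth looking at, please point me at them.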
Best regards,
Onno