Hi Everyone,
We have a SLURM cluster with three different types of nodes. One
partition consists of nodes with a large number of CPUs: 256 CPUs per node.
I'm trying to find out the current CPU allocation on some of those nodes,
but part of the information I've gathered seems to be inconsistent. If I run
"*scontrol show node <node-name>*", I get this for the CPU info:
    RealMemory=450000 AllocMem=262144 FreeMem=235397 Sockets=2 Boards=1
    State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
    CPUAlloc=256 CPUEfctv=256 CPUTot=256 CPULoad=126.65
    CfgTRES=cpu=256,mem=450000M,billing=256
    AllocTRES=cpu=256,mem=256G
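That's just the CPU-related slice of the scontrol output; those fields can be
filtered out with something along these lines (the node name is a placeholder):

    # Keep only the lines carrying CPU, memory and TRES allocation fields
    scontrol show node <node-name> | grep -E 'CPU|Mem|TRES|ThreadsPerCore'
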
However, when I tried to identify the jobs to which the node's CPUs have
been allocated and to tally their allocated CPUs, I could only account for 128
CPUs effectively allocated on that node, based on the output of
*squeue --state=R -o "%C %N"*.
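Roughly, the tally was done like this (the node name is a placeholder, and it
ignores jobs spanning multiple nodes, since %C counts a job's CPUs across all
of its nodes):

    # Sum the CPU counts of running jobs that list this node
    squeue --state=R --nodelist=<node-name> --noheader -o "%C %N" | awk '{sum += $1} END {print sum}'
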
So I don't quite understand why the running jobs on the node account for only
128 of the 256 allocated CPUs, even though scontrol reports 100% CPU
allocation on the node. Could this be due to some misconfiguration, or a bug
in the SLURM version we're running (23.02.4)? The interesting thing is that we
have six nodes with similar specs, and all of them show up as allocated in the
output of *sinfo*, yet the running jobs on each node account for only 128
CPUs, as if they're all capped at 128.
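For reference, the per-node CPU counts by state (allocated/idle/other/total)
can be pulled from sinfo with something like this, where the node list is a
placeholder:

    # %n = hostname, %T = node state, %C = CPUs as allocated/idle/other/total
    sinfo -N -n <node-list> -o "%n %T %C"
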
Any thoughts, suggestions or assistance to figure this out would be greatly
appreciated.
Thanks,
Muhammad