Hi all, first of all, sorry for my English, it's not my native language.
We are currently experiencing an issue with srun and salloc on our login nodes, while sbatch works properly.
Slurm version: 23.11.4.
slurmctld runs on the management node mmgt01. srun and salloc fail intermittently on the login node; that is, we can successfully use srun from the login node from time to time, but then it stops working for a while without any configuration change on our side.
The login node reports the following to the user:
$ srun -N 1 -n 64 --partition=gpunode --pty /bin/bash
srun: job 4872 queued and waiting for resources
srun: error: Security violation, slurm message from uid 202
srun: error: Security violation, slurm message from uid 202
srun: error: Security violation, slurm message from uid 202
srun: error: Task launch for StepId=4872.0 failed on node cn065: Invalid job credential
srun: error: Application launch failed: Invalid job credential
srun: Job step aborted
uid 202 is the slurm user.
On the server side, slurmctld logs show:
sched: _slurm_rpc_allocate_resources JobId=4872 NodeList=(null) usec=228
sched: Allocate JobId=4872 NodeList=cn065 #CPUs=64 Partition=gpunode
error: slurm_receive_msgs: [[snmgt01]:38727] failed: Zero Bytes were transmitted or received
Killing interactive JobId=4872: Communication connection failure
_job_complete: JobId=4872 WEXITSTATUS 1
_job_complete: JobId=4872 done
step_partial_comp: JobId=4872 StepID=0 invalid; this step may have already completed
_slurm_rpc_complete_job_allocation: JobId=4872 error Job/step already completing or completed
Given the "Zero Bytes were transmitted or received" error, we suspect a network issue.
The configless setup appears to work properly: slurmd on the login node picks up changes made to slurm.conf after an scontrol reconfig.
srun runs successfully from the management nodes and from the compute nodes; the issue only appears from the login node.
scontrol ping always reports DOWN from the login node, even when we can successfully run srun or salloc.
$ scontrol ping
Slurmctld(primary) at mmgt01 is DOWN
We also checked for munge consistency.
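By that we mean checks along these lines (cn065 is just an example target node; this assumes ssh access from the login node):

$ munge -n | unmunge            # encode and decode locally on the login node
$ munge -n | ssh cn065 unmunge  # credential created on the login node, decoded on cn065
$ ssh cn065 munge -n | unmunge  # and the reverse direction

unmunge reports a decode error if the keys differ or the clocks are too far apart.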
The mmgt and login nodes have each other's hostnames in /etc/hosts and can communicate.
We would really appreciate some tips on what we could be missing.
Best regards, Bruno Bruzzo System Administrator - Clementina XXI
Bruno Bruzzo via slurm-users slurm-users@lists.schedmd.com writes:
slurmctld runs on the management node mmgt01. srun and salloc fail intermittently on the login node; that is, we can successfully use srun from the login node from time to time, but then it stops working for a while without any configuration change on our side.
This, to me, sounds like there could be a problem on the compute nodes, or in the communication between logins and computes. One thing that has bitten me several times over the years is compute nodes missing from /etc/hosts on other compute nodes. Slurmctld often sends messages to computes via other computes, and if a message happens to go via a node that does not have the target compute in its /etc/hosts, it cannot forward the message.
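A quick way to spot that is something like the following (just a sketch; clush/pdsh and the node names other than mmgt01 and cn065 are made up, so adjust them to your cluster):

# every node should be able to resolve every other node; spot-check one target from all computes
$ clush -w cn[001-100] 'getent hosts cn065'
# or compare the provisioned /etc/hosts files directly (only meaningful if they are meant to be identical)
$ clush -b -w mmgt01,login01,cn[001-100] 'md5sum /etc/hosts'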
Another thing to look out for is whether any nodes running slurmd (computes or logins) have their slurmd port blocked by firewalld or something else.
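For the ports, something like this (again just a sketch; 6817 and 6818 are the default SlurmctldPort and SlurmdPort, so check scontrol show config if you have changed them):

# what the cluster is actually configured to use
$ scontrol show config | grep -iE 'slurmctldport|slurmdport'
# is anything filtered on a node running slurmd?
$ firewall-cmd --list-all
# can a login node reach slurmd on a compute, and slurmctld on the controller?
$ nc -zv cn065 6818
$ nc -zv mmgt01 6817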
scontrol ping always reports DOWN from the login node, even when we can successfully run srun or salloc.
This might indicate that the slurmctld port on mmgt01 is blocked, or the slurmd port on the logins.
It might be something completely different, but I'd at least check /etc/hosts on all nodes (controller, logins, computes) and check that all needed ports are unblocked.
Hi, sorry for the late reply.
We tested your proposal and can confirm that all nodes have each other in their respective /etc/hosts. We can also confirm that the slurmd port is not blocked.
One useful way we found to reproduce the issue: if we run srun -w <node x> in one session and srun -w <node x> in another, the second srun waits for resources while the first one gets a shell on <node x>. If we then exit the first session, the srun that was waiting fails with the security violation / invalid job credential errors instead of getting onto <node x>.
We also found that scontrol ping fails not only on the login node but also on the nodes of one specific partition, showing this longer message:
Slurmctld(primary) at <headnode> is DOWN
*****************************************
** RESTORE SLURMCTLD DAEMON TO SERVICE **
*****************************************
Still, Slurm is able to assign those nodes to jobs.
We also raised the debug level to the maximum on slurmctld, and when running scontrol ping we get this log:
[2025-09-24T14:45:16] auth/munge: _print_cred: ENCODED: Wed Dec 31 21:00:00 1969
[2025-09-24T14:45:16] auth/munge: _print_cred: DECODED: Wed Dec 31 21:00:00 1969
[2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55274] auth_g_verify: REQUEST_PING has authentication error: Unspecified error
[2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55274] Protocol authentication error
[2025-09-24T14:45:16] error: slurm_receive_msg [172.28.253.11:55274]: Protocol authentication error
[2025-09-24T14:45:16] error: Munge decode failed: Unauthorized credential for client UID=202 GID=202
[2025-09-24T14:45:16] auth/munge: _print_cred: ENCODED: Wed Dec 31 21:00:00 1969
[2025-09-24T14:45:16] auth/munge: _print_cred: DECODED: Wed Dec 31 21:00:00 1969
[2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55286] auth_g_verify: REQUEST_PING has authentication error: Unspecified error
[2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55286] Protocol authentication error
[2025-09-24T14:45:16] error: slurm_receive_msg [172.28.253.11:55286]: Protocol authentication error
[2025-09-24T14:45:16] error: Munge decode failed: Unauthorized credential for client UID=202 GID=202
The timestamp munge prints, Wed Dec 31 21:00:00 1969 (the Unix epoch in our local time zone), looks suspicious. We checked that munge.key has the correct ownership and that all nodes have the same file.
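Concretely, something along these lines, run as root since munge.key is not world-readable (clush and the node range are only illustrative):

# ownership should be the munge user with restrictive permissions, and the checksum identical everywhere
$ clush -b -w mmgt01,cn[001-100] 'ls -l /etc/munge/munge.key; sha256sum /etc/munge/munge.key'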
Does anyone have more documentation on what scontrol ping does? We haven't found detailed information in the docs.
Best regards, Bruno Bruzzo System Administrator - Clementina XXI
On Fri, Aug 29, 2025 at 3:47 AM, Bjørn-Helge Mevik via slurm-users (slurm-users@lists.schedmd.com) wrote:
It might be something completely different, but I'd at least check /etc/hosts on all nodes (controller, logins, computes) and check that all needed ports are unblocked.
-- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo
Err, are all your nodes on the same time?
Actually, slurmd will not start if a compute node is too far away in time from the controller node, so you should be OK.
I would still check that the times on all nodes are in agreement.
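Something along these lines would do it (clush and the node names are just an example; pdsh or a plain ssh loop works the same):

# epoch seconds plus the chrony offset on every node; the values should agree to within a second or so
$ clush -w mmgt01,login01,cn[001-100] 'date +%s && chronyc tracking | grep "System time"'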
On Wed, Sep 24, 2025, 7:19 PM Bruno Bruzzo via slurm-users <slurm-users@lists.schedmd.com> wrote:
The timestamp munge prints, Wed Dec 31 21:00:00 1969 (the Unix epoch in our local time zone), looks suspicious.
Yes, all nodes are synchronized with chrony.
On Wed, Sep 24, 2025 at 3:28 PM, John Hearns (hearnsj@gmail.com) wrote:
Err, are all your nodes on the same time?
Shot down in 🔥🔥
On Wed, Sep 24, 2025, 7:43 PM Bruno Bruzzo <bbruzzo@dc.uba.ar> wrote:
Yes, all nodes are synchronized with chrony.
Update: We have solved the issue.
Our problem was that even though we have a configless configuration, our provisioning served an unconfigured slurm.conf file to /etc/slurm.
On the failing nodes, we could see:
scontrol show config | grep -i "hash_val"
cn080: HASH_VAL = Different Ours=<...> Slurmctld=<...>
While on working nodes we saw:
scontrol show config | grep -i "hash_val"
cn044: HASH_VAL = Match
Note: The failing nodes could still get jobs scheduled via sbatch. The issue was with srun/salloc.
We removed the slurm.conf file, restarted services, and for now, everything works fine.
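For anyone hitting the same thing, the cleanup was roughly the following (clush and the node list are only illustrative; adjust them to whichever nodes report a HASH_VAL mismatch):

# find nodes whose local config disagrees with the controller
$ clush -b -w login01,cn[001-100] 'scontrol show config | grep -i hash_val'
# drop the stale local file so only the configless config remains, then restart slurmd
$ clush -w login01,cn[001-100] 'rm /etc/slurm/slurm.conf && systemctl restart slurmd'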
Thanks for the support.
Bruno Bruzzo System Administrator - Clementina XXI
On Wed, Sep 24, 2025 at 3:51 PM, John Hearns (hearnsj@gmail.com) wrote:
Shot down in 🔥🔥
On 9/30/25 20:52, Bruno Bruzzo via slurm-users wrote:
Update: We have solved the issue.
Our problem was that even though we have a configless configuration, our provisioning served an unconfigured slurm.conf file to /etc/slurm.
FYI: the Configless Slurm documentation describes the order of precedence for which slurm.conf is used. The configless-fetched slurm.conf has the lowest priority of all; see https://slurm.schedmd.com/configless_slurm.html#NOTES
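A quick way to see which slurm.conf a node has actually picked up, and whether it matches the controller's, is for example:

# path of the config in effect on this node, plus whether its hash matches slurmctld's
$ scontrol show config | grep -E 'SLURM_CONF|HASH_VAL'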
Best regards, Ole