I have two machines. When I run "srun hostname" on the first machine (it is both the controller and a node), I get the hostname fine, but I get a "Socket timed out" error in these two situations:

1) "srun hostname" on the 2nd machine (it's a node)
2) "srun -N 2 hostname" on the controller

"scontrol show node" shows both mach2 and mach4, and "sinfo" shows both nodes too. Also, after the error the job gets stuck forever in the CG (completing) state. Here is the output:
  
$ srun -N 2 hostname
mach2
srun: error: slurm_receive_msgs: [[mach4]:6818] failed: Socket timed out on send/recv operation
srun: error: Task launch for StepId=2222.0 failed on node hpc4: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation
srun: Job step aborted


Output from "squeue", taken 3 seconds apart:

Tue Jun 11 05:09:56 2024
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              2222     poxo hostname   arnuld  R       0:19      2 mach4,mach2

Tue Jun 11 05:09:59 2024
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              2222     poxo hostname   arnuld CG       0:20      1 mach4