I have two machines. When I run "srun hostname" on one machine (it's both a controller and a node), I get the hostname back fine, but I get a "Socket timed out" error in these two situations:
1) "srun hostname" on the 2nd machine (it's a node)
2) "srun -N 2 hostname" on the controller
"scontrol show node" shows both mach2 and mach4. "sinfo" shows both nodes too. Also the job gets stuck forever in CG state after the error. Here is the output:
$ srun -N 2 hostname
mach2
srun: error: slurm_receive_msgs: [[mach4]:6818] failed: Socket timed out on send/recv operation
srun: error: Task launch for StepId=2222.0 failed on node hpc4: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation
srun: Job step aborted
Output from "squeue", 3 seconds apart:
Tue Jun 11 05:09:56 2024
JOBID PARTITION NAME     USER   ST TIME NODES NODELIST(REASON)
2222  poxo      hostname arnuld R  0:19 2     mach4,mach2
Tue Jun 11 05:09:59 2024
JOBID PARTITION NAME     USER   ST TIME NODES NODELIST(REASON)
2222  poxo      hostname arnuld CG 0:20 1     mach4
I enabled "debug3" logging and saw this in the node log:
error: mpi_conf_send_stepd: unable to resolve MPI plugin offset from plugin_id=106. This error usually results from a job being submitted against an MPI plugin which was not compiled into slurmd but was for job submission command.
error: _send_slurmstepd_init: mpi_conf_send_stepd(9, 106) failed: No error
I removed the "MpiDefault" option from slurm.conf, and now "srun -N2 -l hostname" returns the hostnames of all machines.
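For anyone hitting the same log message, here is a rough sketch of how one might check for this kind of MPI plugin mismatch before editing the config. The helper names (show_mpi_default, check_cluster_mpi) and the default path /etc/slurm/slurm.conf are assumptions for illustration, not anything from my setup; adjust them for your install.

```shell
#!/usr/bin/env bash
# Sketch only: the default config path is an assumption; adjust for your install.
SLURM_CONF="${SLURM_CONF:-/etc/slurm/slurm.conf}"

# Hypothetical helper: print any active (uncommented) MpiDefault line
# from the given config file, or say that none is set.
show_mpi_default() {
    local conf="${1:-$SLURM_CONF}"
    grep -iE '^[[:space:]]*MpiDefault[[:space:]]*=' "$conf" \
        || echo "MpiDefault not set in $conf"
}

# Hypothetical helper: live-cluster checks (needs the Slurm client tools).
# Defined but not called here, so the file can be sourced without Slurm.
check_cluster_mpi() {
    srun --mpi=list                             # MPI plugin types srun knows about
    scontrol show config | grep -i MpiDefault   # value currently in effect
}
```

Running show_mpi_default on each machine shows whether the configs agree, and check_cluster_mpi on each node shows which MPI plugin types that installation offers. If the default names a plugin that one node's build lacks, that would be consistent with the error above. Removing the option is what worked for me; installing a build that includes the plugin is presumably the other route.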
On Tue, Jun 11, 2024 at 11:05 AM Arnuld arnuld@aganitha.ai wrote: