[slurm-users] Which ports does slurm use?

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Fri Feb 7 21:33:33 UTC 2020


On 06-02-2020 22:40, Dean Schulze wrote:
> I've moved two nodes to a different controller.  The nodes are wired and 
> the controller is networked via wifi.  I had to open up ports 6817 and 
> 6818 between the wired and wireless sides of our network to get any 
> connectivity.
> 
> Now when I do
> 
> srun -N2 hostname
> 
> the jobs show connection timeouts on the nodes:
> 
> [2020-02-06T14:24:37.183] launch task 60.0 request from UID:1000 GID:1000 HOST:10.204.18.232 PORT:19602
> [2020-02-06T14:24:37.183] lllp_distribution jobid [60] implicit auto binding: cores, dist 8192
> [2020-02-06T14:24:37.183] _task_layout_lllp_cyclic
> [2020-02-06T14:24:37.183] _lllp_generate_cpu_bind jobid [60]: mask_cpu, 0x0101
> [2020-02-06T14:24:37.184] _run_prolog: run job script took usec=6
> [2020-02-06T14:24:37.184] _run_prolog: prolog with lock for job 60 ran for 0 seconds
> [2020-02-06T14:24:45.224] [60.0] error: connect io: Connection timed out
> [2020-02-06T14:24:45.224] [60.0] error: IO setup failed: Connection timed out
> [2020-02-06T14:24:45.225] [60.0] error: job_manager exiting abnormally, rc = 4021
> [2020-02-06T14:24:59.538] [60.0] error: _send_launch_resp: Failed to send RESPONSE_LAUNCH_TASKS: Connection timed out
> [2020-02-06T14:24:59.551] [60.0] done with job
> 
> That node used port 19602 and the other node was using port 12496.  When 
> I did the srun again the jobs showed two different ports on the nodes 
> (58040 and 32392).
> 
> How can I configure a network if srun is going to grab different ports 
> each time?

Perhaps the firewall setup information on my Wiki page will be of use:
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
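The changing ports in the question are srun's own listening ports: by default srun lets the kernel pick ephemeral ports, which is why they differ on every run, and the `connect io` timeout is consistent with the compute node being unable to reach srun on the submit host. The slurm.conf parameter SrunPortRange (for example SrunPortRange=60001-63000) restricts srun to a fixed range that a firewall can allow. The Python sketch below is illustrative only: it reads port settings from slurm.conf text with a simplified parser (not Slurm's own) and prints firewalld commands per host role. The role names, the helper functions and the 60001-63000 range are hypothetical; it assumes the defaults SlurmctldPort=6817 and SlurmdPort=6818.

```python
import re

# Documented slurm.conf defaults. SrunPortRange has no default: when it is
# unset, srun listens on kernel-chosen ephemeral ports.
DEFAULTS = {"slurmctldport": "6817", "slurmdport": "6818"}
PORT_KEYS = ("slurmctldport", "slurmdport", "srunportrange")

# Hypothetical role -> inbound TCP ports mapping, for illustration only.
ROLE_PORTS = {
    "controller": ["slurmctldport"],  # slurmctld listens for RPCs here
    "compute": ["slurmdport"],        # slurmd listens for RPCs here
    "submit": ["srunportrange"],      # srun listens here for messages and task I/O
}


def parse_ports(conf_text: str) -> dict[str, str]:
    """Collect port settings from slurm.conf text (simplified parser)."""
    ports = dict(DEFAULTS)
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0]  # strip comments
        # A line may hold several Key=Value pairs; parameter names are
        # case-insensitive in slurm.conf, so compare lowercased keys.
        for key, value in re.findall(r"(\w+)=(\S+)", line):
            if key.lower() in PORT_KEYS:
                ports[key.lower()] = value
    return ports


def firewalld_commands(role: str, ports: dict[str, str]) -> list[str]:
    """Return firewall-cmd invocations that open a role's inbound TCP ports."""
    cmds = []
    for key in ROLE_PORTS[role]:
        if key not in ports:
            # Only SrunPortRange can be missing, since the daemons have defaults.
            raise ValueError(f"{key} is not set; srun would use ephemeral ports")
        cmds.append(f"firewall-cmd --permanent --add-port={ports[key]}/tcp")
    cmds.append("firewall-cmd --reload")
    return cmds


if __name__ == "__main__":
    conf = "SlurmctldPort=6817\nSrunPortRange=60001-63000  # fixed srun ports\n"
    for role in ROLE_PORTS:
        print(role, firewalld_commands(role, parse_ports(conf)))
```

Opening the range only helps once SrunPortRange is actually set in the slurm.conf that srun reads, and the submit host's range must be reachable from the controller and the compute nodes across the wired/wireless boundary.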

/Ole


