[slurm-users] Which ports does slurm use?

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Mon Feb 10 10:41:09 UTC 2020


Hi Dean,

Blocking the ports in your network firewall (between the wired and Wi-Fi
segments) has the same effect as blocking them in the Linux firewall on
the nodes:  Slurm won't work unless you open the ports as specified in
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
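
In short, and only as a sketch: by default srun listens on ephemeral
ports, which is why the PORT value in your slurmd log changes on every
run.  The SrunPortRange parameter in slurm.conf confines those ports to
a fixed range, which the network firewall can then open from the compute
nodes towards the host where srun runs.  The range shown here is an
example I am assuming for illustration, not a requirement, so choose one
that suits your site:

```
# slurm.conf (keep it identical on controller, compute and submit hosts)
SlurmctldPort=6817          # default slurmctld port
SlurmdPort=6818             # default slurmd port
SrunPortRange=60001-63000   # example range (assumption), pick your own
```

The daemons must pick up the change (restarting them is the safe
choice), and the same range has to be allowed between the wired and
wireless sides of the network, in addition to 6817 and 6818.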

/Ole

On 2/8/20 1:26 AM, dean.w.schulze at gmail.com wrote:
> The firewalls are disabled on all nodes on my cluster so I don't think it is a firewall issue.  It's probably our network security between the wired part of our network and the wireless side.  When I put the nodes back on a wired controller they work again.
> 
> 
> -----Original Message-----
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Ole Holm Nielsen
> Sent: Friday, February 7, 2020 2:34 PM
> To: slurm-users at lists.schedmd.com
> Subject: Re: [slurm-users] Which ports does slurm use?
> 
> On 06-02-2020 22:40, Dean Schulze wrote:
>> I've moved two nodes to a different controller.  The nodes are wired
>> and the controller is networked via Wi-Fi.  I had to open up ports 6817
>> and 6818 between the wired and wireless sides of our network to get any
>> connectivity.
>>
>> Now when I do
>>
>> srun -N2 hostname
>>
>> the jobs show connection timeouts on the nodes:
>>
>> [2020-02-06T14:24:37.183] launch task 60.0 request from UID:1000 GID:1000 HOST:10.204.18.232 PORT:19602
>> [2020-02-06T14:24:37.183] lllp_distribution jobid [60] implicit auto binding: cores, dist 8192
>> [2020-02-06T14:24:37.183] _task_layout_lllp_cyclic
>> [2020-02-06T14:24:37.183] _lllp_generate_cpu_bind jobid [60]: mask_cpu, 0x0101
>> [2020-02-06T14:24:37.184] _run_prolog: run job script took usec=6
>> [2020-02-06T14:24:37.184] _run_prolog: prolog with lock for job 60 ran for 0 seconds
>> [2020-02-06T14:24:45.224] [60.0] error: connect io: Connection timed out
>> [2020-02-06T14:24:45.224] [60.0] error: IO setup failed: Connection timed out
>> [2020-02-06T14:24:45.225] [60.0] error: job_manager exiting abnormally, rc = 4021
>> [2020-02-06T14:24:59.538] [60.0] error: _send_launch_resp: Failed to send RESPONSE_LAUNCH_TASKS: Connection timed out
>> [2020-02-06T14:24:59.551] [60.0] done with job
>>
>> That node used port 19602 and the other node was using port 12496.
>> When I did the srun again the jobs showed two different ports on the
>> nodes (58040 and 32392).
>>
>> How can I configure a network if srun is going to grab different ports
>> each time?
> 
> Perhaps the information about firewall setup in my Wiki page can be of use:
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
> 
> /Ole


