[slurm-users] Which ports does slurm use?

Dean Schulze dean.w.schulze at gmail.com
Thu Feb 6 21:40:41 UTC 2020


I've moved two nodes to a different controller.  The nodes are on the wired
network and the controller is connected via Wi-Fi.  I had to open ports 6817
and 6818 between the wired and wireless sides of our network to get any
connectivity.

Now when I do

srun -N2 hostname

the jobs show connection timeouts on the nodes:

[2020-02-06T14:24:37.183] launch task 60.0 request from UID:1000 GID:1000 HOST:10.204.18.232 PORT:19602
[2020-02-06T14:24:37.183] lllp_distribution jobid [60] implicit auto binding: cores, dist 8192
[2020-02-06T14:24:37.183] _task_layout_lllp_cyclic
[2020-02-06T14:24:37.183] _lllp_generate_cpu_bind jobid [60]: mask_cpu, 0x0101
[2020-02-06T14:24:37.184] _run_prolog: run job script took usec=6
[2020-02-06T14:24:37.184] _run_prolog: prolog with lock for job 60 ran for 0 seconds
[2020-02-06T14:24:45.224] [60.0] error: connect io: Connection timed out
[2020-02-06T14:24:45.224] [60.0] error: IO setup failed: Connection timed out
[2020-02-06T14:24:45.225] [60.0] error: job_manager exiting abnormally, rc = 4021
[2020-02-06T14:24:59.538] [60.0] error: _send_launch_resp: Failed to send RESPONSE_LAUNCH_TASKS: Connection timed out
[2020-02-06T14:24:59.551] [60.0] done with job

That node used port 19602 and the other node was using port 12496.  When I
did the srun again the jobs showed two different ports on the nodes (58040
and 32392).
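
To compare the ports across runs, a small script along these lines could
pull the HOST and PORT values out of the node logs.  This is only a sketch:
the function name srun_ports is made up for illustration, and the pattern is
based solely on the "launch task" line shown above, so other Slurm versions
may log it differently.

```python
import re

# Pattern for the "launch task ... HOST:... PORT:..." line as it appears in
# the log excerpt above.  \s+ is used between fields so that lines which were
# hard-wrapped (e.g. by a mail client) still match.
LAUNCH_RE = re.compile(
    r"launch\s+task\s+(?P<step>\S+)\s+request\s+from\s+"
    r"UID:\d+\s+GID:\d+\s+HOST:(?P<host>\S+)\s+PORT:(?P<port>\d+)"
)


def srun_ports(log_text):
    """Return a list of (step, host, port) for each launch request found."""
    return [
        (m.group("step"), m.group("host"), int(m.group("port")))
        for m in LAUNCH_RE.finditer(log_text)
    ]
```

Reading each node's log file into a string and passing it to srun_ports()
gives one tuple per launch request, which makes it easy to see whether the
port really changes from one srun to the next.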

How can I configure a network if srun is going to grab different ports each
time?