In my case I tested the firewall. But I'm wondering whether the login nodes need to appear in slurm.conf, and whether slurmd needs to be running on the login nodes for them to be submit hosts? Either or both could be my issue.

On Fri, Mar 9, 2018 at 12:58 PM, Nicholas McCollum <nmccollum@asc.edu> wrote:

Connection refused makes me think it's a firewall issue.
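As for the slurm.conf question: as far as I know, a submit host only needs a matching copy of slurm.conf and a working MUNGE setup; slurmd does not have to run on it, and the login node does not need its own NodeName entry. A quick sanity check from the login node (a minimal sketch, assuming the standard Slurm client tools and MUNGE utilities are installed):

$ scontrol ping          (checks that this node can reach slurmctld)
$ munge -n | unmunge     (checks that a MUNGE credential round-trips locally)

If both succeed, authentication and controller connectivity from the login node should be fine.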

Assuming this is a test environment, could you try this on the compute node:

# iptables-save > iptables.bak
# iptables -F && iptables -X

Then test to see if it works. To restore the firewall, use:

# iptables-restore < iptables.bak

If you use firewalld instead, you may have to use:

# systemctl stop firewalld
# systemctl start firewalld
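If flushing the rules (or stopping firewalld) makes the job run, a less disruptive fix than disabling the firewall is to pin srun's callback traffic to a fixed port range in slurm.conf and open that range, along with the standard Slurm ports, on every node. A sketch, assuming the default slurmctld/slurmd ports 6817/6818 and an arbitrary example range:

SrunPortRange=60001-63000        (in slurm.conf, on all nodes)

# firewall-cmd --permanent --add-port=6817-6818/tcp
# firewall-cmd --permanent --add-port=60001-63000/tcp
# firewall-cmd --reload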

---

Nicholas McCollum - HPC Systems Expert
Alabama Supercomputer Authority - CSRA

On 03/09/2018 02:45 PM, Andy Georges wrote:
Hi all,

I cranked up the debug level a bit.

The job was not started when using:
vsc40075@test2802 (banette) ~> /bin/salloc -N1 -n1 /bin/srun --pty bash -i
salloc: Granted job allocation 42
salloc: Waiting for resource configuration
salloc: Nodes node2801 are ready for job

For comparison purposes, running this on the master (head?) node:

vsc40075@master23 () ~> /bin/salloc -N1 -n1 /bin/srun --pty bash -i
salloc: Granted job allocation 43
salloc: Waiting for resource configuration
salloc: Nodes node2801 are ready for job
vsc40075@node2801 () ~>
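
That difference suggests the compute node can open connections back to master23 but not back to test2802, which is exactly what srun's I/O setup needs. One way to test this outside of Slurm (a sketch; the port number is arbitrary and the availability of nc is an assumption):

on the login node:    $ nc -l 33698
on the compute node:  $ nc -vz test2802 33698

If the second command fails while the listener is running, something between the nodes, most likely a host firewall on the login node, is blocking the callback connections.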

Below is some more debug output from the hanging job.

Kind regards,
— Andy

[2018-03-09T21:27:52.251] [42.0] debug:  _oom_event_monitor: started.
[2018-03-09T21:27:52.251] [42.0] debug4: eio: handling events for 1 objects
[2018-03-09T21:27:52.251] [42.0] debug3: Called _msg_socket_readable
[2018-03-09T21:27:52.251] [42.0] debug4: eio: handling events for 1 objects
[2018-03-09T21:27:52.251] [42.0] debug3: Called _msg_socket_readable
[2018-03-09T21:27:52.251] [42.0] debug2: Entering _setup_normal_io
[2018-03-09T21:27:52.251] [42.0] debug:    stdin uses a pty object
[2018-03-09T21:27:52.251] [42.0] debug:  init pty size 23:119
[2018-03-09T21:27:52.251] [42.0] debug4: eio: handling events for 1 objects
[2018-03-09T21:27:52.251] [42.0] debug3: Called _msg_socket_readable
[2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection refused
[2018-03-09T21:27:52.252] [42.0] debug2: Error connecting slurm stream socket at 10.141.21.202:33698: Connection refused
[2018-03-09T21:27:52.252] [42.0] error: slurm_open_msg_conn(pty_conn) 10.141.21.202,33698: Connection refused
[2018-03-09T21:27:52.252] [42.0] debug4: adding IO connection (logical node rank 0)
[2018-03-09T21:27:52.252] [42.0] debug4: connecting IO back to 10.141.21.202:33759
[2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection refused
[2018-03-09T21:27:52.252] [42.0] debug3: Error connecting, picking new stream port
[2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection refused
[2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection refused
[2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection refused
[2018-03-09T21:27:52.252] [42.0] debug2: Error connecting slurm stream socket at 10.141.21.202:33759: Connection refused
[2018-03-09T21:27:52.252] [42.0] error: connect io: Connection refused
[2018-03-09T21:27:52.252] [42.0] debug2: Leaving  _setup_normal_io
[2018-03-09T21:27:52.253] [42.0] error: IO setup failed: Connection refused
[2018-03-09T21:27:52.253] [42.0] debug3: xcgroup_set_param: parameter 'freezer.state' set to 'THAWED' for '/sys/fs/cgroup/freezer/slurm/uid_2540075/job_42/step_0'
[2018-03-09T21:27:52.253] [42.0] debug3: xcgroup_set_uint32_param: parameter 'cgroup.procs' set to '6414' for '/sys/fs/cgroup/freezer'
[2018-03-09T21:27:52.253] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/freezer/slurm/uid_2540075/job_42): Device or resource busy
[2018-03-09T21:27:52.253] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/freezer/slurm/uid_2540075): Device or resource busy
[2018-03-09T21:27:52.253] [42.0] debug:  step_terminate_monitor_stop signaling condition
[2018-03-09T21:27:52.253] [42.0] debug4: eio: handling events for 1 objects
[2018-03-09T21:27:52.253] [42.0] debug3: Called _msg_socket_readable
[2018-03-09T21:27:52.253] [42.0] debug2: step_terminate_monitor will run for 60 secs
[2018-03-09T21:27:52.253] [42.0] debug2: step_terminate_monitor is stopping
[2018-03-09T21:27:52.253] [42.0] debug2: Sending SIGKILL to pgid 6414
[2018-03-09T21:27:52.253] [42.0] debug3: xcgroup_set_uint32_param: parameter 'cgroup.procs' set to '6414' for '/sys/fs/cgroup/cpuset'
[2018-03-09T21:27:52.265] [42.0] debug3: Took 1038 checks before stepd pid was removed from the step cgroup.
[2018-03-09T21:27:52.265] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuset/slurm/uid_2540075/job_42): Device or resource busy
[2018-03-09T21:27:52.265] [42.0] debug2: task/cgroup: not removing job cpuset : Device or resource busy
[2018-03-09T21:27:52.265] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuset/slurm/uid_2540075): Device or resource busy
[2018-03-09T21:27:52.265] [42.0] debug2: task/cgroup: not removing user cpuset : Device or resource busy
[2018-03-09T21:27:52.315] [42.0] debug3: _oom_event_monitor: res: 1
[2018-03-09T21:27:52.315] [42.0] _oom_event_monitor: oom-kill event count: 1
[2018-03-09T21:27:52.315] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm/uid_2540075/job_42): Device or resource busy
[2018-03-09T21:27:52.315] [42.0] debug2: task/cgroup: not removing job memcg : Device or resource busy
[2018-03-09T21:27:52.315] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm/uid_2540075): Device or resource busy
[2018-03-09T21:27:52.315] [42.0] debug2: task/cgroup: not removing user memcg : Device or resource busy
[2018-03-09T21:27:52.315] [42.0] debug2: Before call to spank_fini()
[2018-03-09T21:27:52.315] [42.0] debug2: After call to spank_fini()
[2018-03-09T21:27:52.315] [42.0] error: job_manager exiting abnormally, rc = 4021
[2018-03-09T21:27:52.315] [42.0] debug:  Sending launch resp rc=4021
[2018-03-09T21:27:52.315] [42.0] debug2: slurm_connect failed: Connection refused
[2018-03-09T21:27:52.315] [42.0] debug2: Error connecting slurm stream socket at 10.141.21.202:37053: Connection refused
[2018-03-09T21:27:52.315] [42.0] error: _send_launch_resp: Failed to send RESPONSE_LAUNCH_TASKS: Connection refused
[2018-03-09T21:27:52.315] [42.0] debug2: Rank 0 has no children slurmstepd
[2018-03-09T21:27:52.315] [42.0] debug2: _one_step_complete_msg: first=0, last=0
[2018-03-09T21:27:52.315] [42.0] debug3: Rank 0 sending complete to slurmctld, range 0 to 0
[2018-03-09T21:27:52.317] [42.0] debug4: eio: handling events for 1 objects
[2018-03-09T21:27:52.317] [42.0] debug3: Called _msg_socket_readable
[2018-03-09T21:27:52.317] [42.0] debug2:   false, shutdown
[2018-03-09T21:27:52.317] [42.0] debug:  Message thread exited
[2018-03-09T21:27:52.317] [42.0] done with job
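
Note that every refused connection in this log is slurmstepd on node2801 trying to connect back to 10.141.21.202 (the srun/salloc side) on ephemeral ports, so the firewall worth inspecting is the one on the login node rather than the compute node. "Connection refused" means a TCP RST came back, which usually points at a REJECT rule (or no listener on that port) rather than a silent DROP. A quick check on the login node (a sketch):

# iptables -L -n | grep -i REJECT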