[slurm-users] Problem launching interactive jobs using srun

Mark M plaktau at gmail.com
Fri Mar 9 14:10:24 MST 2018


In my case I tested the firewall. But I'm wondering whether the login nodes need to
appear in slurm.conf, and also whether slurmd needs to be running on the
login nodes in order for them to act as submit hosts. Either or both could be
my issue.
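
For what it's worth, my understanding (please correct me if this is wrong) is that a
submit-only login node just needs a copy of slurm.conf and a working munge setup; it
does not need slurmd running or a NodeName entry. A rough way to check both from the
login node would be something like the following, using node2801 from Andy's log
purely as an example compute node:

$ scontrol ping                      # can this host reach slurmctld with its slurm.conf?
$ munge -n | ssh node2801 unmunge    # does munge authentication work end to end?

If both of those succeed, the submit-host side of the configuration is probably fine.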

On Fri, Mar 9, 2018 at 12:58 PM, Nicholas McCollum <nmccollum at asc.edu>
wrote:

> Connection refused makes me think it's a firewall issue.
>
> Assuming this is a test environment, could you try on the compute node:
>
> # iptables-save > iptables.bak
> # iptables -F && iptables -X
>
> Then test to see if it works.  To restore the firewall use:
>
> # iptables-restore < iptables.bak
>
> If you use firewalld instead, you may have to use:
>
> # systemctl stop firewalld
>
> to disable it for the test, and
>
> # systemctl start firewalld
>
> to turn it back on afterwards.
>
> ---
>
> Nicholas McCollum - HPC Systems Expert
> Alabama Supercomputer Authority - CSRA
>
>
> On 03/09/2018 02:45 PM, Andy Georges wrote:
>
>> Hi all,
>>
>> Cranked up the debug level a bit
>>
>> Job was not started when using:
>>
>> vsc40075@test2802 (banette) ~> /bin/salloc -N1 -n1 /bin/srun --pty bash -i
>> salloc: Granted job allocation 42
>> salloc: Waiting for resource configuration
>> salloc: Nodes node2801 are ready for job
>>
>> For comparison purposes, running this on the master (head?) node:
>>
>> vsc40075@master23 () ~> /bin/salloc -N1 -n1 /bin/srun --pty bash -i
>> salloc: Granted job allocation 43
>> salloc: Waiting for resource configuration
>> salloc: Nodes node2801 are ready for job
>> vsc40075@node2801 () ~>
>>
>>
>> Below is some more debug output from the hanging job.
>>
>> Kind regards,
>> — Andy
>>
>> [2018-03-09T21:27:52.251] [42.0] debug:  _oom_event_monitor: started.
>> [2018-03-09T21:27:52.251] [42.0] debug4: eio: handling events for 1
>> objects
>> [2018-03-09T21:27:52.251] [42.0] debug3: Called _msg_socket_readable
>> [2018-03-09T21:27:52.251] [42.0] debug4: eio: handling events for 1
>> objects
>> [2018-03-09T21:27:52.251] [42.0] debug3: Called _msg_socket_readable
>> [2018-03-09T21:27:52.251] [42.0] debug2: Entering _setup_normal_io
>> [2018-03-09T21:27:52.251] [42.0] debug:    stdin uses a pty object
>> [2018-03-09T21:27:52.251] [42.0] debug:  init pty size 23:119
>> [2018-03-09T21:27:52.251] [42.0] debug4: eio: handling events for 1
>> objects
>> [2018-03-09T21:27:52.251] [42.0] debug3: Called _msg_socket_readable
>> [2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection
>> refused
>> [2018-03-09T21:27:52.252] [42.0] debug2: Error connecting slurm stream
>> socket at 10.141.21.202:33698: Connection refused
>> [2018-03-09T21:27:52.252] [42.0] error: slurm_open_msg_conn(pty_conn)
>> 10.141.21.202,33698: Connection refused
>> [2018-03-09T21:27:52.252] [42.0] debug4: adding IO connection (logical
>> node rank 0)
>> [2018-03-09T21:27:52.252] [42.0] debug4: connecting IO back to
>> 10.141.21.202:33759
>> [2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection
>> refused
>> [2018-03-09T21:27:52.252] [42.0] debug3: Error connecting, picking new
>> stream port
>> [2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection
>> refused
>> [2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection
>> refused
>> [2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection
>> refused
>> [2018-03-09T21:27:52.252] [42.0] debug2: Error connecting slurm stream
>> socket at 10.141.21.202:33759: Connection refused
>> [2018-03-09T21:27:52.252] [42.0] error: connect io: Connection refused
>> [2018-03-09T21:27:52.252] [42.0] debug2: Leaving  _setup_normal_io
>> [2018-03-09T21:27:52.253] [42.0] error: IO setup failed: Connection
>> refused
>> [2018-03-09T21:27:52.253] [42.0] debug3: xcgroup_set_param: parameter
>> 'freezer.state' set to 'THAWED' for '/sys/fs/cgroup/freezer/slurm/
>> uid_2540075/job_42/step_0'
>> [2018-03-09T21:27:52.253] [42.0] debug3: xcgroup_set_uint32_param:
>> parameter 'cgroup.procs' set to '6414' for '/sys/fs/cgroup/freezer'
>> [2018-03-09T21:27:52.253] [42.0] debug2: xcgroup_delete:
>> rmdir(/sys/fs/cgroup/freezer/slurm/uid_2540075/job_42): Device or
>> resource busy
>> [2018-03-09T21:27:52.253] [42.0] debug2: xcgroup_delete:
>> rmdir(/sys/fs/cgroup/freezer/slurm/uid_2540075): Device or resource busy
>> [2018-03-09T21:27:52.253] [42.0] debug:  step_terminate_monitor_stop
>> signaling condition
>> [2018-03-09T21:27:52.253] [42.0] debug4: eio: handling events for 1
>> objects
>> [2018-03-09T21:27:52.253] [42.0] debug3: Called _msg_socket_readable
>> [2018-03-09T21:27:52.253] [42.0] debug2: step_terminate_monitor will run
>> for 60 secs
>> [2018-03-09T21:27:52.253] [42.0] debug2: step_terminate_monitor is
>> stopping
>> [2018-03-09T21:27:52.253] [42.0] debug2: Sending SIGKILL to pgid 6414
>> [2018-03-09T21:27:52.253] [42.0] debug3: xcgroup_set_uint32_param:
>> parameter 'cgroup.procs' set to '6414' for '/sys/fs/cgroup/cpuset'
>> [2018-03-09T21:27:52.265] [42.0] debug3: Took 1038 checks before stepd
>> pid was removed from the step cgroup.
>> [2018-03-09T21:27:52.265] [42.0] debug2: xcgroup_delete:
>> rmdir(/sys/fs/cgroup/cpuset/slurm/uid_2540075/job_42): Device or
>> resource busy
>> [2018-03-09T21:27:52.265] [42.0] debug2: task/cgroup: not removing job
>> cpuset : Device or resource busy
>> [2018-03-09T21:27:52.265] [42.0] debug2: xcgroup_delete:
>> rmdir(/sys/fs/cgroup/cpuset/slurm/uid_2540075): Device or resource busy
>> [2018-03-09T21:27:52.265] [42.0] debug2: task/cgroup: not removing user
>> cpuset : Device or resource busy
>> [2018-03-09T21:27:52.315] [42.0] debug3: _oom_event_monitor: res: 1
>> [2018-03-09T21:27:52.315] [42.0] _oom_event_monitor: oom-kill event
>> count: 1
>> [2018-03-09T21:27:52.315] [42.0] debug2: xcgroup_delete:
>> rmdir(/sys/fs/cgroup/memory/slurm/uid_2540075/job_42): Device or
>> resource busy
>> [2018-03-09T21:27:52.315] [42.0] debug2: task/cgroup: not removing job
>> memcg : Device or resource busy
>> [2018-03-09T21:27:52.315] [42.0] debug2: xcgroup_delete:
>> rmdir(/sys/fs/cgroup/memory/slurm/uid_2540075): Device or resource busy
>> [2018-03-09T21:27:52.315] [42.0] debug2: task/cgroup: not removing user
>> memcg : Device or resource busy
>> [2018-03-09T21:27:52.315] [42.0] debug2: Before call to spank_fini()
>> [2018-03-09T21:27:52.315] [42.0] debug2: After call to spank_fini()
>> [2018-03-09T21:27:52.315] [42.0] error: job_manager exiting abnormally,
>> rc = 4021
>> [2018-03-09T21:27:52.315] [42.0] debug:  Sending launch resp rc=4021
>> [2018-03-09T21:27:52.315] [42.0] debug2: slurm_connect failed: Connection
>> refused
>> [2018-03-09T21:27:52.315] [42.0] debug2: Error connecting slurm stream
>> socket at 10.141.21.202:37053: Connection refused
>> [2018-03-09T21:27:52.315] [42.0] error: _send_launch_resp: Failed to send
>> RESPONSE_LAUNCH_TASKS: Connection refused
>> [2018-03-09T21:27:52.315] [42.0] debug2: Rank 0 has no children slurmstepd
>> [2018-03-09T21:27:52.315] [42.0] debug2: _one_step_complete_msg: first=0,
>> last=0
>> [2018-03-09T21:27:52.315] [42.0] debug3: Rank 0 sending complete to
>> slurmctld, range 0 to 0
>> [2018-03-09T21:27:52.317] [42.0] debug4: eio: handling events for 1
>> objects
>> [2018-03-09T21:27:52.317] [42.0] debug3: Called _msg_socket_readable
>> [2018-03-09T21:27:52.317] [42.0] debug2:   false, shutdown
>> [2018-03-09T21:27:52.317] [42.0] debug:  Message thread exited
>> [2018-03-09T21:27:52.317] [42.0] done with job
>>
>>
>