[slurm-users] Problem launching interactive jobs using srun
Michael Robbert
mrobbert at mines.edu
Fri Mar 9 11:04:02 MST 2018
I think that the piece you may be missing is --pty, but I also don't
think that salloc is necessary.
The simplest command that I typically use is:
srun -N1 -n1 --pty bash -i
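The same pattern extends with the usual srun resource options if you want to pin the session down; the partition name, time limit, and memory below are only placeholders, so substitute your site's values:

srun -p debug -t 00:30:00 --mem=1G -N1 -n1 --pty bash -i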
Mike
On 3/9/18 10:20 AM, Andy Georges wrote:
> Hi,
>
>
> I am trying to get interactive jobs to work from the machine we use as a login node, i.e., the machine the cluster's users log into and from which they typically submit jobs.
>
>
> I submit the job as follows:
>
> vsc40075@test2802 (banette) ~> /bin/salloc -N1 -n1 /bin/srun bash -i
> salloc: Granted job allocation 41
> salloc: Waiting for resource configuration
> salloc: Nodes node2801 are ready for job
>
> …
> hangs
>
>
> On node2801, the slurmd log has the following information:
>
>
> [2018-03-09T18:16:08.820] _run_prolog: run job script took usec=10379
> [2018-03-09T18:16:08.820] _run_prolog: prolog with lock for job 41 ran for 0 seconds
> [2018-03-09T18:16:08.829] [41.extern] task/cgroup: /slurm/uid_2540075/job_41: alloc=800MB mem.limit=800MB memsw.limit=880MB
> [2018-03-09T18:16:08.830] [41.extern] task/cgroup: /slurm/uid_2540075/job_41/step_extern: alloc=800MB mem.limit=800MB memsw.limit=880MB
> [2018-03-09T18:16:11.824] launch task 41.0 request from 2540075.2540075@10.141.21.202 (port 61928)
> [2018-03-09T18:16:11.824] lllp_distribution jobid [41] implicit auto binding: cores,one_thread, dist 1
> [2018-03-09T18:16:11.824] _task_layout_lllp_cyclic
> [2018-03-09T18:16:11.824] _lllp_generate_cpu_bind jobid [41]: mask_cpu,one_thread, 0x1
> [2018-03-09T18:16:11.834] [41.0] task/cgroup: /slurm/uid_2540075/job_41: alloc=800MB mem.limit=800MB memsw.limit=880MB
> [2018-03-09T18:16:11.834] [41.0] task/cgroup: /slurm/uid_2540075/job_41/step_0: alloc=800MB mem.limit=800MB memsw.limit=880MB
> [2018-03-09T18:16:11.836] [41.0] error: connect io: Connection refused
> [2018-03-09T18:16:11.836] [41.0] error: IO setup failed: Connection refused
> [2018-03-09T18:16:11.905] [41.0] _oom_event_monitor: oom-kill event count: 1
> [2018-03-09T18:16:11.905] [41.0] error: job_manager exiting abnormally, rc = 4021
> [2018-03-09T18:16:11.905] [41.0] error: _send_launch_resp: Failed to send RESPONSE_LAUNCH_TASKS: Connection refused
> [2018-03-09T18:16:11.907] [41.0] done with job
>
>
> We are running slurm 17.11.4.
>
>
> When I switch to the same user on the master node (running slurmctld) or on the worker nodes (running slurmd) and run the same command there, things work just fine. I would assume I do not need to run slurmd on the login node for this to work?
>
>
> Any pointers are appreciated,
> — Andy