<div dir="ltr">I'm having the same issue. The salloc command hangs on my login nodes but works fine on the head node. My default salloc command is:<div><br></div><div>SallocDefaultCommand="/usr/bin/srun -n1 -N1 --pty --preserve-env $SHELL"<br></div><div><br></div><div>I'm on OpenHPC's Slurm 17.02.9-69.2.</div><div><br></div><div>The log says the job is assigned, and then it eventually times out. I have tried running srun directly with various tweaks, but it hangs every time. Ctrl-C and Ctrl-Z have no effect either; the shell only returns after the job times out. I disabled the firewall on the login nodes, but that made no difference.</div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Mar 9, 2018 at 10:17 AM, Andy Georges <span dir="ltr"><<a href="mailto:Andy.Georges@ugent.be" target="_blank">Andy.Georges@ugent.be</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>

<br>

Adding --pty makes no difference. I do not get a prompt, and the logs on the node show an error. With --pty the error is somewhat different from the one without it, but the end result is the same.<br>

<br>

My main issue is that giving the same command on the machines running slurmd and slurmctld just works.<br>

<br>

As far as srun is concerned, that’s what is advised for an interactive job, no?<br>

<br>

— Andy.<br>

<br>


<div class="HOEnZb"><div class="h5"><br>

> On 9 Mar 2018, at 19:07, Michael Robbert <<a href="mailto:mrobbert@mines.edu">mrobbert@mines.edu</a>> wrote:<br>

><br>

> I think that the piece you may be missing is --pty, but I also don't think that salloc is necessary.<br>

><br>

> The simplest command that I typically use is:<br>

><br>

> srun -N1 -n1 --pty bash -i<br>

><br>

> Mike<br>

><br>

>> On 3/9/18 10:20 AM, Andy Georges wrote:<br>

>> Hi,<br>

>><br>

>><br>

>> I am trying to get interactive jobs to work from the machine we use as a login node, i.e., the machine that users of the cluster log into and from which they typically submit jobs.<br>

>><br>

>><br>

>> I submit the job as follows:<br>

>><br>

>> vsc40075@test2802 (banette) ~> /bin/salloc -N1 -n1 /bin/srun bash -i<br>

>> salloc: Granted job allocation 41<br>

>> salloc: Waiting for resource configuration<br>

>> salloc: Nodes node2801 are ready for job<br>

>><br>

>> …<br>

>> hangs<br>

>><br>

>><br>

>> On node2801, the slurmd log has the following information:<br>

>><br>

>><br>

>> [2018-03-09T18:16:08.820] _run_prolog: run job script took usec=10379<br>

>> [2018-03-09T18:16:08.820] _run_prolog: prolog with lock for job 41 ran for 0 seconds<br>

>> [2018-03-09T18:16:08.829] [41.extern] task/cgroup: /slurm/uid_2540075/job_41: alloc=800MB mem.limit=800MB memsw.limit=880MB<br>

>> [2018-03-09T18:16:08.830] [41.extern] task/cgroup: /slurm/uid_2540075/job_41/<wbr>step_extern: alloc=800MB mem.limit=800MB memsw.limit=880MB<br>

>> [2018-03-09T18:16:11.824] launch task 41.0 request from 2540075.2540075@10.141.21.202 (port 61928)<br>

>> [2018-03-09T18:16:11.824] lllp_distribution jobid [41] implicit auto binding: cores,one_thread, dist 1<br>

>> [2018-03-09T18:16:11.824] _task_layout_lllp_cyclic<br>

>> [2018-03-09T18:16:11.824] _lllp_generate_cpu_bind jobid [41]: mask_cpu,one_thread, 0x1<br>

>> [2018-03-09T18:16:11.834] [41.0] task/cgroup: /slurm/uid_2540075/job_41: alloc=800MB mem.limit=800MB memsw.limit=880MB<br>

>> [2018-03-09T18:16:11.834] [41.0] task/cgroup: /slurm/uid_2540075/job_41/<wbr>step_0: alloc=800MB mem.limit=800MB memsw.limit=880MB<br>

>> [2018-03-09T18:16:11.836] [41.0] error: connect io: Connection refused<br>

>> [2018-03-09T18:16:11.836] [41.0] error: IO setup failed: Connection refused<br>

>> [2018-03-09T18:16:11.905] [41.0] _oom_event_monitor: oom-kill event count: 1<br>

>> [2018-03-09T18:16:11.905] [41.0] error: job_manager exiting abnormally, rc = 4021<br>

>> [2018-03-09T18:16:11.905] [41.0] error: _send_launch_resp: Failed to send RESPONSE_LAUNCH_TASKS: Connection refused<br>

>> [2018-03-09T18:16:11.907] [41.0] done with job<br>

>><br>

>><br>

>> We are running slurm 17.11.4.<br>

>><br>

>><br>

>> When I change to the same user on both the master node (running slurmctld) and worker nodes (running slurmd), things work just fine. I would assume I need not run slurmd on the login node for this to work?<br>

>><br>

>><br>

>> Any pointers are appreciated,<br>

>> — Andy<br>

><br>

</div></div></blockquote></div><br></div>
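<div dir="ltr">The "connect io: Connection refused" lines in the quoted slurmd log suggest that the job step on the compute node could not open a TCP connection back to the submitting host. A minimal sketch for checking that direction of connectivity by hand, assuming Python 3 is available on the compute node; the function name <code>can_connect</code>, the host name and the port number are hypothetical and not part of Slurm:</div>

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for manual testing; it is not part of Slurm.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# Example call (placeholder host and port, adjust to your site):
# print(can_connect("login-node.example", 60001))
```

<div dir="ltr">To use it, start a listener on the login node on a port of your choosing and call the function from the compute node (node2801 in the quoted log). A successful check on one port does not prove that the ports srun actually listens on are reachable, so treat the result as a hint only.</div>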