[slurm-users] Problem launching interactive jobs using srun
Andy Georges
Andy.Georges at UGent.be
Fri Mar 9 11:17:41 MST 2018
Hi,
Adding --pty makes no difference: I still do not get a prompt, and the slurmd log on the node shows an error. With --pty the error is somewhat different from the one without it, but the end result is the same.
My main puzzle is that the same command works fine when issued on the machines running slurmd and slurmctld.
As far as srun is concerned, that is what is advised for an interactive job, no?
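For what it is worth, the "connect io: Connection refused" lines in the slurmd log below look as if slurmstepd on node2801 cannot open the stdio connection back to the srun process on the login node. A rough way to check that while the job is hanging (the commands below are only a sketch; the port is whatever srun happens to be listening on, since we do not set SrunPortRange in slurm.conf) would be:

ss -tlnp | grep srun          # on the login node: which ports is srun listening on?
nc -zv test2802 <port>        # from node2801: can one of those ports be reached?

If that connection is blocked, limiting srun to a fixed range with SrunPortRange and opening that range in the login node's firewall would presumably be the fix, but I have not verified that yet.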
— Andy.
Sent from my iPhone
> On 9 Mar 2018, at 19:07, Michael Robbert <mrobbert at mines.edu> wrote:
>
> I think that the piece you may be missing is --pty, but I also don't think that salloc is necessary.
>
> The simplest command that I typically use is:
>
> srun -N1 -n1 --pty bash -i
>
> Mike
>
>> On 3/9/18 10:20 AM, Andy Georges wrote:
>> Hi,
>>
>>
>> I am trying to get interactive jobs to work from the machine we use as a login node, i.e., the machine the cluster's users log into and from which they typically submit their jobs.
>>
>>
>> I submit the job as follows:
>>
>> vsc40075 at test2802 (banette) ~> /bin/salloc -N1 -n1 /bin/srun bash -i
>> salloc: Granted job allocation 41
>> salloc: Waiting for resource configuration
>> salloc: Nodes node2801 are ready for job
>>
>> …
>> hangs
>>
>>
>> On node2801, the slurmd log has the following information:
>>
>>
>> [2018-03-09T18:16:08.820] _run_prolog: run job script took usec=10379
>> [2018-03-09T18:16:08.820] _run_prolog: prolog with lock for job 41 ran for 0 seconds
>> [2018-03-09T18:16:08.829] [41.extern] task/cgroup: /slurm/uid_2540075/job_41: alloc=800MB mem.limit=800MB memsw.limit=880MB
>> [2018-03-09T18:16:08.830] [41.extern] task/cgroup: /slurm/uid_2540075/job_41/step_extern: alloc=800MB mem.limit=800MB memsw.limit=880MB
>> [2018-03-09T18:16:11.824] launch task 41.0 request from 2540075.2540075 at 10.141.21.202 (port 61928)
>> [2018-03-09T18:16:11.824] lllp_distribution jobid [41] implicit auto binding: cores,one_thread, dist 1
>> [2018-03-09T18:16:11.824] _task_layout_lllp_cyclic
>> [2018-03-09T18:16:11.824] _lllp_generate_cpu_bind jobid [41]: mask_cpu,one_thread, 0x1
>> [2018-03-09T18:16:11.834] [41.0] task/cgroup: /slurm/uid_2540075/job_41: alloc=800MB mem.limit=800MB memsw.limit=880MB
>> [2018-03-09T18:16:11.834] [41.0] task/cgroup: /slurm/uid_2540075/job_41/step_0: alloc=800MB mem.limit=800MB memsw.limit=880MB
>> [2018-03-09T18:16:11.836] [41.0] error: connect io: Connection refused
>> [2018-03-09T18:16:11.836] [41.0] error: IO setup failed: Connection refused
>> [2018-03-09T18:16:11.905] [41.0] _oom_event_monitor: oom-kill event count: 1
>> [2018-03-09T18:16:11.905] [41.0] error: job_manager exiting abnormally, rc = 4021
>> [2018-03-09T18:16:11.905] [41.0] error: _send_launch_resp: Failed to send RESPONSE_LAUNCH_TASKS: Connection refused
>> [2018-03-09T18:16:11.907] [41.0] done with job
>>
>>
>> We are running slurm 17.11.4.
>>
>>
>> When I switch to the same user on the master node (running slurmctld) or on a worker node (running slurmd) and run the same command, things work just fine. I assume I should not need to run slurmd on the login node for this to work?
>>
>>
>> Any pointers are appreciated,
>> — Andy
>
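PS. Regarding my question about slurmd on the login node: my understanding is that a submit host only needs slurm.conf, a working munge setup and network access to slurmctld; it does not need to run slurmd itself. A couple of quick checks from the login node that might help narrow this down (only a sketch, nothing here is specific to our configuration):

scontrol ping            # confirms the login node can reach slurmctld
srun -N1 -n1 hostname    # a non-interactive step, to see whether the problem is interactive-specific

If the plain hostname step fails with the same "Connection refused", that points at the connection from the compute node back to srun rather than at anything related to --pty or salloc.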