[slurm-users] 'srun hostname' hangs on the command line

Carlos Fenoy minibit at gmail.com
Tue Jul 17 04:54:46 MDT 2018


The communication from the compute nodes to the login nodes may be blocked by the firewall. That will prevent srun from running properly.
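
One way to test this is to try opening a TCP connection from the compute node back to the login node where srun was started. The sketch below is only an illustration, not a Slurm tool: `can_connect` is a hypothetical helper, and it assumes bash and the coreutils `timeout` command are available on the compute node.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Slurm): returns 0 if a TCP connection
# to host:port can be opened within 3 seconds, non-zero otherwise.
# Assumes bash (for /dev/tcp) and coreutils `timeout` are installed.
can_connect() {
    # Reject missing arguments with a distinct exit code.
    if [ -z "$1" ] || [ -z "$2" ]; then
        echo "usage: can_connect HOST PORT" >&2
        return 2
    fi
    # Open (and immediately drop) a connection in a child bash process;
    # a refused, filtered or timed-out connection gives a non-zero status.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}
```

For example, on the compute node one might run `can_connect login01 60001 && echo open || echo blocked`, where `login01` and `60001` are placeholders for your login node and for a port that srun is listening on while it hangs. If `SrunPortRange` is set in slurm.conf, srun should pick its listening ports from that range, so that is the range to allow through the firewall; `scontrol show config | grep -i SrunPortRange` shows the current value.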

> On 17 Jul 2018, at 10:16, John Hearns <hearnsj at googlemail.com> wrote:
> 
> Ronan, as far as I can see this means that you cannot launch a job.
> 
> What state are the compute nodes in when you run sinfo?
> 
> 
>> On 17 July 2018 at 10:08, Buckley, Ronan <Ronan.Buckley at dell.com> wrote:
>> Yes, srun just hangs. Commands like sinfo and squeue run fine.
>> 
>> I also have no slurm logs in /var/log ??
>> 
>>  
>> 
>> From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of John Hearns
>> Sent: Tuesday, July 17, 2018 8:57 AM
>> To: Slurm User Community List
>> Subject: Re: [slurm-users] 'srun hostname' hangs on the command line
>>  
>> 
>> Ronan, sorry to ask but this is a bit unclear.
>> 
>>  
>> 
>> Are you unable to launch ANY sessions with srun?
>> 
>> In which case you need to look at the logs to see why the job is not being scheduled.
>> 
>>  
>> 
>> Is it only the hostname command which fails?
>> 
>>  
>> 
>> I would guess that you have already ssh'd into a node and run the hostname command manually.
>> 
>> On 17 July 2018 at 09:50, Buckley, Ronan <Ronan.Buckley at dell.com> wrote:
>> 
>> Yes I do.
>> 
>>  
>> 
>> From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of Williams, Gareth (IM&T, Clayton)
>> Sent: Tuesday, July 17, 2018 12:33 AM
>> To: Slurm User Community List
>> Subject: Re: [slurm-users] 'srun hostname' hangs on the command line
>> 
>>  
>> 
>> Do you get the same problem as a non-root user?
>> 
>>  
>> 
>> From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of Buckley, Ronan
>> Sent: Tuesday, 17 July 2018 12:53 AM
>> To: slurm-users at lists.schedmd.com
>> Subject: [slurm-users] 'srun hostname' hangs on the command line
>> 
>>  
>> 
>> Hi All,
>> 
>>  
>> 
>> Verbose mode doesn’t show much.
>> 
>> I hashed out the hostnames.
>> 
>> Any ideas/suggestions?
>> 
>>  
>> 
>> # srun hostname
>> ^Csrun: interrupt (one more within 1 sec to abort)
>> srun: task 0: unknown
>> ^Z
>> [1]+  Stopped                 srun hostname
>> #
>> 
>>  
>> 
>> # srun -v hostname
>> srun: defined options for program `srun'
>> srun: --------------- ---------------------
>> srun: user           : `root'
>> srun: uid            : 0
>> srun: gid            : 0
>> srun: cwd            : /root
>> srun: ntasks         : 1 (default)
>> srun: nodes          : 1 (default)
>> srun: jobid          : 4294967294 (default)
>> srun: partition      : default
>> srun: profile        : `NotSet'
>> srun: job name       : `(null)'
>> srun: reservation    : `(null)'
>> srun: burst_buffer   : `(null)'
>> srun: wckey          : `(null)'
>> srun: cpu_freq_min   : 4294967294
>> srun: cpu_freq_max   : 4294967294
>> srun: cpu_freq_gov   : 4294967294
>> srun: switches       : -1
>> srun: wait-for-switches : -1
>> srun: distribution   : unknown
>> srun: cpu_bind       : default (0)
>> srun: mem_bind       : default (0)
>> srun: verbose        : 1
>> srun: slurmd_debug   : 0
>> srun: immediate      : false
>> srun: label output   : false
>> srun: unbuffered IO  : false
>> srun: overcommit     : false
>> srun: threads        : 60
>> srun: checkpoint_dir : /var/slurm/checkpoint
>> srun: wait           : 0
>> srun: nice           : -2
>> srun: account        : (null)
>> srun: comment        : (null)
>> srun: dependency     : (null)
>> srun: exclusive      : false
>> srun: bcast          : false
>> srun: qos            : (null)
>> srun: constraints    :
>> srun: geometry       : (null)
>> srun: reboot         : yes
>> srun: rotate         : no
>> srun: preserve_env   : false
>> srun: network        : (null)
>> srun: propagate      : NONE
>> srun: prolog         : (null)
>> srun: epilog         : (null)
>> srun: mail_type      : NONE
>> srun: mail_user      : (null)
>> srun: task_prolog    : (null)
>> srun: task_epilog    : (null)
>> srun: multi_prog     : no
>> srun: sockets-per-node  : -2
>> srun: cores-per-socket  : -2
>> srun: threads-per-core  : -2
>> srun: ntasks-per-node   : -2
>> srun: ntasks-per-socket : -2
>> srun: ntasks-per-core   : -2
>> srun: plane_size        : 4294967294
>> srun: core-spec         : NA
>> srun: power             :
>> srun: remote command    : `hostname'
>> srun: Waiting for nodes to boot (delay looping 450 times @ 0.100000 secs x index)
>> srun: Nodes ####### are ready for job
>> srun: jobid 50871: nodes(1):`#######', cpu counts: 64(x1)
>> srun: launching 50871.0 on host #######, 1 tasks: 0
>> srun: route default plugin loaded
>> srun: error: timeout waiting for task launch, started 0 of 1 tasks
>> srun: Job step 50871.0 aborted before step completely launched.
>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>> srun: error: Timed out waiting for job step to complete
>> #
>> 
>>  
>> 
>> Rgds
>> 
>>  
>> 
>>  
>> 
> 