<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">Hi Marcus</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">thanks for your contact. I'm new to
slurm deployment and I do not remember where I found this command
to check slurm setup. The SallocDefaultCommand is not defined in
my slurm.conf file</div>
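<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">For reference, my understanding is that defining it would be a single line in slurm.conf along these lines (I believe this is the example from the slurm.conf man page; the srun options below are only an illustration, not something I have set or tested on my cluster):</div>
<blockquote><font size="+1"><tt># slurm.conf (illustration only, not present in my configuration)</tt><br>
<tt>SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --mpi=none $SHELL"</tt></font></blockquote>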
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">That is strange for me is that it works
on the node hosting slurmctld, and on the compute node too.</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">On the compute node, connected as root
and then using "su - begou":</div>
<div class="moz-cite-prefix">
<blockquote><font size="+1"><tt>[root@tenibre-0-0 ~]# <b>su -
begou</b></tt><tt><br>
</tt><tt>Last login: Tue Nov 10 20:49:45 CET 2020 on pts/0</tt><tt><br>
</tt><tt>[begou@tenibre-0-0 ~]$ <b>sinfo</b></tt><tt><br>
</tt><tt>PARTITION AVAIL TIMELIMIT NODES STATE
NODELIST</tt><tt><br>
</tt><tt>equipment_typeC up infinite 1 idle
tenibre-0-0</tt><tt><br>
</tt><tt>all* up infinite 1 idle
tenibre-0-0</tt><tt><br>
</tt><tt>[begou@tenibre-0-0 ~]$ <b>squeue</b></tt><tt><br>
</tt><tt> JOBID PARTITION NAME USER
ST TIME NODES NODELIST(REASON)</tt><tt><br>
</tt><tt>[begou@tenibre-0-0 ~]$ </tt><tt><b> salloc -n 1 srun
hostname </b></tt><tt><br>
</tt><tt>salloc: Granted job allocation 45</tt><tt><br>
</tt><tt>tenibre-0-0</tt><tt><br>
</tt><tt>salloc: Relinquishing job allocation 45</tt><tt><br>
</tt><tt>[begou@tenibre-0-0 ~]$ </tt><tt><br>
</tt></font>
<div class="moz-cite-prefix"><br>
</div>
</blockquote>
<div class="moz-cite-prefix">On the management node, connected as
root and then using "su - begou" (with no home directory
available):</div>
<blockquote><font size="+1"><tt>[root@management1 ~]# <b>su -
begou</b></tt><br>
<tt>Creating home directory for begou.</tt><br>
<tt>Last login: Thu Nov 12 12:43:47 CET 2020 on pts/1</tt><br>
<tt>su: warning: cannot change directory to /HA/sources/begou:
No such file or directory</tt><br>
<tt>[begou@management1 root]$ <b>sinfo</b></tt><br>
<tt>PARTITION AVAIL TIMELIMIT NODES STATE NODELIST</tt><br>
<tt>equipment_typeC up infinite 1 idle tenibre-0-0</tt><br>
<tt>all* up infinite 1 idle tenibre-0-0</tt><br>
<tt>[begou@management1 root]$ <b>squeue</b></tt><br>
<tt> JOBID PARTITION NAME USER ST
TIME NODES NODELIST(REASON)</tt><br>
<tt>[begou@management1 root]$ <b>salloc -n 1 srun hostname </b></tt><br>
<tt>salloc: Granted job allocation 46</tt><br>
<tt>slurmstepd: error: couldn't chdir to `/root': Permission
denied: going to /tmp instead</tt><br>
<tt>tenibre-0-0</tt><br>
<tt>salloc: Relinquishing job allocation 46</tt><br>
<tt>[begou@management1 root]$ </tt></font><br>
</blockquote>
</div>
<div class="moz-cite-prefix">But not on the login node where I need
it....</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">Le 12/11/2020 à 14:05, Marcus Wagner a
écrit :<br>
</div>
<blockquote type="cite"
cite="mid:0b367e78-d043-4dd3-a4fd-38701f0b0fce@rwthex-w2-a.rwth-ad.de">
<br>
for me at least, this is running as expected.
<br>
<br>
I'm not sure why you use "sh" as the command for salloc; I have never
seen that before. If you do not provide a command, the user's
default shell will be started if "SallocDefaultCommand" is not
set within slurm.conf.
<br>
</blockquote>
<blockquote type="cite"
cite="mid:0b367e78-d043-4dd3-a4fd-38701f0b0fce@rwthex-w2-a.rwth-ad.de">So,
what does
<br>
$> salloc -n 1
<br>
$> srun hostname
<br>
</blockquote>
<b>This command hangs.</b><b><br>
</b>
<blockquote type="cite"
cite="mid:0b367e78-d043-4dd3-a4fd-38701f0b0fce@rwthex-w2-a.rwth-ad.de"><b>
</b><br>
and what does
<br>
$> salloc -n 1 srun hostname
<br>
<br>
</blockquote>
<b>This command hangs too</b> when run from the login node.<b><br>
</b>
<blockquote type="cite"
cite="mid:0b367e78-d043-4dd3-a4fd-38701f0b0fce@rwthex-w2-a.rwth-ad.de"><b>
</b><br>
Best
<br>
Marcus
<br>
<br>
<br>
P.S.:
<br>
<br>
Increasing debugging might also help, e.g.
<br>
<br>
$> srun -vvvvv hostname
<br>
<br>
</blockquote>
<p>Yes, I tried this but wasn't able to find pertinent information. <b>This
is what I get</b>:</p>
<p><br>
</p>
<blockquote>
<p><font size="+1"><tt>[begou@tenibre ~]$ </tt><tt><b>salloc -n 1
"srun -vvvvv hostname"</b></tt><tt><br>
</tt><tt>salloc: Granted job allocation 43</tt><tt><br>
</tt><tt>salloc: error: _fork_command: Unable to find command
"srun -vvvvv hostname"</tt><tt><br>
</tt><tt>salloc: Relinquishing job allocation 43</tt><tt><br>
</tt><tt>[begou@tenibre ~]$ salloc -n 1 srun -vvvvv hostname</tt><tt><br>
</tt><tt>salloc: Granted job allocation 44</tt><tt><br>
</tt><tt>srun: defined options</tt><tt><br>
</tt><tt>srun: -------------------- --------------------</tt><tt><br>
</tt><tt>srun: (null) : tenibre-0-0</tt><tt><br>
</tt><tt>srun: jobid : 44</tt><tt><br>
</tt><tt>srun: job-name : srun</tt><tt><br>
</tt><tt>srun: nodes : 1</tt><tt><br>
</tt><tt>srun: ntasks : 1</tt><tt><br>
</tt><tt>srun: verbose : 5</tt><tt><br>
</tt><tt>srun: -------------------- --------------------</tt><tt><br>
</tt><tt>srun: end of defined options</tt><tt><br>
</tt><tt>srun: debug: propagating
RLIMIT_CPU=18446744073709551615</tt><tt><br>
</tt><tt>srun: debug: propagating
RLIMIT_FSIZE=18446744073709551615</tt><tt><br>
</tt><tt>srun: debug: propagating
RLIMIT_DATA=18446744073709551615</tt><tt><br>
</tt><tt>srun: debug: propagating RLIMIT_STACK=8388608</tt><tt><br>
</tt><tt>srun: debug: propagating
RLIMIT_CORE=18446744073709551615</tt><tt><br>
</tt><tt>srun: debug: propagating
RLIMIT_RSS=18446744073709551615</tt><tt><br>
</tt><tt>srun: debug: propagating RLIMIT_NPROC=512946</tt><tt><br>
</tt><tt>srun: debug: propagating RLIMIT_NOFILE=1024</tt><tt><br>
</tt><tt>srun: debug: propagating RLIMIT_MEMLOCK=65536</tt><tt><br>
</tt><tt>srun: debug: propagating
RLIMIT_AS=18446744073709551615</tt><tt><br>
</tt><tt>srun: debug: propagating SLURM_PRIO_PROCESS=0</tt><tt><br>
</tt><tt>srun: debug: propagating UMASK=0022</tt><tt><br>
</tt><tt>srun: debug2: srun PMI messages to port=44969</tt><tt><br>
</tt><tt>srun: debug3: Trying to load plugin
/usr/lib64/slurm/auth_munge.so</tt><tt><br>
</tt><tt>srun: debug: Munge authentication plugin loaded</tt><tt><br>
</tt><tt>srun: debug3: Success.</tt><tt><br>
</tt><tt>srun: jobid 44: nodes(1):`tenibre-0-0', cpu counts:
1(x1)</tt><tt><br>
</tt><tt>srun: debug2: creating job with 1 tasks</tt><tt><br>
</tt><tt>srun: debug: requesting job 44, user 23455, nodes 1
including ((null))</tt><tt><br>
</tt><tt>srun: debug: cpus 1, tasks 1, name hostname,
relative 65534</tt><tt><br>
</tt><tt>srun: CpuBindType=(null type)</tt><tt><br>
</tt><tt>srun: debug: Entering slurm_step_launch</tt><tt><br>
</tt><tt>srun: debug: mpi type = (null)</tt><tt><br>
</tt><tt>srun: debug: Using mpi/none</tt><tt><br>
</tt><tt>srun: debug: Entering _msg_thr_create()</tt><tt><br>
</tt><tt>srun: debug4: eio: handling events for 2 objects</tt><tt><br>
</tt><tt>srun: debug3: eio_message_socket_readable: shutdown 0
fd 10</tt><tt><br>
</tt><tt>srun: debug3: eio_message_socket_readable: shutdown 0
fd 6</tt><tt><br>
</tt><tt>srun: debug: initialized stdio listening socket,
port 34531</tt><tt><br>
</tt><tt>srun: debug: Started IO server thread
(139644034881280)</tt><tt><br>
</tt><tt>srun: debug: Entering _launch_tasks</tt><tt><br>
</tt><tt>srun: debug3: IO thread pid = 1733164</tt><tt><br>
</tt><tt>srun: debug4: eio: handling events for 4 objects</tt><tt><br>
</tt><tt>srun: launching 44.0 on host tenibre-0-0, 1 tasks: 0</tt><tt><br>
</tt><tt>srun: debug3: uid:23455 gid:1036
cwd:/HA/sources/begou 0</tt><tt><br>
</tt><tt>srun: debug2: Called _file_readable</tt><tt><br>
</tt><tt>srun: debug3: false, all ioservers not yet
initialized</tt><tt><br>
</tt><tt>srun: debug2: Called _file_writable</tt><tt><br>
</tt><tt>srun: debug3: false</tt><tt><br>
</tt><tt>srun: debug3: Trying to load plugin
/usr/lib64/slurm/route_default.so</tt><tt><br>
</tt><tt>srun: debug3: eof is false</tt><tt><br>
</tt><tt>srun: debug2: Called _file_writable</tt><tt><br>
</tt><tt>srun: debug3: false</tt><tt><br>
</tt><tt>srun: debug3: eof is false</tt><tt><br>
</tt><tt>srun: debug3: Called _listening_socket_readable</tt><tt><br>
</tt><tt>srun: route default plugin loaded</tt><tt><br>
</tt><tt>srun: debug3: Success.</tt><tt><br>
</tt><tt>srun: debug2: Tree head got back 0 looking for 1</tt><tt><br>
</tt><tt>srun: debug3: Tree sending to tenibre-0-0</tt><tt><br>
</tt><tt>srun: debug4: orig_timeout was 20000 we have 0 steps
and a timeout of 20000</tt><tt><br>
</tt><tt>srun: debug2: Tree head got back 1</tt><tt><br>
</tt><tt>srun: debug: launch returned msg_rc=0 err=0
type=8001</tt><tt><br>
</tt><tt>srun: error: timeout waiting for task launch, started
0 of 1 tasks</tt><tt><br>
</tt><tt>srun: Job step 44.0 aborted before step completely
launched.</tt><tt><br>
</tt><tt>srun: Job step aborted: Waiting up to 32 seconds for
job step to finish.</tt><tt><br>
</tt><tt>srun: error: Timed out waiting for job step to
complete</tt><tt><br>
</tt><tt>srun: debug4: eio: handling events for 2 objects</tt><tt><br>
</tt><tt>srun: debug3: eio_message_socket_readable: shutdown 1
fd 10</tt><tt><br>
</tt><tt>srun: debug2: false, shutdown</tt><tt><br>
</tt><tt>srun: debug3: eio_message_socket_readable: shutdown 1
fd 6</tt><tt><br>
</tt><tt>srun: debug2: false, shutdown</tt><tt><br>
</tt><tt>srun: debug4: eio: handling events for 4 objects</tt><tt><br>
</tt><tt>srun: debug2: Called _file_readable</tt><tt><br>
</tt><tt>srun: debug3: false, shutdown</tt><tt><br>
</tt><tt>srun: debug2: Called _file_writable</tt><tt><br>
</tt><tt>srun: debug3: false</tt><tt><br>
</tt><tt>srun: debug3: eof is false</tt><tt><br>
</tt><tt>srun: debug2: Called _file_writable</tt><tt><br>
</tt><tt>srun: debug3: false</tt><tt><br>
</tt><tt>srun: debug3: eof is false</tt><tt><br>
</tt><tt>srun: debug3: Called _listening_socket_readable</tt><tt><br>
</tt><tt>srun: debug2: false, shutdown</tt><tt><br>
</tt><tt>srun: debug: IO thread exiting</tt><tt><br>
</tt><tt>salloc: Relinquishing job allocation 44</tt><tt><br>
</tt><tt>[begou@tenibre ~]$ </tt><tt><br>
</tt><tt><br>
</tt></font></p>
</blockquote>
<p>This problem looks really strange to me....</p>
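<p>One thing I still plan to check (just a guess on my side): whether the compute node can reach the stdio port that srun opens back on the login node while the job hangs. Something like the command below, run from tenibre-0-0, with 34531 replaced by the port reported on the "initialized stdio listening socket" line of that particular run:</p>
<blockquote>
<p><font size="+1"><tt># untested idea: test the reverse path (compute node -&gt; login node srun stdio port)</tt><br>
<tt>[root@tenibre-0-0 ~]# nc -v tenibre 34531</tt></font></p>
</blockquote>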
<p>Patrick<br>
</p>
<p><br>
</p>
<p><br>
</p>
<blockquote type="cite"
cite="mid:0b367e78-d043-4dd3-a4fd-38701f0b0fce@rwthex-w2-a.rwth-ad.de">Am
10.11.2020 um 11:54 schrieb Patrick Bégou:
<br>
<blockquote type="cite">Hi,
<br>
<br>
I'm new to slurm (as admin) and I need some help. Testing my
initial setup with:
<br>
<br>
[begou@tenibre ~]$ *salloc -n 1 sh*
<br>
salloc: Granted job allocation 11
<br>
sh-4.4$ *squeue*
<br>
JOBID PARTITION NAME USER ST
TIME NODES NODELIST(REASON)
<br>
*11 * all sh begou R 0:16 1 tenibre-0-0
<br>
sh-4.4$*srun /usr/bin/hostname*
<br>
srun: error: timeout waiting for task launch, started 0 of 1
tasks
<br>
srun: Job step 11.0 aborted before step completely launched.
<br>
srun: Job step aborted: Waiting up to 32 seconds for job
step to finish.
<br>
srun: error: Timed out waiting for job step to complete
<br>
<br>
I checked the connections:
<br>
<br>
*tenibre is the login node* (no daemon running)
<br>
<br>
nc -v tenibre-0-0 6818
<br>
nc -v management1 6817
<br>
<br>
*management1 is the management node* (slurmctld running)
<br>
<br>
nc -v tenibre-0-0 6818
<br>
<br>
*tenibre-0-0 is the first compute node* (slurmd running)
<br>
<br>
nc -v management1 6817
<br>
<br>
All tests return "/Ncat: Connected.../"
<br>
<br>
The command "id begou" works on all nodes and I can reach my
home directory on the login node and on the compute node.
<br>
<br>
On the compute node slurmd.log shows:
<br>
<br>
[2020-11-10T11:21:38.050]*launch task* *11.0 *request from
UID:23455 GID:1036 HOST:172.30.1.254 PORT:42220
<br>
[2020-11-10T11:21:38.050] debug: Checking credential with
508 bytes of sig data
<br>
[2020-11-10T11:21:38.050] _run_prolog: run job script took
usec=12
<br>
[2020-11-10T11:21:38.050] _run_prolog: prolog with lock for
job 11 ran for 0 seconds
<br>
[2020-11-10T11:21:38.053] debug: AcctGatherEnergy NONE
plugin loaded
<br>
[2020-11-10T11:21:38.053] debug: AcctGatherProfile NONE
plugin loaded
<br>
[2020-11-10T11:21:38.053] debug: AcctGatherInterconnect
NONE plugin loaded
<br>
[2020-11-10T11:21:38.053] debug: AcctGatherFilesystem NONE
plugin loaded
<br>
[2020-11-10T11:21:38.053] debug: switch NONE plugin loaded
<br>
[2020-11-10T11:21:38.054] [11.0] debug: Job accounting
gather NOT_INVOKED plugin loaded
<br>
[2020-11-10T11:21:38.054] [11.0] debug: Message thread
started pid = 12099
<br>
[2020-11-10T11:21:38.054] debug:
task_p_slurmd_reserve_resources: 11 0
<br>
[2020-11-10T11:21:38.068] [11.0] debug: task NONE plugin
loaded
<br>
[2020-11-10T11:21:38.068] [11.0] debug: Checkpoint plugin
loaded: checkpoint/none
<br>
[2020-11-10T11:21:38.068] [11.0] Munge credential signature
plugin loaded
<br>
[2020-11-10T11:21:38.068] [11.0] debug: job_container none
plugin loaded
<br>
[2020-11-10T11:21:38.068] [11.0] debug: mpi type = pmi2
<br>
[2020-11-10T11:21:38.068] [11.0] debug:
xcgroup_instantiate: cgroup '/sys/fs/cgroup/freezer/slurm'
already exists
<br>
[2020-11-10T11:21:38.068] [11.0] debug: spank: opening
plugin stack /etc/slurm/plugstack.conf
<br>
[2020-11-10T11:21:38.068] [11.0] debug: mpi type = (null)
<br>
[2020-11-10T11:21:38.068] [11.0] debug: using mpi/pmi2
<br>
[2020-11-10T11:21:38.068] [11.0] debug:
_setup_stepd_job_info: SLURM_STEP_RESV_PORTS not found in env
<br>
[2020-11-10T11:21:38.068] [11.0] debug: mpi/pmi2: setup
sockets
<br>
[2020-11-10T11:21:38.069] [11.0] debug: mpi/pmi2: started
agent thread
<br>
[2020-11-10T11:21:38.069] [11.0]*error: connect io: No route
to host*
<br>
[2020-11-10T11:21:38.069] [11.0] error: IO setup failed: No
route to host
<br>
[2020-11-10T11:21:38.069] [11.0] debug:
step_terminate_monitor_stop signaling condition
<br>
[2020-11-10T11:21:38.069] [11.0] error: job_manager exiting
abnormally, rc = 4021
<br>
[2020-11-10T11:21:38.069] [11.0] debug: Sending launch resp
rc=4021
<br>
[2020-11-10T11:21:38.069] [11.0] debug:
_send_srun_resp_msg: 0/5 *failed to send msg type 6002: No route
to host*
<br>
[2020-11-10T11:21:38.169] [11.0] debug:
_send_srun_resp_msg: 1/5 failed to send msg type 6002: No route
to host
<br>
[2020-11-10T11:21:38.370] [11.0] debug:
_send_srun_resp_msg: 2/5 failed to send msg type 6002: No route
to host
<br>
[2020-11-10T11:21:38.770] [11.0] debug:
_send_srun_resp_msg: 3/5 failed to send msg type 6002: No route
to host
<br>
[2020-11-10T11:21:39.570] [11.0] debug:
_send_srun_resp_msg: 4/5 failed to send msg type 6002: No route
to host
<br>
[2020-11-10T11:21:40.370] [11.0] debug:
_send_srun_resp_msg: 5/5 failed to send msg type 6002: No route
to host
<br>
[2020-11-10T11:21:40.372] [11.0] debug: Message thread
exited
<br>
[2020-11-10T11:21:40.372] [11.0] debug: mpi/pmi2: agent
thread exit
<br>
[2020-11-10T11:21:40.372] [11.0] *done with job*
<br>
<br>
<br>
But I do not understand what this "No route to host" means.
<br>
<br>
<br>
Thanks for your help.
<br>
<br>
Patrick
<br>
<br>
<br>
</blockquote>
<br>
</blockquote>
<p><br>
</p>
</body>
</html>