<html><head></head><body><div class="yahoo-style-wrap" style="font-family:Helvetica Neue, Helvetica, Arial, sans-serif;font-size:13px;"><div dir="ltr" data-setdir="false"><div><div style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;" dir="ltr">Hi all,</div><div style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;" dir="ltr"><br></div><div style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;" dir="ltr">I'm
 facing the following issue with a DGX A100 machine: I'm able to 
allocate resources, but the job fails when I try to execute srun. A
 detailed analysis of the incident follows:<br></div><div style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;" dir="ltr"><div><p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;"><br></p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">```</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">$ salloc -n1 -N1 -p DEBUG -w dgx001 --time=2:0:0</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">salloc: Granted job allocation 1278</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">salloc: Waiting for resource configuration</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">salloc: Nodes dgx001 are ready for job</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">$ srun hostname</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">srun: error: slurm_receive_msgs: [[dgx001.hpc]:6818] failed: Socket timed out on send/recv operation</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">srun: error: Task launch for StepId=1278.0 failed on node dgx001: Socket timed out on send/recv operation</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">srun: error: Application launch failed: Socket timed out on send/recv operation</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">srun: Job step aborted</p>
<div style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">```</div><div style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;"><br></div></div></div><div style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">The DGX Slurm daemon version is:</div><div style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;"><br></div>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">```</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">$ slurmd -V</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">slurm 22.05.8</p>
<div style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">```</div><div style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;"><br></div><div>The OS is:<br><p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">```</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">$ uname -a</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">Linux dgx001.hpc 5.4.0-137-generic #154-Ubuntu SMP Thu Jan 5 17:03:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">$ lsb_release -a</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">No LSB modules are available.</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">Distributor ID:   Ubuntu</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">Description:      Ubuntu 20.04.5 LTS</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">Release:  20.04</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">Codename: focal</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;"><br></p>
<div style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">```</div>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;"><br></p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">cgroup/v2 is enabled as follows:</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">```</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">$ cat /etc/default/grub | grep cgroup</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=1 cgroup_enable=memory swapaccount=1"</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">```</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;"><br></p></div><p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;"><br></p>
<div style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;" dir="ltr" data-setdir="false">Even though cgroup/v2 is in use, the daemon status still shows a `slurmstepd` process inside `slurmd.service` (the process <span>2250748 <span>slurmstepd</span> does not appear under the slurmd service on the other machines)</span><br></div>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">```</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">$ systemctl status slurmd </p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">● slurmd.service - Slurm node daemon</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">     Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">    Drop-In: /etc/systemd/system/slurmd.service.d</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">             └─override.conf</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">     Active: active (running) since Fri 2023-02-10 14:14:21 CET; 20min ago</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">   Main PID: 2250012 (slurmd)</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">      Tasks: 5</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">     Memory: 10.9M</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">        CPU: 105ms</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">     CGroup: /system.slice/slurmd.service</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">             ├─2250012 /usr/local/sbin/slurmd -D -s -f /var/spool/slurm/d/conf-cache/slurm.conf -vvvvvv</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">             └─2250748 /usr/local/sbin/slurmstepd</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;"><br></p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">```</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;"><br></p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">The expected job is also spawned in `slurmstepd.scope`:</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;"><br></p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">```</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">$ systemctl status slurmstepd.scope</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">● slurmstepd.scope</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">     Loaded: loaded (/run/systemd/transient/slurmstepd.scope; transient)</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">  Transient: yes</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">     Active: active (abandoned) since Fri 2023-02-10 14:14:21 CET; 22min ago</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">      Tasks: 5</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">     Memory: 1.4M</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">        CPU: 28ms</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">     CGroup: /system.slice/slurmstepd.scope</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">             ├─job_1278</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">             │ └─step_extern</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">             │   ├─slurm</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">             │   │ └─2250609 slurmstepd: [1278.extern]</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">             │   └─user</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">             │     └─task_special</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">             │       └─2250619 sleep 100000000</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">             └─system</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">               └─2250024 /usr/local/sbin/slurmstepd infinity</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;"><br></p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">feb 10 14:14:21 dgx001.hpc systemd[1]: Started slurmstepd.scope.</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">```</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;"><br></p>
<div style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;" dir="ltr">The slurm.conf file works without problems on the other machines and has also been tested. </div><div style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;" dir="ltr"><br></div><div style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;" dir="ltr">The slurmd service output follows:<br></div><br><p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">```</p>
<div style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;" dir="ltr">$<span> journalctl -u slurmd</span><br></div>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">feb 10 14:14:57 dgx001.hpc slurmd[2250012]: slurmd: debug:  Waiting for job 1278's prolog to complete</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">feb 10 14:14:57 dgx001.hpc slurmd[2250012]: slurmd: debug:  Finished wait for job 1278's prolog to complete</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">feb 10 14:14:57 dgx001.hpc slurmd[2250012]: slurmd: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">feb 10 14:14:57 dgx001.hpc slurmd[2250012]: slurmd: debug3: slurmstepd rank 0 (dgx001), parent rank -1 (NONE), children 0, depth 0, max_depth 0</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">feb 10 14:14:57 dgx001.hpc slurmd[2250012]: slurmd: debug3: PLUGIN IDX</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">feb 10 14:14:57 dgx001.hpc slurmd[2250012]: slurmd: debug3: MPI CONF SEND</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">feb 10 14:14:57 dgx001.hpc slurmd[2250012]: slurmd: error: _send_slurmstepd_init failed</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">feb 10 14:16:31 dgx001.hpc slurmd[2250012]: slurmd: debug3: in the service_connection</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">feb 10 14:16:31 dgx001.hpc slurmd[2250012]: slurmd: debug2: Start processing RPC: REQUEST_TERMINATE_JOB</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">feb 10 14:16:31 dgx001.hpc slurmd[2250012]: slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">feb 10 14:16:31 dgx001.hpc slurmd[2250012]: slurmd: debug:  _rpc_terminate_job: uid = 3000 JobId=1278</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">feb 10 14:16:31 dgx001.hpc slurmd[2250012]: slurmd: debug3: state for jobid 998: ctime:1675770987 revoked:0 expires:2147483647</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">feb 10 14:16:31 dgx001.hpc slurmd[2250012]: slurmd: debug3: state for jobid 1184: ctime:1675953198 revoked:0 expires:2147483647</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">feb 10 14:16:31 dgx001.hpc slurmd[2250012]: slurmd: debug3: state for jobid 1217: ctime:1675967394 revoked:0 expires:2147483647</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">feb 10 14:16:31 dgx001.hpc slurmd[2250012]: slurmd: debug3: state for jobid 1278: ctime:1676034890 revoked:0 expires:2147483647</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">feb 10 14:16:31 dgx001.hpc slurmd[2250012]: slurmd: debug:  credential for job 1278 revoked</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">feb 10 14:16:31 dgx001.hpc slurmd[2250012]: slurmd: debug:  sent SUCCESS, waiting for step to start</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">feb 10 14:16:31 dgx001.hpc slurmd[2250012]: slurmd: debug:  Blocked waiting for JobId=1278, all steps</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;"><br></p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">```</p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;"><br></p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">The function that fails is `_send_slurmstepd_init`, at `req.c:634`:</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">```</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">if (mpi_conf_send_stepd(fd, job->mpi_plugin_id) !=</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">    SLURM_SUCCESS){</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">        debug3("MPI CONF SEND");</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">        goto rwfail;</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">}</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">```</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;"><br></p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">`mpi_conf_send_stepd` fails at `slurm_mpi.c:635`:</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">```</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">if ((index = _plugin_idx(plugin_id)) &lt; 0) {</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">        debug3("PLUGIN IDX");</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">        goto rwfail;</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">}</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">```</p>
<br><p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">Configure settings:</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">```</p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">./configure
 --prefix=/usr/local --libdir=/usr/lib64   --enable-pam  
--enable-really-no-cray --enable-shared   --enable-x11  --disable-static
    --disable-salloc-background  --disable-partial_attach  
--with-oneapi=no --with-shared-libslurm  --without-rpath --with-munge 
--enable-developer </p>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;">```</p><br><div style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;" dir="ltr">I'm
 sorry for the hyper-detailed mail, but I have no idea how to deal with 
the issue, so I hope all these details will be useful in solving it. <br></div><div style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;" dir="ltr"><br></div><div style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;" dir="ltr">Thanks in advance,</div><div style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;" dir="ltr"><br></div><div style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px;" dir="ltr">Niccolo<br></div>
<br><br></div><div><br></div></div></div></body></html>