[slurm-users] nodes going to down* and getting stuck in that state
Brian Andrus
toomuchit at gmail.com
Fri Jun 4 16:49:38 UTC 2021
Oh, also ensure that DNS is working properly on the node. It could be
that it isn't able to map the master's name to its IP.
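
A quick way to check that from the node (just a sketch; <controller> stands
in for whatever SlurmctldHost is set to in your slurm.conf):

  # which controller name slurmd will use
  scontrol show config | grep -i SlurmctldHost

  # and whether the node can resolve it
  getent hosts <controller>

If that lookup fails or returns a stale address, slurmd can't report in,
which could explain the controller marking the node not responding.
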
Brian Andrus
On 6/4/2021 9:31 AM, Herc Silverstein wrote:
> Hi,
>
> The slurmctld.log shows (for this node):
>
> ...
>
> [2021-05-25T00:12:27.481] sched: Allocate JobId=3402729
> NodeList=gpu-t4-4x-ondemand-44 #CPUs=1 Partition=gpu-t4-4x-ondemand
> [2021-05-25T00:12:27.482] sched: Allocate JobId=3402730
> NodeList=gpu-t4-4x-ondemand-44 #CPUs=1 Partition=gpu-t4-4x-ondemand
> [2021-05-25T00:12:27.482] sched: Allocate JobId=3402731
> NodeList=gpu-t4-4x-ondemand-44 #CPUs=1 Partition=gpu-t4-4x-ondemand
> [2021-05-25T00:15:29.043] error: Nodes gpu-t4-4x-ondemand-44 not
> responding
>
>
> sinfo -R initially doesn't show it as problematic, though I see it go
> into:
>
> gpu-t4-4x-ondemand up infinite 1 comp* gpu-t4-4x-ondemand-44
>
> However, the node where slurmctld is running knows about it:
>
> host gpu-t4-4x-ondemand-44
> gpu-t4-4x-ondemand-44.virtual-cluster.local has address 10.4.64.11
>
> and I can log in to the node:
>
> # systemctl status slurmd
> ● slurmd.service - Slurm node daemon
> Loaded: loaded (/usr/lib/systemd/system/slurmd.service; disabled;
> vendor preset: disabled)
> Active: active (running) since Tue 2021-05-25 00:12:24 UTC; 48s ago
> Process: 1874 ExecStart=/opt/slurm/sbin/slurmd $SLURMD_OPTIONS
> (code=exited, status=0/SUCCESS)
> Main PID: 1876 (slurmd)
> Tasks: 1
> Memory: 11.6M
> CGroup: /system.slice/slurmd.service
> └─1876 /opt/slurm/sbin/slurmd -f /etc/slurm/slurm.conf
>
> May 25 00:12:23 gpu-t4-4x-ondemand-44.virtual-cluster.local
> systemd[1]: Starting Slurm node daemon...
> May 25 00:12:24 gpu-t4-4x-ondemand-44.virtual-cluster.local
> systemd[1]: Can't open PID file /var/run/slurmd.pid (yet?)...ory
> May 25 00:12:24 gpu-t4-4x-ondemand-44.virtual-cluster.local
> systemd[1]: Started Slurm node daemon.
> Hint: Some lines were ellipsized, use -l to show in full.
>
> later:
>
> sinfo:
>
> gpu-t4-4x-ondemand up infinite 1 idle* gpu-t4-4x-ondemand-44
>
> root@service(eigen2):log# sinfo -R
> REASON USER TIMESTAMP NODELIST
> Not responding slurm 2021-05-25T00:45:40 gpu-t4-4x-ondemand-44
>
> and slurmctld.log:
>
> [2021-05-25T00:15:29.043] error: Nodes gpu-t4-4x-ondemand-44 not
> responding
> [2021-05-25T00:19:16.397] error: Nodes gpu-t4-4x-ondemand-44 not
> responding, setting DOWN
> [2021-05-25T00:20:02.092] powering down node gpu-t4-4x-ondemand-44
> [2021-05-25T00:20:08.438] error: Nodes gpu-t4-4x-ondemand-44 not
> responding, setting DOWN
> [2021-05-25T00:25:02.931] powering down node gpu-t4-4x-ondemand-44
> [2021-05-25T00:25:04.903] error: Nodes gpu-t4-4x-ondemand-44 not
> responding, setting DOWN
> [2021-05-25T00:30:01.247] powering down node gpu-t4-4x-ondemand-44
> [2021-05-25T00:31:21.479] error: Nodes gpu-t4-4x-ondemand-44 not
> responding, setting DOWN
> [2021-05-25T00:35:01.359] powering down node gpu-t4-4x-ondemand-44
> [2021-05-25T00:35:41.756] error: Nodes gpu-t4-4x-ondemand-44 not
> responding, setting DOWN
> [2021-05-25T00:40:01.671] powering down node gpu-t4-4x-ondemand-44
> [2021-05-25T00:40:41.225] error: Nodes gpu-t4-4x-ondemand-44 not
> responding, setting DOWN
> [2021-05-25T00:45:01.430] powering down node gpu-t4-4x-ondemand-44
> [2021-05-25T00:45:40.071] error: Nodes gpu-t4-4x-ondemand-44 not
> responding, setting DOWN
>
>
> This makes sense given what it thinks the state is. However, it's
> unclear why it considers the node not responding, given that slurmd is
> running and that the node can be logged into.
>
> Herc
>
>
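
One more thought on the "not responding" part: the controller decides that
based on whether it can reach slurmd over the node's SlurmdPort, so slurmd
being up locally (and ssh working) doesn't rule out a network problem in
the controller-to-node direction. Roughly, from the controller (6818 is
just the default SlurmdPort; adjust to whatever your config shows):

  # confirm the port slurmd listens on
  scontrol show config | grep -i SlurmdPort

  # can the controller reach slurmd on that port?
  nc -zv gpu-t4-4x-ondemand-44 6818

  # state and reason as the controller currently sees them
  scontrol show node gpu-t4-4x-ondemand-44

  # once it's reachable again, clear the down* state
  scontrol update NodeName=gpu-t4-4x-ondemand-44 State=RESUME

If the TCP check fails even though ssh works, a firewall or security-group
rule between the controller and the node is a likely culprit.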