[slurm-users] [External] Re: nodes going to down* and getting stuck in that state

Thu Jun 17 19:35:07 UTC 2021

Did you ever get this resolved? If so, what was the issue? I see this 
error:

> Can't open PID file /var/run/slurmd.pid (yet?)...ory

I know systemctl shows slurmd running, but I've had some issues with 
'systemctl status' and always like to confirm a daemon is running with 
'ps'.
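
For what it's worth, here is a rough sketch of the kind of check I mean. The helper name pids_of is made up for illustration, and it assumes a ps that accepts "-e -o pid=,comm=" (procps does); adjust for your system.

```shell
# Print the PIDs of processes whose command name is exactly "$1".
# Reads "PID COMMAND" pairs on stdin, in the format produced by
# `ps -e -o pid=,comm=`. (pids_of is a hypothetical helper name.)
pids_of() {
    awk -v name="$1" '$2 == name { print $1 }'
}
```

On the compute node you would run "ps -e -o pid=,comm= | pids_of slurmd". Empty output means no slurmd process exists, whatever 'systemctl status' reports; otherwise the PID printed should match the "Main PID" that systemctl shows.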

Prentice

On 6/4/21 12:31 PM, Herc Silverstein wrote:
> Hi,
>
> The slurmctld.log shows (for this node):
>
> ...
>
> [2021-05-25T00:12:27.481] sched: Allocate JobId=3402729 
> NodeList=gpu-t4-4x-ondemand-44 #CPUs=1 Partition=gpu-t4-4x-ondemand
> [2021-05-25T00:12:27.482] sched: Allocate JobId=3402730 
> NodeList=gpu-t4-4x-ondemand-44 #CPUs=1 Partition=gpu-t4-4x-ondemand
> [2021-05-25T00:12:27.482] sched: Allocate JobId=3402731 
> NodeList=gpu-t4-4x-ondemand-44 #CPUs=1 Partition=gpu-t4-4x-ondemand
> [2021-05-25T00:15:29.043] error: Nodes gpu-t4-4x-ondemand-44 not 
> responding
>
>
> sinfo -R initially doesn't show it as problematic. Though I see it 
> goes into:
>
> gpu-t4-4x-ondemand                 up infinite      1  comp* 
> gpu-t4-4x-ondemand-44
>
> However, the node where slurmctld is running knows about it:
>
>  host gpu-t4-4x-ondemand-44
> gpu-t4-4x-ondemand-44.virtual-cluster.local has address 10.4.64.11
>
> and I can log in to the node:
>
> # systemctl status slurmd
> ● slurmd.service - Slurm node daemon
>    Loaded: loaded (/usr/lib/systemd/system/slurmd.service; disabled; 
> vendor preset: disabled)
>    Active: active (running) since Tue 2021-05-25 00:12:24 UTC; 48s ago
>   Process: 1874 ExecStart=/opt/slurm/sbin/slurmd $SLURMD_OPTIONS 
> (code=exited, status=0/SUCCESS)
>  Main PID: 1876 (slurmd)
>     Tasks: 1
>    Memory: 11.6M
>    CGroup: /system.slice/slurmd.service
>            └─1876 /opt/slurm/sbin/slurmd -f 
> /etc/slurm/slurm.conf
>
> May 25 00:12:23 gpu-t4-4x-ondemand-44.virtual-cluster.local 
> systemd[1]: Starting Slurm node daemon...
> May 25 00:12:24 gpu-t4-4x-ondemand-44.virtual-cluster.local 
> systemd[1]: Can't open PID file /var/run/slurmd.pid (yet?)...ory
> May 25 00:12:24 gpu-t4-4x-ondemand-44.virtual-cluster.local 
> systemd[1]: Started Slurm node daemon.
> Hint: Some lines were ellipsized, use -l to show in full.
>
> later:
>
> sinfo:
>
> gpu-t4-4x-ondemand                 up infinite      1  idle* 
> gpu-t4-4x-ondemand-44
>
> root@service(eigen2):log# sinfo -R
> REASON               USER      TIMESTAMP           NODELIST
> Not responding       slurm     2021-05-25T00:45:40 gpu-t4-4x-ondemand-44
>
> and slurmctld.log:
>
> [2021-05-25T00:15:29.043] error: Nodes gpu-t4-4x-ondemand-44 not 
> responding
> [2021-05-25T00:19:16.397] error: Nodes gpu-t4-4x-ondemand-44 not 
> responding, setting DOWN
> [2021-05-25T00:20:02.092] powering down node gpu-t4-4x-ondemand-44
> [2021-05-25T00:20:08.438] error: Nodes gpu-t4-4x-ondemand-44 not 
> responding, setting DOWN
> [2021-05-25T00:25:02.931] powering down node gpu-t4-4x-ondemand-44
> [2021-05-25T00:25:04.903] error: Nodes gpu-t4-4x-ondemand-44 not 
> responding, setting DOWN
> [2021-05-25T00:30:01.247] powering down node gpu-t4-4x-ondemand-44
> [2021-05-25T00:31:21.479] error: Nodes gpu-t4-4x-ondemand-44 not 
> responding, setting DOWN
> [2021-05-25T00:35:01.359] powering down node gpu-t4-4x-ondemand-44
> [2021-05-25T00:35:41.756] error: Nodes gpu-t4-4x-ondemand-44 not 
> responding, setting DOWN
> [2021-05-25T00:40:01.671] powering down node gpu-t4-4x-ondemand-44
> [2021-05-25T00:40:41.225] error: Nodes gpu-t4-4x-ondemand-44 not 
> responding, setting DOWN
> [2021-05-25T00:45:01.430] powering down node gpu-t4-4x-ondemand-44
> [2021-05-25T00:45:40.071] error: Nodes gpu-t4-4x-ondemand-44 not 
> responding, setting DOWN
>
>
> This makes sense given what it thinks the state is.  However, it's 
> unclear why it thinks it's non-responding given that slurmd is running 
> and that it can be logged into.
>
> Herc
>
>