[slurm-users] nodes going to down* and getting stuck in that state
Brian Andrus
toomuchit at gmail.com
Fri Jun 4 16:49:38 UTC 2021
Oh, also ensure that DNS is working properly on the node. It could be
that it isn't able to map the master's name to its IP.
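
A quick way to check that from the node (just a sketch; <controller> stands
in for whatever SlurmctldHost is set to in your slurm.conf):

  # which controller name slurmd will use
  scontrol show config | grep -i SlurmctldHost

  # and whether the node can resolve it
  getent hosts <controller>

If that lookup fails or returns a stale address, slurmd can't report in,
which could explain the controller marking the node not responding.
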
Brian Andrus
On 6/4/2021 9:31 AM, Herc Silverstein wrote:
> Hi,
>
> The slurmctld.log shows (for this node):
>
> ...
>
> [2021-05-25T00:12:27.481] sched: Allocate JobId=3402729
> NodeList=gpu-t4-4x-ondemand-44 #CPUs=1 Partition=gpu-t4-4x-ondemand
> [2021-05-25T00:12:27.482] sched: Allocate JobId=3402730
> NodeList=gpu-t4-4x-ondemand-44 #CPUs=1 Partition=gpu-t4-4x-ondemand
> [2021-05-25T00:12:27.482] sched: Allocate JobId=3402731
> NodeList=gpu-t4-4x-ondemand-44 #CPUs=1 Partition=gpu-t4-4x-ondemand
> [2021-05-25T00:15:29.043] error: Nodes gpu-t4-4x-ondemand-44 not
> responding
>
>
> sinfo -R initially doesn't show it as problematic, though I see it go
> into:
>
> gpu-t4-4x-ondemand up infinite 1 comp* gpu-t4-4x-ondemand-44
>
> However, the node where slurmctld is running knows about it:
>
> host gpu-t4-4x-ondemand-44
> gpu-t4-4x-ondemand-44.virtual-cluster.local has address 10.4.64.11
>
> and I can log in to the node:
>
> # systemctl status slurmd
> ● slurmd.service - Slurm node daemon
> Loaded: loaded (/usr/lib/systemd/system/slurmd.service; disabled;
> vendor preset: disabled)
> Active: active (running) since Tue 2021-05-25 00:12:24 UTC; 48s ago
> Process: 1874 ExecStart=/opt/slurm/sbin/slurmd $SLURMD_OPTIONS
> (code=exited, status=0/SUCCESS)
> Main PID: 1876 (slurmd)
> Tasks: 1
> Memory: 11.6M
> CGroup: /system.slice/slurmd.service
> └─1876 /opt/slurm/sbin/slurmd -f /etc/slurm/slurm.conf
>
> May 25 00:12:23 gpu-t4-4x-ondemand-44.virtual-cluster.local
> systemd[1]: Starting Slurm node daemon...
> May 25 00:12:24 gpu-t4-4x-ondemand-44.virtual-cluster.local
> systemd[1]: Can't open PID file /var/run/slurmd.pid (yet?)...ory
> May 25 00:12:24 gpu-t4-4x-ondemand-44.virtual-cluster.local
> systemd[1]: Started Slurm node daemon.
> Hint: Some lines were ellipsized, use -l to show in full.
>
> later:
>
> sinfo:
>
> gpu-t4-4x-ondemand up infinite 1 idle* gpu-t4-4x-ondemand-44
>
> root@service(eigen2):log# sinfo -R
> REASON USER TIMESTAMP NODELIST
> Not responding slurm 2021-05-25T00:45:40 gpu-t4-4x-ondemand-44
>
> and slurmctld.log:
>
> [2021-05-25T00:15:29.043] error: Nodes gpu-t4-4x-ondemand-44 not
> responding
> [2021-05-25T00:19:16.397] error: Nodes gpu-t4-4x-ondemand-44 not
> responding, setting DOWN
> [2021-05-25T00:20:02.092] powering down node gpu-t4-4x-ondemand-44
> [2021-05-25T00:20:08.438] error: Nodes gpu-t4-4x-ondemand-44 not
> responding, setting DOWN
> [2021-05-25T00:25:02.931] powering down node gpu-t4-4x-ondemand-44
> [2021-05-25T00:25:04.903] error: Nodes gpu-t4-4x-ondemand-44 not
> responding, setting DOWN
> [2021-05-25T00:30:01.247] powering down node gpu-t4-4x-ondemand-44
> [2021-05-25T00:31:21.479] error: Nodes gpu-t4-4x-ondemand-44 not
> responding, setting DOWN
> [2021-05-25T00:35:01.359] powering down node gpu-t4-4x-ondemand-44
> [2021-05-25T00:35:41.756] error: Nodes gpu-t4-4x-ondemand-44 not
> responding, setting DOWN
> [2021-05-25T00:40:01.671] powering down node gpu-t4-4x-ondemand-44
> [2021-05-25T00:40:41.225] error: Nodes gpu-t4-4x-ondemand-44 not
> responding, setting DOWN
> [2021-05-25T00:45:01.430] powering down node gpu-t4-4x-ondemand-44
> [2021-05-25T00:45:40.071] error: Nodes gpu-t4-4x-ondemand-44 not
> responding, setting DOWN
>
>
> This makes sense given what it thinks the state is. However, it's
> unclear why it considers the node not responding, given that slurmd is
> running and that the node can be logged into.
>
> Herc
>
>
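
One more thought on the "not responding" part: the controller decides that
based on whether it can reach slurmd over the node's SlurmdPort, so slurmd
being up locally (and ssh working) doesn't rule out a network problem in
the controller-to-node direction. Roughly, from the controller (6818 is
just the default SlurmdPort; adjust to whatever your config shows):

  # confirm the port slurmd listens on
  scontrol show config | grep -i SlurmdPort

  # can the controller reach slurmd on that port?
  nc -zv gpu-t4-4x-ondemand-44 6818

  # state and reason as the controller currently sees them
  scontrol show node gpu-t4-4x-ondemand-44

  # once it's reachable again, clear the down* state
  scontrol update NodeName=gpu-t4-4x-ondemand-44 State=RESUME

If the TCP check fails even though ssh works, a firewall or security-group
rule between the controller and the node is a likely culprit.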