[slurm-users] nodes going to down* and getting stuck in that state
Herc Silverstein
herc.silverstein at schrodinger.com
Fri Jun 4 16:31:30 UTC 2021
Hi,
The slurmctld.log shows (for this node):
...
[2021-05-25T00:12:27.481] sched: Allocate JobId=3402729
NodeList=gpu-t4-4x-ondemand-44 #CPUs=1 Partition=gpu-t4-4x-ondemand
[2021-05-25T00:12:27.482] sched: Allocate JobId=3402730
NodeList=gpu-t4-4x-ondemand-44 #CPUs=1 Partition=gpu-t4-4x-ondemand
[2021-05-25T00:12:27.482] sched: Allocate JobId=3402731
NodeList=gpu-t4-4x-ondemand-44 #CPUs=1 Partition=gpu-t4-4x-ondemand
[2021-05-25T00:15:29.043] error: Nodes gpu-t4-4x-ondemand-44 not responding
sinfo -R doesn't initially show it as problematic, though I do see the
node go into:
gpu-t4-4x-ondemand up infinite 1 comp*
gpu-t4-4x-ondemand-44
However, the node where slurmctld is running knows about it:
host gpu-t4-4x-ondemand-44
gpu-t4-4x-ondemand-44.virtual-cluster.local has address 10.4.64.11
and I can log in to the node:
# systemctl status slurmd
● slurmd.service - Slurm node daemon
Loaded: loaded (/usr/lib/systemd/system/slurmd.service; disabled;
vendor preset: disabled)
Active: active (running) since Tue 2021-05-25 00:12:24 UTC; 48s ago
Process: 1874 ExecStart=/opt/slurm/sbin/slurmd $SLURMD_OPTIONS
(code=exited, status=0/SUCCESS)
Main PID: 1876 (slurmd)
Tasks: 1
Memory: 11.6M
CGroup: /system.slice/slurmd.service
└─1876 /opt/slurm/sbin/slurmd -f
/etc/slurm/slurm.conf
May 25 00:12:23 gpu-t4-4x-ondemand-44.virtual-cluster.local systemd[1]:
Starting Slurm node daemon...
May 25 00:12:24 gpu-t4-4x-ondemand-44.virtual-cluster.local systemd[1]:
Can't open PID file /var/run/slurmd.pid (yet?)...ory
May 25 00:12:24 gpu-t4-4x-ondemand-44.virtual-cluster.local systemd[1]:
Started Slurm node daemon.
Hint: Some lines were ellipsized, use -l to show in full.
later:
sinfo:
gpu-t4-4x-ondemand up infinite 1 idle*
gpu-t4-4x-ondemand-44
root@service(eigen2):log# sinfo -R
REASON USER TIMESTAMP NODELIST
Not responding slurm 2021-05-25T00:45:40 gpu-t4-4x-ondemand-44
and slurmctld.log:
[2021-05-25T00:15:29.043] error: Nodes gpu-t4-4x-ondemand-44 not responding
[2021-05-25T00:19:16.397] error: Nodes gpu-t4-4x-ondemand-44 not
responding, setting DOWN
[2021-05-25T00:20:02.092] powering down node gpu-t4-4x-ondemand-44
[2021-05-25T00:20:08.438] error: Nodes gpu-t4-4x-ondemand-44 not
responding, setting DOWN
[2021-05-25T00:25:02.931] powering down node gpu-t4-4x-ondemand-44
[2021-05-25T00:25:04.903] error: Nodes gpu-t4-4x-ondemand-44 not
responding, setting DOWN
[2021-05-25T00:30:01.247] powering down node gpu-t4-4x-ondemand-44
[2021-05-25T00:31:21.479] error: Nodes gpu-t4-4x-ondemand-44 not
responding, setting DOWN
[2021-05-25T00:35:01.359] powering down node gpu-t4-4x-ondemand-44
[2021-05-25T00:35:41.756] error: Nodes gpu-t4-4x-ondemand-44 not
responding, setting DOWN
[2021-05-25T00:40:01.671] powering down node gpu-t4-4x-ondemand-44
[2021-05-25T00:40:41.225] error: Nodes gpu-t4-4x-ondemand-44 not
responding, setting DOWN
[2021-05-25T00:45:01.430] powering down node gpu-t4-4x-ondemand-44
[2021-05-25T00:45:40.071] error: Nodes gpu-t4-4x-ondemand-44 not
responding, setting DOWN
This makes sense given what slurmctld thinks the state is. However, it's
unclear why it considers the node non-responding, given that slurmd is
running and I can log in to the node.
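To make the repeating cycle in the slurmctld.log excerpt above easier to see, here is a rough Python sketch that classifies each line and measures the gap between successive "powering down node" messages. The sample lines are copied from the excerpt (with wrapped lines rejoined); the function names and event labels are made up for illustration and are not part of Slurm.

```python
import re
from datetime import datetime

# Sample lines copied from the slurmctld.log excerpt (wrapped lines rejoined).
LOG = """\
[2021-05-25T00:15:29.043] error: Nodes gpu-t4-4x-ondemand-44 not responding
[2021-05-25T00:19:16.397] error: Nodes gpu-t4-4x-ondemand-44 not responding, setting DOWN
[2021-05-25T00:20:02.092] powering down node gpu-t4-4x-ondemand-44
[2021-05-25T00:20:08.438] error: Nodes gpu-t4-4x-ondemand-44 not responding, setting DOWN
[2021-05-25T00:25:02.931] powering down node gpu-t4-4x-ondemand-44
"""

# "[timestamp] message" as it appears in the excerpt.
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+(?P<msg>.*)$")


def classify(msg):
    """Map a log message to a hypothetical event label, or None if unrelated."""
    if "setting DOWN" in msg:
        return "set_down"
    if "not responding" in msg:
        return "not_responding"
    if msg.startswith("powering down node"):
        return "power_down"
    return None


def parse_events(text):
    """Return a list of (timestamp, label) for the lines we recognize."""
    events = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        label = classify(m.group("msg"))
        if label is None:
            continue
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")
        events.append((ts, label))
    return events


def power_down_gaps(events):
    """Seconds between successive 'power_down' events."""
    times = [ts for ts, label in events if label == "power_down"]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    events = parse_events(LOG)
    for ts, label in events:
        print(ts.isoformat(), label)
    print("gaps between power-down attempts (s):", power_down_gaps(events))
```

Run against the full excerpt, this shows a power-down attempt roughly every five minutes, each followed by another "setting DOWN" message.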
Herc