[slurm-users] slurm cluster error - bad node index

Patrick Goetz pgoetz at math.utexas.edu
Fri Oct 27 21:03:41 UTC 2023


Hi -

Very delayed response to this, as I'm working my way through a backlog 
of slurm-user posts. If this error is intermittent, it's likely a 
hardware issue. Recently I ran into a problem where a host with 8 GPUs 
was spontaneously rebooting a couple of minutes after a user would start 
an 8 GPU process. The same task on the same machine with 7 GPUs worked 
fine. It turned out that one of the power supplies was partially 
unseated, and this was causing the problem. Once the power supply was 
properly re-installed, the problem went away.
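
In case it helps anyone chasing something similar: a few generic checks 
I find useful for ruling hardware in or out on a node like this 
(assuming the host has a BMC reachable via ipmitool and a persistent 
systemd journal; adjust to your environment):

    # Power supply and chassis events usually land in the BMC event log
    ipmitool sel list | tail -20

    # Current power supply sensor readings
    ipmitool sdr type "Power Supply"

    # Kernel messages from the previous boot, if the journal is persistent
    journalctl -k -b -1 | tail -50

    # Per-GPU power draw and limits
    nvidia-smi -q -d POWER

In our case the unexpected reboots left nothing useful in the OS logs, 
but the BMC event log made the power problem obvious.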

On 2/22/23 01:38, Eunsong Goh wrote:
> Hi,
> 
> I'm using a Slurm cluster with one master node and two worker nodes.
> When I run an sbatch job (a shell script with SBATCH parameters), the 
> job is allocated to both nodes.
> With the nvidia-smi command, I confirmed that the job was properly 
> allocated across the two nodes.
> But after a while, the job on one node was suddenly killed, and the 
> error messages below appeared in /var/log/slurm/slurmd.log.
> After that, the job continued on only the remaining node.
> 
> What is the problem?
> 
> [2023-02-22T07:12:54.214] pyxis: version v0.11.1
> [2023-02-22T07:12:54.215] slurmd version 22.05.2 started
> [2023-02-22T07:12:54.226] slurmd started on Wed, 22 Feb 2023 07:12:54 +0000
> [2023-02-22T07:12:54.299] CPUs=40 Boards=1 Sockets=2 Cores=10 Threads=2 
> Memory=257609 TmpDisk=100220 Uptime=104407 CPUSpecList=(null) 
> FeaturesAvail=(null) FeaturesActive=(null)
> [2023-02-22T07:13:53.613] error: bad node index (-1 > 1)
> [2023-02-22T07:14:00.966] epilog for job 1056 ran for 7 seconds
> [2023-02-22T07:16:08.691] task/affinity: task_p_slurmd_batch_request: 
> task_p_slurmd_batch_request: 127
> [2023-02-22T07:16:08.692] task/affinity: batch_bind: job 127 CPU input 
> mask for node: 0x0000000003
> [2023-02-22T07:16:08.692] task/affinity: batch_bind: job 127 CPU final 
> HW mask for node: 0x0000100001
> [2023-02-22T07:16:08.951] [127.extern] task/cgroup: _memcg_initialize: 
> job: alloc=0MB mem.limit=257609MB memsw.limit=unlimited
> [2023-02-22T07:16:08.951] [127.extern] task/cgroup: _memcg_initialize: 
> step: alloc=0MB mem.limit=257609MB memsw.limit=unlimited
> [2023-02-22T07:16:08.956] Launching batch job 127 for UID 1002
> [2023-02-22T07:16:08.976] [127.batch] task/cgroup: _memcg_initialize: 
> job: alloc=0MB mem.limit=257609MB memsw.limit=unlimited
> [2023-02-22T07:16:08.976] [127.batch] task/cgroup: _memcg_initialize: 
> step: alloc=0MB mem.limit=257609MB memsw.limit=unlimited
> [2023-02-22T07:16:09.036] launch task StepId=127.0 request from UID:1002 
> GID:1003 HOST:27.122.137.19 PORT:32996
> [2023-02-22T07:16:09.036] task/affinity: lllp_distribution: JobId=127 
> implicit auto binding: cores, dist 8192
> [2023-02-22T07:16:09.036] task/affinity: _task_layout_lllp_block: 
> _task_layout_lllp_block
> [2023-02-22T07:16:09.036] task/affinity: _lllp_generate_cpu_bind: 
> _lllp_generate_cpu_bind jobid [127]: mask_cpu, 0x0000100001
> [2023-02-22T07:16:09.148] [127.0] task/cgroup: _memcg_initialize: job: 
> alloc=0MB mem.limit=257609MB memsw.limit=unlimited
> [2023-02-22T07:16:09.149] [127.0] task/cgroup: _memcg_initialize: step: 
> alloc=0MB mem.limit=257609MB memsw.limit=unlimited
> [2023-02-22T07:16:53.880] error: bad node index (-1 > 1)
> [2023-02-22T07:16:55.569] [127.0] get_exit_code task 0 died by signal: 9
> [2023-02-22T07:16:55.578] [127.0] done with job
> [2023-02-22T07:17:01.412] epilog for job 1056 ran for 8 seconds
> [2023-02-22T07:19:35.335] [127.batch] done with job
> [2023-02-22T07:19:35.350] [127.extern] done with job
> [2023-02-22T07:19:53.067] error: bad node index (-1 > 1)
> [2023-02-22T07:20:00.154] epilog for job 1056 ran for 7 seconds
> [2023-02-22T07:22:53.294] error: bad node index (-1 > 1)
> [2023-02-22T07:23:00.463] epilog for job 1056 ran for 7 seconds
> [2023-02-22T07:25:53.485] error: bad node index (-1 > 1)
> [2023-02-22T07:26:00.612] epilog for job 1056 ran for 7 seconds
> [2023-02-22T07:28:53.709] error: bad node index (-1 > 1)
> [2023-02-22T07:29:00.841] epilog for job 1056 ran for 7 seconds
> [2023-02-22T07:31:53.975] error: bad node index (-1 > 1)
> [2023-02-22T07:32:01.083] epilog for job 1056 ran for 8 seconds
> [2023-02-22T07:34:53.165] error: bad node index (-1 > 1)
> 