[slurm-users] error: power_save module disabled, NULL SuspendProgram

Dr. Thomas Orgis thomas.orgis at uni-hamburg.de
Mon Apr 3 18:50:55 UTC 2023


On Wed, 29 Mar 2023 15:51:51 +0200,
Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk> wrote:

> As for job scheduling, slurmctld may allocate a job to some powered-off 
> nodes and then calls the ResumeProgram defined in slurm.conf.  From this 
> point it may indeed take 2-3 minutes before a node is up and running 
> slurmd, during which time it will have a state of POWERING_UP (see "man 
> sinfo").  If this doesn't complete after ResumeTimeout seconds, the node 
> will go into a failed state.  All this logic seems to be working well.
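(For readers following along: the knobs involved live in slurm.conf.
Roughly, and with placeholder paths and values rather than anyone's
actual configuration, it looks like this; the "NULL SuspendProgram"
error in the subject is presumably what you get when SuspendProgram
is left unset:

  SuspendProgram=/usr/local/sbin/node_suspend.sh  # script that powers a node down
  ResumeProgram=/usr/local/sbin/node_resume.sh    # script that powers a node back up
  SuspendTime=1800      # idle seconds before slurmctld suspends a node
  SuspendTimeout=120    # seconds allowed for the power-down to complete
  ResumeTimeout=900     # seconds slurmd has to come back before the node is marked down
)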

I must admit that I struggle to find a definite example in the slurmctld
logs where the ResumeTimeout itself was clearly broken. I do remember
that the node was recognized suspiciously soon, right after the timeout
had passed.

But what we definitely had was the issue that a node would _not_ come
back up again for some reason. Slurmctld can only wait for the timeout
and then tell the job that it is screwed. With my external scheme, the
job can be scheduled on any number of other nodes that become
available.

> I think we had power_save with IPMI working also in Slurm 21.08 before we 
> upgraded to 22.05.

Good to read … so either I had some special situation or misread the
logs after all. I'm digging again now to find a case. I got a
bunch of them early on …

[2022-10-03T23:22:46.664] node node329 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.664] node node330 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.664] node node331 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.665] node node332 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.665] node node333 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.665] node node334 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.665] node node335 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.665] node node336 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.666] node node337 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.666] node node338 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.666] node node339 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.666] node node340 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.666] node node341 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.667] node node342 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.667] node node343 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.667] node node344 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.667] node node345 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.669] node node346 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.670] node node347 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.670] node node348 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.670] node node349 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.676] Node node347 now responding
[2022-10-03T23:22:46.676] Node node345 now responding
[2022-10-03T23:22:46.676] Node node346 now responding
[2022-10-03T23:22:46.676] Node node349 now responding
[2022-10-03T23:22:46.676] Node node348 now responding
[2022-10-03T23:22:46.676] Node node329 now responding
[2022-10-03T23:22:46.693] Node node335 now responding
[2022-10-03T23:22:46.712] Node node332 now responding
[2022-10-03T23:22:46.712] Node node330 now responding
[2022-10-03T23:22:46.712] Node node333 now responding
[2022-10-03T23:22:46.713] Node node334 now responding
[2022-10-03T23:22:46.713] Node node338 now responding
[2022-10-03T23:22:46.713] Node node339 now responding
[2022-10-03T23:22:46.713] Node node336 now responding
[2022-10-03T23:22:46.713] Node node337 now responding
[2022-10-03T23:22:46.713] Node node340 now responding
[2022-10-03T23:22:46.713] Node node341 now responding
[2022-10-03T23:22:46.713] Node node343 now responding
[2022-10-03T23:22:46.713] Node node331 now responding
[2022-10-03T23:22:46.713] Node node344 now responding

So, quite a coincidence: node333 and friends timing out after 15 minutes
and then responding within the same second.

I also got a more isolated occurrence:

[2022-10-06T18:02:02.464] node node327 not resumed by ResumeTimeout(300) - marking down and power_save
[2022-10-06T18:02:47.441] Node node327 now responding

Or, later, probably with the then-final version of the power saving
triggered by slurmctld, with more log lines around the event:

[2022-11-18T18:37:56.437] node node355 not resumed by ResumeTimeout(300) - marking down and power_save
[2022-11-18T18:37:56.437] requeue job JobId=2522959_9(2523030) due to failure of node node355
[2022-11-18T18:37:56.437] Requeuing JobId=2522959_9(2523030)
[2022-11-18T18:38:07.290] Node node355 now responding
[2022-11-18T18:38:07.290] node node355 returned to service
[2022-11-18T18:39:08.125] error: Nodes node355 not responding
[2022-11-18T18:39:52.551] _job_complete: JobId=2522959_24(2523045) WEXITSTATUS 0
[2022-11-18T18:39:52.551] _job_complete: JobId=2522959_24(2523045) done
[2022-11-18T18:40:10.834] sched/backfill: _start_job: Started JobId=2522959_9(2523030) in gpd on node355
[…]
[2022-11-18T19:00:02.916] _job_complete: JobId=2522959_9(2523030) WEXITSTATUS 0
[2022-11-18T19:00:02.916] _job_complete: JobId=2522959_9(2523030) done

Here, the job got requeued. So maybe mainly unnecessary noise.

I see that the failure to start the job occurred at 18:37:56.437, while
the node came back about 10 seconds later. Then it went unresponsive a
minute later and came back again … maybe network weirdness.

Ah, digging in my own messages to the users, I see that we had two
distinct issues:

1. Jobs being aborted with NODE_FAIL because: “the batch system asks to
resume nodes that aren't actually suspended. This triggered a workaround
[for] issues during powerdown, doing a power cycle on them.”

2. Requeued jobs like the above.

So, point 1 was a bad interaction of the slurmctld resume script with my
reboot mechanism. For $reasons, nodes could get stuck on shutdown
(on boot, actually), and my resume script cycled the power to ensure a
fresh start if a node was already powered on. I then modified it to only
do that if there is no live instance of slurmd on the node. I never
figured out why a resume request was issued for nodes that were not
down, but apparently that can happen.

Point 2 could be valid behaviour for nodes that just need a bit too
long … or it could have been triggered spuriously when the slurmd
messages announcing a node's return to service ran into congestion. I
did not investigate this further and instead avoided this tight coupling
between reboots of possibly flaky hardware (ages 6 and up) and scheduled
jobs.


Alrighty then,

Thomas

-- 
Dr. Thomas Orgis
HPC @ Universität Hamburg


