[slurm-users] Power Saving Issue - Job B is executed before Job A - node not ready?

Eg. Bo. egonle at aol.com
Wed Dec 23 13:37:05 UTC 2020


Hello,
Slurm Power Saving (19.05) was configured successfully in our cloud environment. Jobs can be submitted, and nodes are provisioned and deprovisioned as expected. Unfortunately there seems to be an edge case (or a config issue :-D). After a job (jobA) is submitted to partition A, node provisioning starts. During that phase another job (jobB) is submitted to that partition, explicitly requesting the same node (-w); I am not sure whether that is really a must-have right now. The edge case arises from how our application schedules jobs.
Unfortunately jobB runs before jobA and fails, while jobA finishes successfully a few seconds later. So the configuration should be OK overall.
srun: error: Unable to resolve "mynodename": Host name lookup failure
srun: error: fwd_tree_thread: can't find address for host mynodename check slurm.conf
srun: error: Task launch for 123456.0 failed on node mynodename: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted: Waiting up to 188 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
It looks like slurmctld applies some magic to jobA (Resetting JobId=jobidA start time for node power up) but not to jobB.
update_node: node mynodename state set to ALLOCATED
Node mynodename2 now responding
Node mynodename now responding
update_node: node mynodename state set to ALLOCATED
_pick_step_nodes: Configuration for JobId=jobidB is complete
job_step_signal: JobId=jobidB StepId=0 not found
_pick_step_nodes: Configuration for JobId=jobidA is complete
Resetting JobId=jobidA start time for node power up
_job_complete: JobId=jobidA WEXITSTATUS 0
_job_complete: JobId=jobidA done
job_step_signal: JobId=jobidB StepId=0 not found
_job_complete: JobId=jobidB WTERMSIG 116
_job_complete: JobId=jobidB done
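
As a stopgap on the job side, the launch in jobB could perhaps be delayed until the node name actually resolves, since the srun errors point at a host name lookup failure. The following is only a sketch under assumptions: the helper name wait_for_resolvable, the timeout and interval values, and the idea of wrapping the srun call this way are hypothetical and not something Slurm provides. It also does not explain why slurmctld treats the two jobs differently.

```python
import socket
import time


def wait_for_resolvable(hostname, timeout=300.0, interval=5.0,
                        resolver=socket.gethostbyname, sleep=time.sleep):
    """Poll until `hostname` resolves or `timeout` seconds have elapsed.

    Returns the resolved address, or None if the deadline passes first.
    `resolver` and `sleep` are injectable so the loop can be exercised
    without real DNS lookups or real waiting.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return resolver(hostname)
        except OSError:
            # socket.gaierror (lookup failure) is a subclass of OSError.
            if time.monotonic() >= deadline:
                return None
            sleep(interval)
```

A job script could call wait_for_resolvable("mynodename") before invoking srun and exit early if it returns None.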

Has anyone seen this before, or does anyone have an idea how to fix it?


Thanks & Best
Eg. Bo.