[slurm-users] What is the 2^32-1 value in "stepd_connect to <jobid>.4294967295 failed" telling you

Kevin Buckley Kevin.Buckley at pawsey.org.au
Fri Mar 8 08:25:08 UTC 2019


We have some SLURM jobs for which the <jobid>.0 process is being killed
by the OS's oom-kill, which makes SLURM come over all wobbly!

After messages akin to:

[<jobid>.extern] _oom_event_monitor: oom-kill event count: 1
[<jobid>.0] done with job

what we then see in the slurmd logs are messages of the form:

error: stepd_connect to <jobid>.1 failed: No such file or directory
error: stepd_connect to <jobid>.4294967295 failed: No such file or directory

We can imagine why a job that got killed in step 0 might still be looking
for the <jobid>.1 step, but <jobid>.4294967295 (2^32-1) is beyond our
imagination.

Is that a wraparound of some kind? Does SLURM use, internally, negative
step IDs that don't usually enter the public consciousness via its logging,
or is this telling us something else?
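
The arithmetic behind the wraparound guess can be checked with a short
sketch. It assumes only that the step ID is held in an unsigned 32-bit
field, which the logged value suggests but does not confirm. The helper
name as_uint32 is made up for illustration and is not part of SLURM.

```python
# Hypothetical helper (not a SLURM function): reinterpret an integer as an
# unsigned 32-bit value, the way a C uint32_t would store it.
def as_uint32(value: int) -> int:
    return value & 0xFFFFFFFF


# -1 stored in an unsigned 32-bit field reads back as 2^32-1.
print(as_uint32(-1))                # 4294967295
print(as_uint32(-1) == 2**32 - 1)   # True
print(hex(4294967295))              # 0xffffffff
```

So the logged step ID is what -1, or an all-ones sentinel value, would look
like if printed as an unsigned 32-bit integer. Whether SLURM actually uses
it that way is the open question.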

Kevin Buckley
-- 
Supercomputing Systems Administrator
Pawsey Supercomputing Centre
Eml: kevin.buckley at pawsey.org.au


