[slurm-users] What is the 2^32-1 value in "stepd_connect to <jobid>.4294967295 failed" telling you
Kevin Buckley
Kevin.Buckley at pawsey.org.au
Fri Mar 8 08:25:08 UTC 2019
We have some SLURM jobs for which the <jobid>.0 process is being killed
by the OS's oom-kill, which makes SLURM come over all wobbly!
After messages akin to:
[<jobid>.extern] _oom_event_monitor: oom-kill event count: 1
[<jobid>.0] done with job
what we then see in the slurmd logs are messages of the form:
error: stepd_connect to <jobid>.1 failed: No such file or directory
error: stepd_connect to <jobid>.4294967295 failed: No such file or directory
We can imagine why a job that got killed in step 0 might still be looking
for the <jobid>.1 step but the <jobid>.2^32-1 is beyond our imagination.
Is that a wraparound of some kind? Does SLURM internally use negative
step IDs that don't usually enter the public consciousness via its logging,
or is this telling us something else?
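
For what it's worth, the arithmetic behind the wraparound guess can be
sketched as below. This only assumes that a step ID is held in an unsigned
32-bit field; the helper names as_uint32 and as_int32 are made up for
illustration and are not SLURM functions. It shows that -1 and 4294967295
share one bit pattern, which is consistent with either a wrapped negative
value or an all-ones sentinel, but does not confirm what SLURM does internally.

```python
import ctypes

UINT32_MAX = 2**32 - 1  # 4294967295, the step ID seen in the slurmd log


def as_uint32(value: int) -> int:
    """Reinterpret an integer as an unsigned 32-bit value (hypothetical helper)."""
    return value & 0xFFFFFFFF


def as_int32(value: int) -> int:
    """Reinterpret an integer as a signed 32-bit value (hypothetical helper)."""
    return ctypes.c_int32(value).value


# A signed -1 stored in an unsigned 32-bit field reads back as 2^32-1.
print(as_uint32(-1))          # 4294967295
# Going the other way, the logged value reads as -1 when treated as signed.
print(as_int32(4294967295))   # -1
```

So a step ID of 4294967295 is what any 32-bit all-ones value would look like
when printed as unsigned, whichever way it came about.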
Kevin Buckley
--
Supercomputing Systems Administrator
Pawsey Supercomputing Centre
Eml: kevin.buckley at pawsey.org.au