[slurm-users] The 8 second default for sbatch's --get-user-env: is it "the only default"?

Kevin Buckley Kevin.Buckley at pawsey.org.au
Tue Mar 15 07:40:03 UTC 2022


We have a group of users who occasionally report seeing jobs start without,
for example, $HOME being set.

Looking at the slurmd logs (info level) on our 20.11.8 node shows the first
instance of an afflicted JobID appearing as

[2022-03-11T00:19:35.117] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: <JobID>

but the log then notes Slurm becoming aware that it couldn't get a user environment at

[2022-03-11T00:21:36.254] error: Failed to load current user environment variables
[2022-03-11T00:21:36.254] error: _get_user_env: Unable to get user's local environment, running only with passed environment
[2022-03-11T00:21:36.254] Launching batch job <JobID> for UID <UID>

so that's 2 minutes.

I'm not aware of "us", i.e. us on the systems side, nor the users in question,
overriding what the sbatch man page says is the

   --get-user-env[=timeout][mode]

timeout default of 8 seconds, anywhere.
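
(For reference, that man-page syntax would have a user ask for, say, a
30 second timeout with the "long"/login-style environment capture like so,
where the 30 and the script name are purely illustrative:

   sbatch --get-user-env=30L jobscript.sh

the trailing letter being the "S"/"L" mode from the man page entry.)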

Is it possible that, if the sbatch option is invoked at all, there's a "fallback"
timeout value that gets "inherited" into what then appears to be the option-specific
timeout, although even then the only 120 seconds we have in the config is:

SlurmctldTimeout        = 120 sec

and I'm thinking that it's the job on the node, so under the control of the
slurmd, for which the timeout is 300 sec, and not the slurmctld, that's
waiting for the user env?
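
(For anyone wanting to cross-check their own site, the running daemons'
view of all the timeouts can be dumped with something like:

   scontrol show config | grep -i timeout

which, here, should show the SlurmctldTimeout above and the slurmd's
300 sec, among others.)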

I'd like to suggest that the afflicted members of our user community try using a

--get-user-env=timeout

with a "larger" figure, just to be on the safe side, but my "8 seconds" vs
"2 minutes" observation has got me wondering where, in time, a "safe side"
might need to start, or whether I am missing something else entirely.
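
(In concrete terms, and purely as a first stab at a figure that clears the
2 minutes we observed, rather than anything we've tested, that would be
something along the lines of:

   #SBATCH --get-user-env=180L

in the job script, or the equivalent on the sbatch command line.)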

As usual, any clues/pointers welcome,
Kevin

-- 
Supercomputing Systems Administrator
Pawsey Supercomputing Centre


