If I use the sbatch(1) option --export=NONE or --export=NIL, or wipe the environment with "env -i /usr/bin/sbatch ...", the environment is not properly constructed and I see these messages in the /var/log/*slurm* files:
[2024-02-03T11:50:33.052] _get_user_env: get env for user jsu here
[2024-02-03T11:52:33.152] timeout waiting for /bin/su to complete
[2024-02-03T11:52:34.152] error: Failed to load current user environment variables
[2024-02-03T11:52:34.153] error: _get_user_env: Unable to get user's local environment, running only with passed environment
This occurs at 120 seconds no matter whether I add --get-user-env=3600 or adjust any of the many time-related slurm.conf parameters. It is easy to reproduce by adding "sleep 100" to a .cshrc file and submitting the following file with sbatch(1):
#!/bin/csh
#SBATCH --export=NONE --propagate=NONE --get-user-env=3600L
printenv HOME
printenv USER
printenv PATH
env
I have adjusted MANY time-related limits in the slurm.conf file, to no avail. When the system is unresponsive or heavily loaded, or when users have prologues that set up complex environments via module commands (which can be notoriously slow), the jobs fail or produce errors.
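For anyone trying to check whether a given login setup is slow enough to hit this, here is a minimal sketch. It assumes the 120-second figure seen in the log above (an observed value, not a documented constant), and the helper names elapsed_seconds and check_env_time are hypothetical, not part of Slurm.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical and not part of Slurm.

# Timeout observed in the slurmd log above (an assumption, not a documented constant).
OBSERVED_LIMIT=120

# Run a command, discard its output, and print the elapsed wall time in whole seconds.
elapsed_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Compare a measured time (in seconds) against the observed limit.
check_env_time() {
    secs=$1
    if [ "$secs" -ge "$OBSERVED_LIMIT" ]; then
        echo "too slow: ${secs}s >= ${OBSERVED_LIMIT}s"
    else
        echo "ok: ${secs}s < ${OBSERVED_LIMIT}s"
    fi
}

# Example (run as root on a compute node; "jsu" is the user from the log above):
#   check_env_time "$(elapsed_seconds su - jsu -c env)"
```

Timing "su - user -c env" by hand only approximates what slurmd does, but it should at least show whether the login files alone approach the limit.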
If I configure Slurm so that jobs that time out are requeued instead of running, then a user with a slow login setup can submit a large number of jobs and effectively shut down a cluster, because this option not only requeues the jobs that fail but also puts the node where the failure occurred into a DRAIN state.
We see this as very dangerous, since by default jobs proceed to execute even when their environment is not properly constructed.
I can see that "slurmrestd getenv" and the procedure get_user_env(3c) are involved, but on a preliminary scan of the code it looked like the --get-user-env=NNNN value was being parsed, and I did not see a reason why the setup always times out at 120 seconds (at least on my system).
Does anyone know how to make the time allowed for fetching the default user environment honor the value given with the --get-user-env option when no environment is being exported to a job?
This is showing up sporadically and causing intermittent failures that are very confusing and disturbing to the users it affects.