Mike via slurm-users <slurm-users@lists.schedmd.com> writes:
Greetings,
We are new to Slurm and we are trying to better understand why we're seeing high-mem jobs stuck in Pending state indefinitely. Smaller (mem) jobs in the queue will continue to pass by the high-mem jobs even when we bump priority on a pending high-mem job way up.

We have been reading over the backfill scheduling page and what we think we're seeing is that we need to require that users specify a --time parameter on their jobs so that Backfill works properly. None of our users specify a --time param because we have never required it. Is that what we need to require in order to fix this situation?

From the backfill page: "Backfill scheduling is difficult without reasonable time limit estimates for jobs, but some configuration parameters that can help" and it goes on to list some config params that we have not set (DefaultTime, MaxTime, OverTimeLimit). We also see language such as, "Since the expected start time of pending jobs depends upon the expected completion time of running jobs, reasonably accurate time limits are important for backfill scheduling to work well."

So we suspect that we can achieve proper backfill scheduling by requiring that all users supply a "--time" parameter via a job submit plugin. Would that be a fair statement?
You might also need to look at the configuration parameter SchedulerParameters, in particular:
bf_window=#
    The number of minutes into the future to look when considering jobs to schedule. Higher values result in more overhead and less responsiveness. A value at least as long as the highest allowed time limit is generally advisable to prevent job starvation. In order to limit the amount of data managed by the backfill scheduler, if the value of bf_window is increased, then it is generally advisable to also increase bf_resolution. This option applies only to SchedulerType=sched/backfill. Default: 1440 (1 day), Min: 1, Max: 43200 (30 days).
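As a rough illustration of how these settings fit together, a slurm.conf excerpt might look something like the sketch below. The partition name, node list, and all of the values are assumptions chosen for the example, not recommendations for any particular site: it supposes a longest allowed time limit of 7 days, so bf_window is set to the same span in minutes, and bf_resolution is raised from its default as the man page text suggests.

```
# slurm.conf (excerpt). Partition name, node list and values are placeholders.
SchedulerType=sched/backfill

# bf_window is in minutes: 10080 = 7 days, matching MaxTime below.
# bf_resolution is in seconds (default 60), raised here along with bf_window.
SchedulerParameters=bf_window=10080,bf_resolution=600

# DefaultTime is applied to jobs submitted without --time;
# MaxTime caps what a user may request.
PartitionName=batch Nodes=node[01-10] Default=YES DefaultTime=01:00:00 MaxTime=7-00:00:00 State=UP
```

With a DefaultTime in place, jobs that omit --time still carry a finite limit the backfill scheduler can plan around, although a short default only helps if users with longer jobs then request realistic limits themselves.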
Regards
Loris Bennett