[slurm-users] (no subject)

Sun Nov 5 09:13:53 UTC 2023

I'm having a hard time understanding how jobs are distributed between 2
clusters in a Slurm multi-cluster environment. The documentation says that
each job is submitted to the cluster that provides the earliest start time,
and that once a job is submitted to a cluster, it can't be redistributed to
another cluster. The file
"<slurm_github+repository>/src/common/slurmdb_defs.c" lists 3 comparison
criteria for choosing a suitable cluster: 1) The cluster with the earliest
start time is preferred. 2) If the start times of both clusters are equal,
the cluster with the lower preempt_cnt is preferred. 3) If those are also
equal, the local cluster is chosen.
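
To make my reading of those three criteria concrete, here is a minimal
sketch in Python. It is not Slurm's actual C code: the Candidate class, the
is_local flag, and the function names are made up for illustration, and only
start_time and preempt_cnt are names I took from the source file.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One cluster's answer for a given job (hypothetical structure)."""
    name: str
    start_time: int    # estimated start time of the job on this cluster
    preempt_cnt: int   # number of jobs that would have to be preempted
    is_local: bool     # hypothetical flag: True for the submitting cluster


def sort_key(c: Candidate):
    # 1) earliest start time, 2) lower preempt_cnt, 3) local cluster first
    return (c.start_time, c.preempt_cnt, 0 if c.is_local else 1)


def pick_cluster(candidates):
    """Return the candidate that wins under the three criteria above."""
    return min(candidates, key=sort_key)


if __name__ == "__main__":
    a = Candidate("cluster_a", start_time=100, preempt_cnt=0, is_local=True)
    b = Candidate("cluster_b", start_time=100, preempt_cnt=0, is_local=False)
    print(pick_cluster([b, a]).name)
```

With equal start times and equal preempt_cnt, this sketch falls through to
the third criterion and picks the local cluster.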

   - I wonder how the start time is calculated. I tried to deduce it from
   the source code, but I got lost. Is it calculated for each job, with the
   smallest start_time+job_execution_time over all jobs taken as the
   start_time of the cluster?
   - Is it possible for 2 or more jobs to see the same cluster start time
   if they are submitted almost simultaneously (i.e., before the start time
   has been updated by any earlier job)? It seems so to me, because one
   cluster receives most of the jobs even though the other cluster is much
   less loaded (and has faster processors). Besides, 'squeue' sometimes
   shows fewer jobs than were actually submitted (by about 1 job).
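
To illustrate the scenario I am asking about in the second point, here is a
toy model in Python. It is only my hypothesis, not how Slurm actually
computes start times: the assign function, the refresh flag, and the
assumption that each accepted job pushes a cluster's estimate back by its
run time are all invented for the example.

```python
def assign(jobs, est_start, refresh):
    """Place each job on the cluster with the smallest estimated start time.

    jobs      -- list of job run times (arbitrary units)
    est_start -- dict mapping cluster name to its estimated start time
    refresh   -- True: each job sees estimates updated by earlier jobs;
                 False: every job sees the same stale snapshot
    """
    live = dict(est_start)  # estimates updated as jobs are accepted
    placement = []
    for run_time in jobs:
        view = live if refresh else est_start
        target = min(view, key=view.get)  # earliest estimated start wins
        placement.append(target)
        live[target] += run_time  # toy assumption: the job delays that cluster
    return placement


if __name__ == "__main__":
    estimates = {"A": 0, "B": 5}
    print(assign([10, 10, 10], estimates, refresh=False))
    print(assign([10, 10, 10], estimates, refresh=True))
```

In this toy model, the stale snapshot sends all three jobs to cluster A,
while refreshed estimates spread them across A and B. That matches the
imbalance I observe, but I don't know whether it is the real cause.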

Regards

-- 
Mohammed