Dear SLURM Users and Administrators,

I am looking for a way to customize job submission exit statuses (mainly error codes and the messages accompanying them) after the job has already been queued by the SLURM controller. Our goal is to provide more user-friendly messages and reminders whenever there is an error or obstacle, tailored to our QoS/account setup.

For example, when a job exceeds the CPU-minute limit of a given QoS (or account), we would like to notify the user, right after the (successful) submission, that the job has been queued (as expected) but will not start until the CPU-minute limit is increased, and that they should contact the administrators to apply for more resources. Similarly, if a user queues a job that cannot start immediately because of the per-user MaxJobs limit, we would like to print an additional message after the srun/sbatch submission. We want to provide this information immediately after submission, without the user having to check the job status with `squeue`.
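For illustration only, here is a minimal client-side sketch of the behaviour we are after, written as a Python wrapper around sbatch. The Reason strings and hint texts are configuration-dependent examples, not an exhaustive list, and the whole wrapper is just a stand-in for what we would prefer Slurm itself to report at submission time:

#!/usr/bin/env python3
# Sketch: submit a job and immediately report why it is pending.
# Assumes sbatch/squeue are on PATH; the Reason codes below are examples.
import subprocess
import sys
import time

# Hypothetical mapping from Slurm "Reason" codes to friendly reminders.
HINTS = {
    "AssocGrpCPUMinutesLimit": "Your account has used up its CPU-minute budget; "
                               "please contact the administrators to apply for more resources.",
    "QOSMaxJobsPerUserLimit":  "You have reached the per-user MaxJobs limit of this QoS; "
                               "the job will start once your earlier jobs finish.",
}

def submit(args):
    # --parsable makes sbatch print only the job id (and cluster name)
    out = subprocess.run(["sbatch", "--parsable"] + args,
                         capture_output=True, text=True, check=True)
    return out.stdout.strip().split(";")[0]

def pending_reason(job_id):
    # %r prints the reason a job is waiting; -h suppresses the header
    out = subprocess.run(["squeue", "-h", "-j", job_id, "-o", "%r"],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

if __name__ == "__main__":
    job_id = submit(sys.argv[1:])
    print(f"Submitted batch job {job_id}")
    time.sleep(2)  # give the scheduler a moment to evaluate the job
    reason = pending_reason(job_id)
    if reason in HINTS:
        print(f"Note: {HINTS[reason]}", file=sys.stderr)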

In the Job Launch Guide (https://slurm.schedmd.com/job_launch.html) the following steps are distinguished:

1. Call job_submit plugins to modify the request as appropriate

2. Validate that the options are valid for this user (e.g. valid partition name, valid limits, etc.)

3. Determine if this job is the highest priority runnable job, if so then really try to allocate resources for it now, otherwise only validate that it could run if no other jobs existed

4. Determine which nodes could be used for the job. If the feature specification uses an exclusive OR option, then multiple iterations of the selection process below will be required with disjoint sets of nodes

5. Call the select plugin to select the best resources for the request

6. The select plugin will consider network topology and the topology within a node (e.g. sockets, cores, and threads) to select the best resources for the job

7. If the job can not be initiated using available resources and preemption support is configured, the select plugin will also determine if the job can be initiated after preempting lower priority jobs. If so then initiate preemption as needed to start the job.

From my understanding, achieving our goal would require access to the source code or to a plugin hook related to point 2 (and partly point 3). Unfortunately, the job_submit (Lua) plugin from point 1 (and the cli_filter plugin as well) cannot be used, because it only has access to the parameters of the submitted job and to the SLURM partitions, but not to the QoS/account usage and limits.
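To make the gap concrete, below is a rough Python sketch of the kind of lookup such a check would need, pulling association limits out of the accounting database with the sacctmgr CLI. The format fields are standard association fields, but the command and parsing are only an assumption about how one could approximate this externally; it is exactly the information that job_submit/lua does not expose, which is why a hook inside the controller would be preferable:

#!/usr/bin/env python3
# Sketch: read per-association limits from slurmdbd via sacctmgr (assumed on PATH).
import getpass
import subprocess

def association_limits(user):
    # -n: no header, -P: pipe-separated output; Account, QOS, GrpTRESMins and
    # MaxJobs are standard association format fields.
    out = subprocess.run(
        ["sacctmgr", "-nP", "show", "assoc", f"user={user}",
         "format=Account,QOS,GrpTRESMins,MaxJobs"],
        capture_output=True, text=True, check=True)
    for line in out.stdout.splitlines():
        account, qos, grp_tres_mins, max_jobs = line.split("|")
        yield {"account": account, "qos": qos,
               "grp_tres_mins": grp_tres_mins, "max_jobs": max_jobs}

if __name__ == "__main__":
    for assoc in association_limits(getpass.getuser()):
        print(assoc)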

Is there any way to extend the customization of job submission to include such features?

Best regards,
Sebastian

--
dr inż. Sebastian Sitkiewicz 

Politechnika Wrocławska 
Wrocławskie Centrum Sieciowo-Superkomputerowe
Dział Usług Obliczeniowych
Wyb. Wyspiańskiego 27
50-370 Wrocław 
www.wcss.pl