Dear slurm user list,
we make use of elastic cloud computing i.e. node instances are created on demand and are destroyed when they are not used for a certain amount of time. Created instances are set up via Ansible. If more than one instance is requested at the exact same time, Slurm will pass those into the resume script together and one Ansible call will handle all those instances.
However, more often than not workflows will request multiple instances within the same second, but not at the exact same time. This leads to multiple resume script calls and therefore to multiple Ansible calls. This will lead to less clear log files, greater CPU consumption by the multiple running Ansible calls and so on.
What I am looking for is an option to force Slurm to wait a certain amount and then perform a single resume call for all instances within that time frame (let's say 1 second).
Is this somehow possible?
Best regards, Xaver