[slurm-users] ticking time bomb? launching too many jobs in parallel

Jarno van der Kolk jvanderk at uottawa.ca
Thu Aug 29 18:29:56 UTC 2019


On 8/29/19 12:48 PM, Goetz, Patrick G wrote:
> On 8/29/19 9:38 AM, Jarno van der Kolk wrote:
> > Here's an example on how to do so from the Compute Canada docs:
> > 
> https://docs.computecanada.ca/wiki/GNU_Parallel#Running_on_Multiple_Nodes
> >
> 
> [name at server ~]$ parallel --jobs 32 --sshloginfile ./node_list_${SLURM_JOB_ID} --env MY_VARIABLE --workdir $PWD ./my_program
> 
> 
> To me it looks like you're circumventing the scheduler when you do this;
> maybe I'm missing something?
> 
> Also, where are these environment variables:
> 
>    SLURM_JOB_NODELIST, SLURM_JOB_ID
> 
> being set?
> 

I guess you kind of are. The advantage of this over array jobs is that you can provide a list of jobs instead of depending on SLURM_ARRAY_TASK_ID, while still only doing one submission to the scheduler. So instead of submitting hundreds or even thousands of little jobs and waiting for the scheduler to accept them all, you submit once and are done. Parallel functions as a sub-scheduler, if you will.
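To make the pattern concrete, a single-node version looks roughly like this (a minimal sketch, not from the original thread; the file name task_list.txt and the resource numbers are hypothetical):

```shell
#!/bin/bash
# Hypothetical submission script: one sbatch submission, many small tasks.
# task_list.txt is an assumed input file with one shell command per line.
#SBATCH --ntasks=32
#SBATCH --time=02:00:00

# GNU parallel reads the command list from stdin and keeps at most
# $SLURM_NTASKS commands running at once, acting as the "sub-scheduler"
# inside the single Slurm allocation.
parallel --jobs ${SLURM_NTASKS} < ./task_list.txt
```

You would submit this once with `sbatch`, rather than submitting each line of task_list.txt as its own job.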

Those environment variables are set by Slurm in the job's environment when the job starts.
See also https://slurm.schedmd.com/sbatch.html#SECTION_OUTPUT-ENVIRONMENT-VARIABLES
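For the multi-node command quoted above, the node_list file is typically generated inside the job script from SLURM_JOB_NODELIST before invoking parallel. A sketch, assuming the layout from the Compute Canada page:

```shell
#!/bin/bash
# Hypothetical multi-node job script fragment.
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32

# SLURM_JOB_NODELIST holds a compressed range like "node[1-2]";
# "scontrol show hostnames" expands it to one hostname per line,
# which is the format GNU parallel's --sshloginfile expects.
scontrol show hostnames ${SLURM_JOB_NODELIST} > ./node_list_${SLURM_JOB_ID}
```

After that, the `parallel --sshloginfile ./node_list_${SLURM_JOB_ID} ...` invocation can dispatch tasks over ssh to every node in the allocation.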

Regards,
Jarno

Jarno van der Kolk, PhD Phys.
Analyste principal en informatique scientifique | Senior Scientific Computing Specialist
Solutions TI | IT Solutions
Université d’Ottawa | University of Ottawa

