[slurm-users] ticking time bomb? launching too many jobs in parallel
Jarno van der Kolk
jvanderk at uottawa.ca
Thu Aug 29 18:29:56 UTC 2019
On 8/29/19 12:48 PM, Goetz, Patrick G wrote:
> On 8/29/19 9:38 AM, Jarno van der Kolk wrote:
> > Here's an example on how to do so from the Compute Canada docs:
> >
> https://docs.computecanada.ca/wiki/GNU_Parallel#Running_on_Multiple_Nodes
> >
>
> [name at server ~]$ parallel --jobs 32 --sshloginfile
> ./node_list_${SLURM_JOB_ID} --env MY_VARIABLE --workdir $PWD ./my_program
>
>
> To me it looks like you're circumventing the scheduler when you do this;
> maybe I'm missing something?
>
> Also, where are these environment variables:
>
> SLURM_JOB_NODELIST, SLURM_JOB_ID
>
> being set?
>
I guess you kind of are. The advantage of this over array jobs is that you can provide a list of tasks instead of depending on SLURM_ARRAY_TASK_ID, while still making only one submission to the scheduler. So instead of submitting hundreds or even thousands of little jobs and waiting for the scheduler to accept them all, you submit once and are done. Parallel functions as a sub-scheduler, if you will.
Those environment variables are set by Slurm in the job's environment when the job starts.
See also https://slurm.schedmd.com/sbatch.html#SECTION_OUTPUT-ENVIRONMENT-VARIABLES
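To make the pieces concrete, here is a minimal job-script sketch along the lines of the Compute Canada recipe. The resource requests, MY_VARIABLE, ./my_program, and the task_*.in inputs are all placeholders; the real steps are expanding SLURM_JOB_NODELIST into a one-hostname-per-line file (which --sshloginfile expects) and then launching parallel once:

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --time=01:00:00

# SLURM_JOB_NODELIST holds a compressed range (e.g. "node[01-02]");
# scontrol expands it to one hostname per line for --sshloginfile.
scontrol show hostnames "$SLURM_JOB_NODELIST" > "./node_list_${SLURM_JOB_ID}"

# Must be exported so parallel's --env can copy it to the remote shells.
export MY_VARIABLE="example"

# One submission to Slurm; GNU Parallel then acts as the sub-scheduler,
# farming the tasks out over ssh to every allocated node and keeping
# at most 32 of them running per node at a time.
parallel --jobs 32 --sshloginfile "./node_list_${SLURM_JOB_ID}" \
         --env MY_VARIABLE --workdir "$PWD" ./my_program ::: task_*.in
```

Since this only runs inside a Slurm allocation, treat it as a sketch to adapt rather than something to execute as-is.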
Regards,
Jarno
Jarno van der Kolk, PhD Phys.
Analyste principal en informatique scientifique | Senior Scientific Computing Specialist
Solutions TI | IT Solutions
Université d’Ottawa | University of Ottawa