[slurm-users] ticking time bomb? launching too many jobs in parallel

Mark Hahn hahn at mcmaster.ca
Fri Aug 30 03:29:36 UTC 2019

>> Here's an example on how to do so from the Compute Canada docs:
>> https://docs.computecanada.ca/wiki/GNU_Parallel#Running_on_Multiple_Nodes
> [name at server ~]$ parallel --jobs 32 --sshloginfile
> ./node_list_${SLURM_JOB_ID} --env MY_VARIABLE --workdir $PWD ./my_program
> To me it looks like you're circumventing the scheduler when you do this;
> maybe I'm missing something?

our (ComputeCanada) setup includes pam_slurm_adopt, so if a user sshes to a
node on which they have resources, any processes they start get adopted into
the job's cgroup.  we don't really care how the user consumes the resources,
as long as it's only what's allocated to their jobs, doesn't interfere with
other users, and is hopefully reasonably efficient.  heck, we configure
clusters with hostbased trust, so it's easy for users to ssh among nodes.
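for concreteness, the quoted workflow could be wrapped in a job script along
these lines (a sketch only: the #SBATCH resource lines and the `::: input_*.txt`
argument list are assumptions, not from the thread; `./my_program` and
`MY_VARIABLE` are taken from the quoted command):

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --time=01:00:00

# expand the compact nodelist (e.g. node[1-2]) into one hostname per line,
# which is the format parallel's --sshloginfile expects
scontrol show hostnames "$SLURM_JOB_NODELIST" > "./node_list_${SLURM_JOB_ID}"

# run up to 32 jobs per node, sshing only to nodes already allocated to this
# job; with pam_slurm_adopt those ssh sessions land in the job's cgroup
parallel --jobs 32 --sshloginfile "./node_list_${SLURM_JOB_ID}" \
         --env MY_VARIABLE --workdir "$PWD" ./my_program ::: input_*.txt

rm "./node_list_${SLURM_JOB_ID}"
```

since parallel only sshes to hosts in the node list, and the list is built
from the job's own allocation, nothing runs outside what the scheduler granted.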

Mark Hahn | SHARCnet Sysadmin | hahn at sharcnet.ca | http://www.sharcnet.ca
           | McMaster RHPCS    | hahn at mcmaster.ca | 905 525 9140 x24687
           | Compute/Calcul Canada                | http://www.computecanada.ca
