Dear Arko,
Arko Roy via slurm-users slurm-users@lists.schedmd.com writes:
> I want to run 50 sequential jobs (essentially 50 copies of the same code with different input parameters) on a particular node. However, as soon as one of the jobs starts executing, the other 49 jobs are killed immediately with exit code 9. The jobs do not interact and are strictly independent. If the 50 jobs run on 50 different nodes, they complete successfully. Can anyone please help with possible fixes? I found a discussion along similar lines at https://groups.google.com/g/slurm-users/c/I1T6GWcLjt4 but could not find a final solution there.
If the jobs are independent, why do you want to run them all on the same node?
If you do have problems when the jobs share a node, they may all be competing for a single resource, such as a file they all write to. Note that exit code 9 usually corresponds to SIGKILL, which on a shared node often points to the kernel's OOM killer, or to Slurm enforcing memory limits because the combined jobs request or consume more memory than the node has. In any case, you will probably need to post your job script before anyone can work out what is going on.
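If the kills do turn out to be memory-related, one common way to run many independent copies on shared nodes is a job array with an explicit per-task memory request, so that each task asks only for what one copy needs rather than defaulting to the whole node. A minimal sketch (the program name `./mycode` and the `input_N.dat` files are placeholders for your own setup):

```shell
#!/bin/bash
#SBATCH --job-name=param_sweep
#SBATCH --array=1-50            # 50 independent tasks, one per parameter set
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2G        # request only what one copy needs, not the whole node
#SBATCH --output=sweep_%A_%a.out

# SLURM_ARRAY_TASK_ID selects the input parameters for this copy
./mycode input_${SLURM_ARRAY_TASK_ID}.dat
```

Submitted with `sbatch sweep.sh`, this lets Slurm pack as many tasks onto a node as its CPUs and memory allow; whether several tasks can actually share one node also depends on the cluster's SelectType/ExclusiveUser configuration, which your admins can confirm.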
Regards
Loris
--
Arko Roy
Assistant Professor
School of Physical Sciences
Indian Institute of Technology Mandi
Kamand, Mandi
Himachal Pradesh - 175 005, India
Email: arko@iitmandi.ac.in
Web: https://faculty.iitmandi.ac.in/~arko/