I want to run 50 sequential jobs (essentially 50 copies of the same code with different input parameters) on a particular node. However, as soon as one of the jobs starts executing, the other 49 jobs are killed immediately with exit code 9. The jobs do not interact and are strictly parallel. However, if the 50 jobs run on 50 different nodes, everything completes successfully. Can anyone please help with possible fixes? I found a discussion along similar lines at https://groups.google.com/g/slurm-users/c/I1T6GWcLjt4 but could not get to the final solution there.
Dear Arko,
If the jobs are independent, why do you want to run them all on the same node?
If you do have problems when jobs run on the same node, there may be an issue with the jobs all trying to access a single resource, such as a file. However, you probably need to show your job script in order for anyone to be able to work out what is going on.
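For illustration, a minimal sketch of one way to avoid such contention, assuming the program writes its results into the current working directory (the directory naming here is hypothetical):

  # give each job its own work directory so runs never share files
  workdir="$HOME/myruns/job_${SLURM_JOB_ID}"
  mkdir -p "$workdir"
  cd "$workdir" || exit 1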
Regards
Loris
--
Arko Roy
Assistant Professor
School of Physical Sciences
Indian Institute of Technology Mandi
Kamand, Mandi, Himachal Pradesh - 175 005, India
Email: arko@iitmandi.ac.in
Web: https://faculty.iitmandi.ac.in/~arko/
Thanks Loris and Gareth. Here is the job submission script; if you find any errors, please let me know. Since I am not the admin but just a user, I don't think I have access to the prolog and epilog files.
If the jobs are independent, why do you want to run them all on the same node?

I am running sequential codes: essentially 50 copies of the same code with a variation in parameters. Since I am using the Slurm scheduler, the nodes and cores are allocated depending on the available resources. So there are instances when 20 of the jobs go to 20 free cores located on one node and the remaining 30 go to 30 free cores on another node. It turns out that only 1 job out of the 20 and 1 job out of the 30 completes successfully with exit code 0, and the rest are terminated with exit code 9. For information, I run sjobexitmod -l jobid to check the exit codes.
----------------------------------
The submission script is as follows:
#!/bin/bash
################
# Setting slurm options
################

# lines starting with "#SBATCH" define your job's parameters

# requesting the type of node on which to run job
##SBATCH --partition <partition name>
#SBATCH --partition=standard

# telling slurm how many instances of this job to spawn (typically 1)
##SBATCH --ntasks <number>
##SBATCH --ntasks=1
#SBATCH --nodes=1
##SBATCH -N 1
##SBATCH --ntasks-per-node=1

# setting number of CPUs per task (1 for serial jobs)
##SBATCH --cpus-per-task <number>
##SBATCH --cpus-per-task=1

# setting memory requirements
##SBATCH --mem-per-cpu <memory in MB>
#SBATCH --mem-per-cpu=1G

# propagating max time for job to run
##SBATCH --time days-hours:minutes:seconds
##SBATCH --time hours:minutes:seconds
##SBATCH --time <minutes>
#SBATCH --time 10:0:0
#SBATCH --job-name gstate

#module load compiler/intel/2018_4
module load fftw-3.3.10-intel-2021.6.0-ppbepka
echo "Running on $(hostname)"
echo "We are in $(pwd)"

################
# run the program
################
/home/arkoroy.sps.iitmandi/ferro-detun/input1/a_1.out &
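As a side note on diagnosing why the other jobs die, a minimal sketch of a Slurm accounting query that shows the recorded state and exit code (exit:signal) of a finished job, in addition to the sjobexitmod call mentioned above (the job ID is a placeholder):

  # show how Slurm recorded the job: State plus ExitCode as exit:signal
  sacct -j 123456 --format=JobID,JobName,State,ExitCode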
Dear Arko,
Arko Roy arko@iitmandi.ac.in writes:
[...]

/home/arkoroy.sps.iitmandi/ferro-detun/input1/a_1.out &
You should not write '&' at the end of the above command. This will run your program in the background, which will cause the submit script to terminate, which in turn will terminate your job.
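For illustration, a minimal sketch of how the end of the script could look instead (same binary path as above; either run the program in the foreground, or keep the '&' but add a 'wait' so the script does not exit before the program finishes):

  ################
  # run the program
  ################
  # Option 1: run in the foreground (simplest)
  /home/arkoroy.sps.iitmandi/ferro-detun/input1/a_1.out

  # Option 2: run in the background, then wait for it to finish
  # /home/arkoroy.sps.iitmandi/ferro-detun/input1/a_1.out &
  # wait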
Regards
Loris
Dear Loris,
I just checked: removing the & didn't work.
Since each instance of the program is independent and you are using one core for each, it would be better to let Slurm deal with that and schedule the jobs concurrently as it sees fit. Maybe you simply need to add some directive to allow shared jobs on the same node. Alternatively (if jobs at your site must be exclusive), you have to check what the recommended way to do this is. Some sites prefer dask, some an MPI-based serial-job consolidation (often called a "command file"), and others a technique similar to what you are doing; but instead of reinventing the wheel, I suggest checking what your site recommends in this situation.
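For illustration, a minimal sketch of one common technique, a Slurm job array, which lets the scheduler place the 50 independent runs wherever cores are free. The directory layout input${SLURM_ARRAY_TASK_ID} and the binary name a_1.out are assumptions based on the script quoted above; adapt them to how the parameter sets are actually organised:

  #!/bin/bash
  #SBATCH --partition=standard
  #SBATCH --array=1-50            # one array task per parameter set
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=1
  #SBATCH --mem-per-cpu=1G
  #SBATCH --time=10:00:00
  #SBATCH --job-name=gstate

  module load fftw-3.3.10-intel-2021.6.0-ppbepka

  # SLURM_ARRAY_TASK_ID runs from 1 to 50; use it to select the input
  /home/arkoroy.sps.iitmandi/ferro-detun/input${SLURM_ARRAY_TASK_ID}/a_1.out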
IIRC, slurm parses the batch file as options until it hits the first non-comment line, which includes blank lines.
You may want to double-check some of the gaps in the option section of your batch script.
That being said, you say you removed the '&' at the end of the command, which should help.

If they are all exiting with exit code 9, you need to look at the code for your a.out to see what code 9 means, as that is what is raising the error. Slurm sees that code and, if it is non-zero, interprets the job as failed.
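For illustration, a minimal sketch of how the batch script could record the program's own exit status in the job output, which helps distinguish a code returned by the program itself from a job killed by Slurm (path taken from the script quoted above):

  /home/arkoroy.sps.iitmandi/ferro-detun/input1/a_1.out
  status=$?
  echo "a_1.out exited with status ${status}"
  exit ${status}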
Brian Andrus
Brian Andrus via slurm-users slurm-users@lists.schedmd.com writes:
IIRC, slurm parses the batch file as options until it hits the first non-comment line, which includes blank lines.
Blank lines do not stop sbatch from parsing the file. (But commands do.)
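For illustration, a minimal sketch of that rule, using a trivial script: blank lines and comments between #SBATCH directives are fine, but a directive placed after the first executable command is ignored:

  #!/bin/bash
  #SBATCH --job-name=demo

  # blank lines and comments do not stop option parsing
  #SBATCH --time=00:10:00    # still honoured

  hostname                   # first executable command

  #SBATCH --mem=1G           # ignored: appears after a command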