Since each instance of the program is independent and uses one core, it would be better to let Slurm handle this and schedule them concurrently as it sees fit. Maybe you simply need to add a directive that allows jobs to share a node. Alternatively, if jobs must be exclusive at your site, check what their recommended way of doing this is. Some sites prefer Dask, others an MPI-based serial-job consolidation (often called a "command file"), and others a technique similar to what you are doing. Instead of reinventing the wheel, I suggest checking what your site recommends in this situation.
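As a rough illustration of letting Slurm schedule the independent runs itself, here is a minimal sketch using a job array. It assumes job arrays are enabled at your site and that the 50 parameter sets live in directories named input1 ... input50, each with its own a_<N>.out. That layout, the run_root variable and the dry-run fallback are my assumptions, not something taken from your script.

```shell
#!/bin/bash
#SBATCH --job-name=gstate
#SBATCH --partition=standard
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --time=10:00:00
#SBATCH --array=1-50

# Slurm sets SLURM_ARRAY_TASK_ID for each array element.
# Default to 1 so the script can be dry-run outside Slurm.
task_id="${SLURM_ARRAY_TASK_ID:-1}"

# Hypothetical layout: one input directory and one executable per parameter set.
run_root="${HOME}/ferro-detun"
exe="${run_root}/input${task_id}/a_${task_id}.out"

echo "Task ${task_id} running on $(hostname): ${exe}"

# Run in the foreground so the batch script only ends when the program does.
if [ -x "$exe" ]; then
    "$exe"
else
    echo "Executable not found, nothing to run (dry run)"
fi
```

Each array element is then a separate one-core job, so Slurm can place the elements on whatever cores are free, and each gets its own exit code.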
On Mon, Aug 19, 2024 at 2:24 AM Arko Roy via slurm-users <slurm-users@lists.schedmd.com> wrote:
Dear Loris,
I just tried removing the &, but it didn't work.
On Mon, Aug 19, 2024 at 1:43 PM Loris Bennett <loris.bennett@fu-berlin.de> wrote:
Dear Arko,
Arko Roy <arko@iitmandi.ac.in> writes:
Thanks, Loris and Gareth. Here is the job submission script. If you find
any errors, please let me know.
Since I am not the admin but just a user, I think I don't have access
to the prolog and epilog files.
If the jobs are independent, why do you want to run them all on the same node?
I am running sequential codes. Essentially 50 copies of the same code
with a variation in parameter.
Since I am using the Slurm scheduler, the nodes and cores are allocated
depending upon the
available resources. So there are instances when 20 of them go to 20
free cores located on a particular
node and the remaining 30 go to the 30 free cores on another node. It
turns out that only 1 job out of 20 and 1 job
out of 30 complete successfully with exit code 0, and the rest get
terminated with exit code 9.
For information, I run sjobexitmod -l jobid to check the exit codes.
The submission script is as follows:
#!/bin/bash
################
# Setting slurm options
################

# lines starting with "#SBATCH" define your jobs parameters
# requesting the type of node on which to run job
##SBATCH --partition <patition name>
#SBATCH --partition=standard

# telling slurm how many instances of this job to spawn (typically 1)
##SBATCH --ntasks <number>
##SBATCH --ntasks=1
#SBATCH --nodes=1
##SBATCH -N 1
##SBATCH --ntasks-per-node=1

# setting number of CPUs per task (1 for serial jobs)
##SBATCH --cpus-per-task <number>
##SBATCH --cpus-per-task=1

# setting memory requirements
##SBATCH --mem-per-cpu <memory in MB>
#SBATCH --mem-per-cpu=1G

# propagating max time for job to run
##SBATCH --time days-hours:minute:seconds
##SBATCH --time hours:minute:seconds
##SBATCH --time <minutes>
#SBATCH --time 10:0:0
#SBATCH --job-name gstate

#module load compiler/intel/2018_4
module load fftw-3.3.10-intel-2021.6.0-ppbepka
echo "Running on $(hostname)"
echo "We are in $(pwd)"

################
# run the program
################
/home/arkoroy.sps.iitmandi/ferro-detun/input1/a_1.out &
You should not write
&
at the end of the above command. This will run your program in the background, which will cause the submit script to terminate, which in turn will terminate your job.
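To illustrate the point, here is a small sketch of what a script has to do if it really does start programs in the background: it must call the shell builtin wait before it ends. The run_case function is a hypothetical stand-in for your program, and the temporary log file is only there to make the effect visible.

```shell
#!/bin/bash
# Hypothetical stand-in for the real program (e.g. a_1.out):
# it sleeps briefly, then records that it finished.
log=$(mktemp)
run_case() {
    sleep 1
    echo "case $1 finished" >> "$log"
}

# Start three cases in the background ...
for i in 1 2 3; do
    run_case "$i" &
done

# ... and block here until all of them have exited.
# Without this line the script would end immediately, and Slurm
# would clean up whatever processes the job still has running.
wait

done_count=$(grep -c finished "$log")
echo "${done_count} of 3 cases finished"
rm -f "$log"
```

For a single program, simply dropping the & has the same effect. Note that several background processes in one job only make sense if the job has requested enough cores for them.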
Regards
Loris
-- Dr. Loris Bennett (Herr/Mr) FUB-IT, Freie Universität Berlin