[slurm-users] multi-process/thread jobs:: configuration and job specification

Adrian Sevcenco Adrian.Sevcenco at spacescience.ro
Wed Sep 29 12:03:18 UTC 2021


Hi! I'm trying to prepare and test for some upcoming jobs that will
use multiple processes (I have no control over this: multiple executables
are started in parallel within the job and communicate with each other through
a customization of zmq).

The submission method is that an sbatch script is submitted at my site (generated by a locally running Compute Element (CE)
service), and it contains an srun line that runs a given script.

For testing, I'm trying to use the same format of the sbatch script, but with these additions to the srun
line:
--ntasks=1 --cpus-per-task=8
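
For reference, a minimal sketch of the kind of test script described above. The file name, job name, and output file names are hypothetical, and the real CE-generated script will differ; only the two srun options come from this post. The snippet just writes the script and syntax-checks it, so it can be run anywhere.

```shell
# Write a hypothetical test script (test_job.sh) mirroring the described setup.
cat > test_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=cpu_test            # hypothetical name
#SBATCH --output=cpu_test_%j.out       # hypothetical stdout file (%j = job id)
#SBATCH --error=cpu_test_%j.err        # hypothetical stderr file

# Same srun form as described, plus the two added options; payload is an echo.
srun --ntasks=1 --cpus-per-task=8 bash -c 'echo "payload running on $(hostname)"'
EOF

# Local syntax check only; nothing is submitted here.
bash -n test_job.sh && echo "test_job.sh parses cleanly"
```

It would then be submitted with `sbatch test_job.sh`. Note that in this form the batch script itself carries no CPU request; only the job step does.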

The result so far is that I get no errors, but the job also does not run (the payload is just some echo),
and it is gone too quickly for me to see it in the queue.

This is on:
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory,CR_LLN

with
TaskPlugin=task/affinity,task/cgroup
TaskPluginParam=autobind=threads

So, for the scenario presented above, what are the site-side settings to be aware of or take care of,
and what settings should go in the sbatch/srun components that I can ask the experiment
to adhere to?

Should I ask for the resource request to be in the sbatch file or command instead?
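
To make the alternative concrete, here is a sketch of the same test with the request moved into the sbatch header, so that the allocation itself holds 8 CPUs. File names are hypothetical. My understanding is that sbatch exports SLURM_CPUS_PER_TASK when --cpus-per-task is given, and the sketch passes it on to srun explicitly rather than relying on inheritance.

```shell
# Write a hypothetical variant (test_job_header.sh) with the request in the header.
cat > test_job_header.sh <<'EOF'
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --output=header_test_%j.out    # hypothetical stdout file

# Pass the per-task CPU count from the allocation on to the job step.
srun --ntasks=1 --cpus-per-task="${SLURM_CPUS_PER_TASK}" \
    bash -c 'echo "payload sees $(nproc) CPUs"'
EOF

# Local syntax check only; nothing is submitted here.
bash -n test_job_header.sh && echo "test_job_header.sh parses cleanly"
```

The "command" form would be `sbatch --ntasks=1 --cpus-per-task=8 test_job.sh`, leaving the script itself unchanged; options given on the sbatch command line take precedence over #SBATCH lines in the file.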

In a test where the test job stayed in the queue waiting for execution, the job info
shows:
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=3950M,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
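
A small sketch for pulling single fields out of this kind of job info, here fed with the first two lines quoted above. The helper name get_field is made up for illustration. The values may be worth comparing against the 8 CPUs requested on the srun line.

```shell
# Sample copied from the job info quoted above.
job_info='NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=3950M,node=1,billing=1'

# get_field KEY: read KEY=value tokens on stdin, print the value for KEY.
get_field() {
    tr ' ' '\n' | awk -F= -v key="$1" '$1 == key { print $2; exit }'
}

num_cpus=$(printf '%s\n' "$job_info" | get_field NumCPUs)
cpus_per_task=$(printf '%s\n' "$job_info" | get_field 'CPUs/Task')
echo "NumCPUs=$num_cpus CPUs/Task=$cpus_per_task"
```

On a live system the same helper could be fed from `scontrol show job <jobid>` instead of the pasted sample.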

Then, after execution, I have no output, not even the stdout and stderr files.

sacct shows just:
aliprod@alien: job_test $ sacct -j 8322339
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
8322339      TEST_JOB_+      alien    aliprod          1     FAILED      1:0
8322339.bat+      batch               aliprod          1     FAILED      1:0
8322339.ext+     extern               aliprod          1  COMPLETED      0:0


The slurmctld log shows no info.

Any idea how I can debug this further?
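
For context, a sketch of the checks that seem like the obvious next steps, wrapped in a hypothetical helper debug_job. It assumes the Slurm client tools are on the PATH and only prints a hint otherwise. The sacct field list is just one possible selection; the grep at the end locates the daemon log files, since the slurmd log on the compute node may say more than slurmctld does (for example about failing to open the stdout/stderr files).

```shell
# debug_job JOBID: print accounting and config details for a post-mortem.
debug_job() {
    local jobid="$1"
    if ! command -v sacct >/dev/null 2>&1; then
        echo "Slurm client tools not found; run this on a submit node"
        return 0
    fi
    # Wider accounting view than the default, including the node that ran the job.
    sacct -j "$jobid" --format=JobID,JobName%20,State,ExitCode,AllocCPUS,NodeList,Start,End
    # Only works while the controller still remembers the job (see MinJobAge).
    scontrol show job "$jobid" || true
    # Where the daemons log and how verbosely.
    scontrol show config | grep -Ei 'SlurmdLogFile|SlurmctldLogFile|SlurmdDebug' || true
}

debug_job 8322339
```

While the job is still known to the controller, the `scontrol show job` output also lists the WorkDir, StdOut, and StdErr paths, which can be checked for existence and write permission on the compute node.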

Thanks a lot for info!!
Adrian


