[slurm-users] multi-process/thread jobs:: configuration and job specification
Adrian Sevcenco
Adrian.Sevcenco at spacescience.ro
Wed Sep 29 12:03:18 UTC 2021
Hi! I'm trying to prepare and test for some upcoming jobs that will
use multiple processes (I have no control over this: multiple executables
are started in parallel within the job and communicate with each other through
a customization of zmq).
The submission method is that an sbatch script is submitted on my site (generated by a locally running Compute Element (CE)
service), and it contains an srun line that runs a given script.
For testing I'm trying to use the same format of the sbatch script, but with the following additions to the srun
line:
--ntasks=1 --cpus-per-task=8
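As a rough illustration of the test layout described above, here is a minimal sketch. The file name test_job.sbatch, the job name, the output file patterns, and the echo payload are assumptions for illustration only; this is not the CE-generated script.

```shell
#!/bin/bash
# Write a minimal test batch script (file name and payload are hypothetical).
cat > test_job.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=TEST_JOB
#SBATCH --output=test_job_%j.out
#SBATCH --error=test_job_%j.err

# Only the job step asks for 8 CPUs here; nothing is requested through
# #SBATCH, so the batch allocation itself falls back to the site defaults.
srun --ntasks=1 --cpus-per-task=8 /bin/bash -c 'echo "payload running on $(hostname)"'
EOF

# Submit by hand with: sbatch test_job.sbatch
echo "wrote test_job.sbatch"
```

In this layout the CPU request exists only on the srun line, which is worth keeping in mind when comparing it with the CPU count that the job information reports for the allocation.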
The result so far is that I get no errors, but the job also does not run (the payload is just some echo),
and it leaves the queue too quickly for me to see it there.
This is on:
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory,CR_LLN
with
TaskPlugin=task/affinity,task/cgroup
TaskPluginParam=autobind=threads
So, for the scenario presented above, which site-side settings should I be aware of and take care of,
and which settings should go in the sbatch/srun components that I can ask the experiment
to adhere to?
Should I ask that the resource request be made in the sbatch file or command instead?
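For comparison, the alternative raised in the last question could look roughly like the sketch below, where the resources are requested in the #SBATCH header and the srun step reuses them. The file name test_job_alloc.sbatch, the job name, and the payload are hypothetical, and whether this form is the appropriate one to ask of the experiment is exactly the open question.

```shell
#!/bin/bash
# Write a variant of the test script that requests resources at allocation
# time (file name and payload are hypothetical).
cat > test_job_alloc.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=TEST_JOB
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

# SLURM_CPUS_PER_TASK is set in the batch environment because
# --cpus-per-task was given above; pass it on to the step explicitly.
srun --ntasks=1 --cpus-per-task="${SLURM_CPUS_PER_TASK}" /bin/bash -c 'echo "payload running on $(hostname)"'
EOF

# Submit by hand with: sbatch test_job_alloc.sbatch
echo "wrote test_job_alloc.sbatch"
```

The same request could also be given on the sbatch command line (sbatch --ntasks=1 --cpus-per-task=8 ...) instead of in the file.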
In a test where the test job stayed in the queue waiting for execution, the job info
shows:
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=3950M,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
Then, after execution, I have no output, not even the stdout and stderr files.
sacct shows just:
aliprod at alien: job_test $ sacct -j 8322339
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
8322339      TEST_JOB_+      alien    aliprod          1     FAILED      1:0
8322339.bat+      batch               aliprod          1     FAILED      1:0
8322339.ext+     extern               aliprod          1  COMPLETED      0:0
The slurmctld log shows no information.
Any idea how I can debug this further?
Thanks a lot for any info!
Adrian