[slurm-users] Slurm memory error in child process of a system() function call after malloc()
Péter Nagy
nagyrpeter at gmail.com
Tue Jan 8 12:33:44 MST 2019
Dear Users,
our FORTRAN-based code performs shell operations via either the system()
function or the corresponding system subroutine. Under Slurm this fails in
certain cases. For instance, input files are manipulated as
istatus=system('cp file1 file2').
When running under the Slurm scheduler, system() returns -1 (or 255 if
interpreted as unsigned) and the requested shell operation is not performed
whenever the system() call follows a malloc() that allocates more than half
of the memory available to the Slurm job. Unfortunately, there is no error
message in either the error output or the Slurm output file.
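Since system() returns -1 precisely when the underlying fork() or waitpid()
fails (with errno set), one way to obtain at least some diagnostics would be
to check errno and the wait status right after the call. A small sketch of
what could be added around each call (the wrapper name run_and_report is
only illustrative, not part of our code):

    /* Sketch: decode the result of a system() call. */
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/wait.h>

    void run_and_report(const char *cmd)
    {
        errno = 0;
        int st = system(cmd);
        if (st == -1)                        /* fork() or waitpid() failed */
            fprintf(stderr, "system(\"%s\"): %s\n", cmd, strerror(errno));
        else if (WIFEXITED(st) && WEXITSTATUS(st) != 0)
            fprintf(stderr, "\"%s\" exited with status %d\n",
                    cmd, WEXITSTATUS(st));
    }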
Our code allocates all of the available memory at once, via a C interface,
with that single malloc() call and works within that allocated array for the
entire runtime. All system() calls that precede the malloc() are performed
correctly, and all system() calls fail from right after the malloc() onwards.
If less than half of the Slurm job's memory limit is allocated with malloc(),
then all system() calls are performed perfectly.
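To make the pattern concrete, a minimal self-contained C sketch of what our
code does would look roughly like this (the 6 GiB allocation and the 8 GiB
job limit are only illustrative figures standing for "more than half of the
job's memory"; the file names are likewise illustrative):

    /* Minimal reproducer sketch; sizes and file names are illustrative. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        if (system("cp file1 file2") != 0)          /* works before malloc() */
            fprintf(stderr, "copy before malloc failed\n");

        size_t n = (size_t)6 * 1024 * 1024 * 1024;  /* > half of an 8 GiB job */
        char *buf = malloc(n);
        if (!buf) { perror("malloc"); return 1; }
        memset(buf, 0, n);                          /* touch the pages */

        if (system("cp file1 file2") != 0)          /* fails after malloc() */
            fprintf(stderr, "copy after malloc failed\n");

        free(buf);
        return 0;
    }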
I have tried setting the memory limit either with --mem-per-cpu or with
--mem, and I also tried --mem=0 together with --exclusive.
I have tried different clusters running Slurm versions 14.03.9 and 17.11.12
and several otherwise well-working FORTRAN compilers, and found the same
error consistently.
Performing shell operations with system() also works perfectly on the same
node, using the full memory, when no scheduler is involved. There is no
problem with the SGE, OAR, or Condor schedulers either, irrespective of the
allocated memory size.
Our guess is that there might be a Slurm-specific setting which does not
allow a shell/child process to be forked once more than half of the memory
limit is consumed by the parent process. Because fork() initially duplicates
the parent's address space, Slurm (or the memory limit it imposes) might
assume that the child process needs the same amount of memory as the parent
and cancel it because the Slurm job's memory limit would be exceeded.
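One way to check this hypothesis would be to call fork() directly after the
large allocation and look at errno; a sketch (not yet tested on our side,
and the size is again illustrative):

    /* Sketch: does fork() itself fail once the parent holds the
       large allocation? */
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        size_t n = (size_t)6 * 1024 * 1024 * 1024;  /* > half of the job limit */
        char *buf = malloc(n);
        if (!buf) { perror("malloc"); return 1; }
        memset(buf, 0, n);                          /* make the pages resident */

        pid_t pid = fork();
        if (pid == -1)
            fprintf(stderr, "fork: %s\n", strerror(errno)); /* ENOMEM/EAGAIN? */
        else if (pid == 0)
            _exit(0);                               /* child exits immediately */
        else
            waitpid(pid, NULL, 0);

        free(buf);
        return 0;
    }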
Unfortunately, I did not find any error message or related error reports
and got stuck here.
Could you please help with suggestions on how we could utilize the memory up
to the Slurm job's memory limit?
Thank you very much in advance,
Peter