<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
I think this should already be fixed in the upcoming release. See:
<a class="moz-txt-link-freetext" href="https://github.com/SchedMD/slurm/commit/947bccd2c5c7344e6d09dab565e2cc6663eb9e72">https://github.com/SchedMD/slurm/commit/947bccd2c5c7344e6d09dab565e2cc6663eb9e72</a><br>
<br>
<div class="moz-cite-prefix">On 5/8/18 12:08 PM,
<a class="moz-txt-link-abbreviated" href="mailto:a.vitalis@bioc.uzh.ch">a.vitalis@bioc.uzh.ch</a> wrote:<br>
</div>
<blockquote type="cite"
cite="mid:OF73A183AA.26C2E4A8-ONC1258287.00623F1B-C1258287.0063A082@lotus.uzh.ch"><font
face="Default Sans Serif,Verdana,Arial,Helvetica,sans-serif"
size="2">Dear all,<br>
<br>
I tried to debug this with some apparent success (for now).<br>
<br>
If anyone cares:<br>
With the help of gdb attached to sbatch, I tracked the immediate
segfault down to strcmp.<br>
I then instrumented src/srun/srun.c with some info statements and
isolated this function as the culprit:<br>
<font face="Default Monospace,Courier New,Courier,monospace">static
void _setup_env_working_cluster(void)<br>
<br>
</font>With my configuration, this routine ended up calling
strcmp on two NULL pointers, which segfaults on our system
(passing NULL to strcmp is undefined behavior in C, so this is
not a compliant use of the function). My current understanding
is that this is a Slurm bug.<br>
<br>
The issue can be worked around by simply giving the cluster a
name in slurm.conf (e.g., ClusterName=bla). I am not using slurmdbd, by the way.<br>
<br>
Hope this helps,<br>
Andreas<br>
<br>
<br>
<font color="#990099">-----"slurm-users" &lt;<a
href="mailto:slurm-users-bounces@lists.schedmd.com"
target="_blank" moz-do-not-send="true">slurm-users-bounces@lists.schedmd.com</a>&gt;
wrote: -----</font>
<div class="iNotesHistory" style="padding-left:5px;">
<div
style="padding-right:0px;padding-left:5px;border-left:solid
black 2px;">To: <a
href="mailto:slurm-users@lists.schedmd.com"
target="_blank" moz-do-not-send="true">slurm-users@lists.schedmd.com</a><br>
From: <a href="mailto:a.vitalis@bioc.uzh.ch"
target="_blank" moz-do-not-send="true">a.vitalis@bioc.uzh.ch</a><br>
Sent by: "slurm-users" &lt;slurm-users-bounces@lists.schedmd.com&gt;<br>
Date: 05/08/2018 12:44AM<br>
Subject: [slurm-users] srun seg faults immediately from
within sbatch but not salloc<br>
<br>
<font face="Default Sans
Serif,Verdana,Arial,Helvetica,sans-serif" size="2">Dear
all,<br>
<br>
I am trying to set up a small cluster running slurm on
Ubuntu 16.04.<br>
I installed slurm-17.11.5 along with pmix-2.1.1 on an
NFS-shared partition. Installation seems fine. Munge is
taken from the system package.<br>
Something like this:<br>
<font face="Default Monospace,Courier
New,Courier,monospace">./configure
--prefix=/software/slurm/slurm-17.11.5
--exec-prefix=/software/slurm/Gnu
--with-pmix=/software/pmix --with-munge=/usr
--sysconfdir=/software/slurm/etc</font><br>
<br>
One of the nodes is also the control host and runs both
slurmctld and slurmd (but the issue is there also if
this is not the case). I start daemons manually at the
moment (slurmctld first).<br>
My configuration file looks like this (I removed the
node-specific parts):<br>
<br>
<font face="Default Monospace,Courier
New,Courier,monospace">SlurmdUser=root<br>
#<br>
AuthType=auth/munge<br>
# Epilog=/usr/local/slurm/etc/epilog<br>
FastSchedule=1<br>
JobCompLoc=/var/log/slurm/slurm.job.log<br>
JobCompType=jobcomp/filetxt<br>
JobCredentialPrivateKey=/usr/local/etc/slurm.key<br>
JobCredentialPublicCertificate=/usr/local/etc/slurm.cert<br>
#PluginDir=/usr/local/slurm/lib/slurm<br>
# Prolog=/usr/local/slurm/etc/prolog<br>
SchedulerType=sched/backfill<br>
SelectType=select/linear<br>
SlurmUser=cadmin # this user exists everywhere<br>
SlurmctldPort=7002<br>
SlurmctldTimeout=300<br>
SlurmdPort=7003<br>
SlurmdTimeout=300<br>
SwitchType=switch/none<br>
TreeWidth=50<br>
#<br>
# logging<br>
StateSaveLocation=/var/log/slurm/tmp<br>
SlurmdSpoolDir=/var/log/slurm/tmp/slurmd.%n.spool<br>
SlurmctldPidFile=/var/log/slurm/var/run/slurmctld.pid<br>
SlurmdPidFile=/var/log/slurm/var/run/slurmd.%n.pid<br>
SlurmctldLogFile=/var/log/slurm/slurmctld.log<br>
SlurmdLogFile=/var/log/slurm/slurmd.%n.log.%h<br>
#<br>
# job settings<br>
MaxTasksPerNode=64<br>
MpiDefault=pmix_v2<br>
<br>
# plugins<br>
TaskPlugin=task/cgroup</font><br>
<br>
<br>
There are no prolog or epilog scripts.<br>
After some fiddling with MPI, I got the system to work
with interactive jobs through salloc (MPI behaves
correctly for jobs occupying one or all of the nodes).
sinfo produces expected results.<br>
However, as soon as I try to submit through sbatch, I get
an instantaneous segfault regardless of the executable
(even when none is specified, i.e., when the srun
command does nothing).<br>
<br>
When I try to monitor slurmd in the foreground (-vvvv
-D), I get something like this:<br>
<br>
<font face="Default Monospace,Courier
New,Courier,monospace">slurmd: debug: Log file
re-opened<br>
slurmd: debug2: hwloc_topology_init<br>
slurmd: debug2: hwloc_topology_load<br>
slurmd: debug: CPUs:64 Boards:1 Sockets:2
CoresPerSocket:16 ThreadsPerCore:2<br>
slurmd: Message aggregation disabled<br>
slurmd: topology NONE plugin loaded<br>
slurmd: route default plugin loaded<br>
slurmd: CPU frequency setting not configured for this
node<br>
slurmd: debug: Resource spec: No specialized cores
configured by default on this node<br>
slurmd: debug: Resource spec: Reserved system memory
limit not configured for this node<br>
slurmd: debug: Reading cgroup.conf file
/software/slurm/etc/cgroup.conf<br>
slurmd: debug2: hwloc_topology_init<br>
slurmd: debug2: hwloc_topology_load<br>
slurmd: debug: CPUs:64 Boards:1 Sockets:2
CoresPerSocket:16 ThreadsPerCore:2<br>
slurmd: debug: Reading cgroup.conf file
/software/slurm/etc/cgroup.conf<br>
slurmd: debug: task/cgroup: loaded<br>
slurmd: debug: Munge authentication plugin loaded<br>
slurmd: debug: spank: opening plugin stack
/software/slurm/etc/plugstack.conf<br>
slurmd: Munge cryptographic signature plugin loaded<br>
slurmd: slurmd version 17.11.5 started<br>
slurmd: debug: Job accounting gather NOT_INVOKED
plugin loaded<br>
slurmd: debug: job_container none plugin loaded<br>
slurmd: debug: switch NONE plugin loaded<br>
slurmd: slurmd started on Mon, 07 May 2018 23:54:31
+0200<br>
slurmd: CPUs=64 Boards=1 Sockets=2 Cores=16 Threads=2
Memory=64062 TmpDisk=187611 Uptime=1827335
CPUSpecList=(null) FeaturesAvail=(null)
FeaturesActive=(null)<br>
slurmd: debug: AcctGatherEnergy NONE plugin loaded<br>
slurmd: debug: AcctGatherProfile NONE plugin loaded<br>
slurmd: debug: AcctGatherInterconnect NONE plugin
loaded<br>
slurmd: debug: AcctGatherFilesystem NONE plugin
loaded<br>
slurmd: debug2: No acct_gather.conf file
(/software/slurm/etc/acct_gather.conf)<br>
slurmd: debug2: got this type of message 4005<br>
slurmd: debug2: Processing RPC:
REQUEST_BATCH_JOB_LAUNCH<br>
slurmd: debug2: _group_cache_lookup_internal: no entry
found for andreas<br>
slurmd: _run_prolog: run job script took usec=5<br>
slurmd: _run_prolog: prolog with lock for job 100 ran
for 0 seconds<br>
slurmd: Launching batch job 100 for UID 1003<br>
slurmd: debug2: got this type of message 6011<br>
slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB<br>
slurmd: debug: _rpc_terminate_job, uid = 1001<br>
slurmd: debug: credential for job 100 revoked<br>
slurmd: debug2: No steps in jobid 100 to send signal
999<br>
slurmd: debug2: No steps in jobid 100 to send signal
18<br>
slurmd: debug2: No steps in jobid 100 to send signal
15<br>
slurmd: debug2: set revoke expiration for jobid 100 to
1525730207 UTS<br>
slurmd: debug2: got this type of message 1008<br>
</font><br>
Here, job 100 would be a submission script with
something like:<br>
<br>
<font face="Default Monospace,Courier
New,Courier,monospace">#!/bin/bash -l<br>
#SBATCH --job-name=FSPMXX<br>
#SBATCH --output=/storage/andreas/camp3.out<br>
#SBATCH --error=/storage/andreas/camp3.err<br>
#SBATCH --nodes=1<br>
#SBATCH --cpus-per-task=1 --tasks-per-node=32
--ntasks-per-core=1<br>
######## #SBATCH -pccm<br>
<br>
srun</font><br>
<br>
This produces in camp3.err:<br>
<font face="Default Monospace,Courier
New,Courier,monospace"><br>
/var/log/slurm/tmp/slurmd.stromboli001.spool/job00101/slurm_script: line
9: 144905 Segmentation fault (core dumped) srun</font><br>
<br>
I tried to recompile pmix and slurm with debug options,
but I cannot seem to get any more information than this.<br>
<br>
I don't think the MPI integration itself can be broken,
as jobs run through salloc+srun work fine.<br>
<br>
My understanding of the inner workings of slurm is
virtually nonexistent, so I'll be grateful for any clue
you may offer.<br>
<br>
Andreas (UZH, Switzerland)<br>
</font></div>
</div>
</font>
</blockquote>
<br>
</body>
</html>