<font face="Default Sans Serif,Verdana,Arial,Helvetica,sans-serif" size="2">Hi Benjamin,<br><br>Thanks for getting back to me! I somehow failed to ever arrive at this page.<br><br>Andreas<br><br><font color="#990099">-----"slurm-users" <<a href="mailto:slurm-users-bounces@lists.schedmd.com" target="_blank">slurm-users-bounces@lists.schedmd.com</a>> wrote: -----</font><div class="iNotesHistory" style="padding-left:5px;"><div style="padding-right:0px;padding-left:5px;border-left:solid black 2px;">To: <a href="mailto:slurm-users@lists.schedmd.com" target="_blank">slurm-users@lists.schedmd.com</a><br>From: Benjamin Matthews <ben@kc2vjw.com><br>Sent by: "slurm-users" <slurm-users-bounces@lists.schedmd.com><br>Date: 05/09/2018 01:20AM<br>Subject: Re: [slurm-users] srun seg faults immediately from within sbatch but not salloc<br><br>I think this should already be fixed in the upcoming release. 
See: <a class="moz-txt-link-freetext" href="https://github.com/SchedMD/slurm/commit/947bccd2c5c7344e6d09dab565e2cc6663eb9e72">https://github.com/SchedMD/slurm/commit/947bccd2c5c7344e6d09dab565e2cc6663eb9e72</a><br><br><div class="moz-cite-prefix">On 5/8/18 12:08 PM, <a class="moz-txt-link-abbreviated" href="mailto:a.vitalis@bioc.uzh.ch">a.vitalis@bioc.uzh.ch</a> wrote:<br></div><blockquote type="cite" cite="mid:OF73A183AA.26C2E4A8-ONC1258287.00623F1B-C1258287.0063A082@lotus.uzh.ch"><font size="2" face="Default Sans Serif,Verdana,Arial,Helvetica,sans-serif">Dear all,<br><br>I tried to debug this with some apparent success (for now).<br><br>If anyone cares:<br>With the help of gdb inside sbatch, I tracked down the immediate seg fault to strcmp.<br>I then hacked src/srun/srun.c with some info statements and isolated this function as the culprit:<br><font face="Default Monospace,Courier New,Courier,monospace">static void _setup_env_working_cluster(void)<br><br></font>With my configuration, this routine ended up performing a strcmp of two NULL pointers, which seg-faults on our system (and is undefined behavior in C, I would think). My current understanding is that this is a slurm bug.<br><br>The issue is rectifiable by simply giving the cluster a name in slurm.conf (e.g., ClusterName=bla). 
I am not using slurmdbd btw.<br><br>Hope this helps,<br>Andreas<br><br><br><font color="#990099">-----"slurm-users" <<a href="mailto:slurm-users-bounces@lists.schedmd.com" target="_blank" moz-do-not-send="true">slurm-users-bounces@lists.schedmd.com</a>> wrote: -----</font><div class="iNotesHistory" style="padding-left:5px;"><div style="padding-right:0px;padding-left:5px;border-left:solid black 2px;">To: <a href="mailto:slurm-users@lists.schedmd.com" target="_blank" moz-do-not-send="true">slurm-users@lists.schedmd.com</a><br>From: <a href="mailto:a.vitalis@bioc.uzh.ch" target="_blank" moz-do-not-send="true">a.vitalis@bioc.uzh.ch</a><br>Sent by: "slurm-users" &lt;slurm-users-bounces@lists.schedmd.com&gt;<br>Date: 05/08/2018 12:44AM<br>Subject: [slurm-users] srun seg faults immediately from within sbatch but not salloc<br><br><font size="2" face="Default Sans Serif,Verdana,Arial,Helvetica,sans-serif">Dear all,<br><br>I am trying to set up a small cluster running slurm on Ubuntu 16.04.<br>I installed slurm-17.11.5 along with pmix-2.1.1 on an NFS-shared partition. Installation seems fine. 
Munge is                 taken from the system package.<br>                 Something like this:<br>                 <font face="Default Monospace,Courier                  New,Courier,monospace">./configure                   --prefix=/software/slurm/slurm-17.11.5                   --exec-prefix=/software/slurm/Gnu                   --with-pmix=/software/pmix --with-munge=/usr                   --sysconfdir=/software/slurm/etc</font><br>                 <br>                 One of the nodes is also the control host and runs both                 slurmctld and slurmd (but the issue is there also if                 this is not the case). I start daemons manually at the                 moment (slurmctld first).<br>                 My configuration file looks like this (I removed the                 node-specific parts):<br>                 <br>                 <font face="Default Monospace,Courier                  New,Courier,monospace">SlurmdUser=root<br>                   #<br>                   AuthType=auth/munge<br>                   # Epilog=/usr/local/slurm/etc/epilog<br>                   FastSchedule=1<br>                   JobCompLoc=/var/log/slurm/slurm.job.log<br>                   JobCompType=jobcomp/filetxt<br>                   JobCredentialPrivateKey=/usr/local/etc/slurm.key<br> JobCredentialPublicCertificate=/usr/local/etc/slurm.cert<br>                   #PluginDir=/usr/local/slurm/lib/slurm<br>                   # Prolog=/usr/local/slurm/etc/prolog<br>                   SchedulerType=sched/backfill<br>                   SelectType=select/linear<br>                   SlurmUser=cadmin # this user exists everywhere<br>                   SlurmctldPort=7002<br>                   SlurmctldTimeout=300<br>                   SlurmdPort=7003<br>                   SlurmdTimeout=300<br>                   SwitchType=switch/none<br>                   TreeWidth=50<br>                   #<br>                   # logging<br>                   
StateSaveLocation=/var/log/slurm/tmp<br>                   SlurmdSpoolDir=/var/log/slurm/tmp/slurmd.%n.spool<br>                   SlurmctldPidFile=/var/log/slurm/var/run/slurmctld.pid<br>                   SlurmdPidFile=/var/log/slurm/var/run/slurmd.%n.pid<br>                   SlurmctldLogFile=/var/log/slurm/slurmctld.log<br>                   SlurmdLogFile=/var/log/slurm/slurmd.%n.log.%h<br>                   #<br>                   # job settings<br>                   MaxTasksPerNode=64<br>                   MpiDefault=pmix_v2<br>                   <br>                   # plugins<br>                   TaskPlugin=task/cgroup</font><br>                 <br>                 <br>                 There are no prolog or epilog scripts.<br>                 After some fiddling with MPI, I got the system to work                 with interactive jobs through salloc (MPI behaves                 correctly for jobs occupying one or all of the nodes).                 sinfo produces expected results.<br>                 However, as soon as I try to submit through sbatch I get                 an instantaneous seg fault regardless of executable                 (even when there is none specified, i.e., the srun                 command is meaningless).<br>                 <br>                 When I try to monitor slurmd in the foreground (-vvvv                 -D), I get something like this:<br>                 <br>                 <font face="Default Monospace,Courier                  New,Courier,monospace">slurmd: debug:  Log file                   re-opened<br>                   slurmd: debug2: hwloc_topology_init<br>                   slurmd: debug2: hwloc_topology_load<br>                   slurmd: debug:  CPUs:64 Boards:1 Sockets:2                   CoresPerSocket:16 ThreadsPerCore:2<br>                   slurmd: Message aggregation disabled<br>                   slurmd: topology NONE plugin loaded<br>                   slurmd: route default plugin loaded<br>              
     slurmd: CPU frequency setting not configured for this                   node<br>                   slurmd: debug:  Resource spec: No specialized cores                   configured by default on this node<br>                   slurmd: debug:  Resource spec: Reserved system memory                   limit not configured for this node<br>                   slurmd: debug:  Reading cgroup.conf file                   /software/slurm/etc/cgroup.conf<br>                   slurmd: debug2: hwloc_topology_init<br>                   slurmd: debug2: hwloc_topology_load<br>                   slurmd: debug:  CPUs:64 Boards:1 Sockets:2                   CoresPerSocket:16 ThreadsPerCore:2<br>                   slurmd: debug:  Reading cgroup.conf file                   /software/slurm/etc/cgroup.conf<br>                   slurmd: debug:  task/cgroup: loaded<br>                   slurmd: debug:  Munge authentication plugin loaded<br>                   slurmd: debug:  spank: opening plugin stack                   /software/slurm/etc/plugstack.conf<br>                   slurmd: Munge cryptographic signature plugin loaded<br>                   slurmd: slurmd version 17.11.5 started<br>                   slurmd: debug:  Job accounting gather NOT_INVOKED                   plugin loaded<br>                   slurmd: debug:  job_container none plugin loaded<br>                   slurmd: debug:  switch NONE plugin loaded<br>                   slurmd: slurmd started on Mon, 07 May 2018 23:54:31                   +0200<br>                   slurmd: CPUs=64 Boards=1 Sockets=2 Cores=16 Threads=2                   Memory=64062 TmpDisk=187611 Uptime=1827335                   CPUSpecList=(null) FeaturesAvail=(null)                   FeaturesActive=(null)<br>                   slurmd: debug:  AcctGatherEnergy NONE plugin loaded<br>                   slurmd: debug:  AcctGatherProfile NONE plugin loaded<br>                   slurmd: debug:  AcctGatherInterconnect NONE plugin                   
loaded<br>                   slurmd: debug:  AcctGatherFilesystem NONE plugin                   loaded<br>                   slurmd: debug2: No acct_gather.conf file                   (/software/slurm/etc/acct_gather.conf)<br>                   slurmd: debug2: got this type of message 4005<br>                   slurmd: debug2: Processing RPC:                   REQUEST_BATCH_JOB_LAUNCH<br>                   slurmd: debug2: _group_cache_lookup_internal: no entry                   found for andreas<br>                   slurmd: _run_prolog: run job script took usec=5<br>                   slurmd: _run_prolog: prolog with lock for job 100 ran                   for 0 seconds<br>                   slurmd: Launching batch job 100 for UID 1003<br>                   slurmd: debug2: got this type of message 6011<br>                   slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB<br>                   slurmd: debug:  _rpc_terminate_job, uid = 1001<br>                   slurmd: debug:  credential for job 100 revoked<br>                   slurmd: debug2: No steps in jobid 100 to send signal                   999<br>                   slurmd: debug2: No steps in jobid 100 to send signal                   18<br>                   slurmd: debug2: No steps in jobid 100 to send signal                   15<br>                   slurmd: debug2: set revoke expiration for jobid 100 to                   1525730207 UTS<br>                   slurmd: debug2: got this type of message 1008<br>                 </font><br>                 Here, job 100 would be a submission script with                 something like:<br>                 <br>                 <font face="Default Monospace,Courier                  New,Courier,monospace">#!/bin/bash -l<br>                   #SBATCH --job-name=FSPMXX<br>                   #SBATCH --output=/storage/andreas/camp3.out<br>                   #SBATCH --error=/storage/andreas/camp3.err<br>                   #SBATCH --nodes=1<br>                   
#SBATCH --cpus-per-task=1 --tasks-per-node=32 --ntasks-per-core=1<br>######## #SBATCH -pccm<br><br>srun</font><br><br>This produces in camp3.err:<br><font face="Default Monospace,Courier New,Courier,monospace"><br>/var/log/slurm/tmp/slurmd.stromboli001.spool/job00101/slurm_script: line 9: 144905 Segmentation fault (core dumped) srun</font><br><br>I tried to recompile pmix and slurm with debug options, but I cannot seem to get any more information than this.<br><br>I don't think the MPI integration can be broken per se, as jobs run through salloc+srun seem to work fine.<br><br>My understanding of the inner workings of slurm is virtually nonexistent, so I'll be grateful for any clue you may offer.<br><br>Andreas (UZH, Switzerland)<br></font></div></div></font></blockquote><br></slurm-users-bounces@lists.schedmd.com></ben@kc2vjw.com></div></div><div></div></font>