<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    I think this should already be fixed in the upcoming release. See:
<a class="moz-txt-link-freetext" href="https://github.com/SchedMD/slurm/commit/947bccd2c5c7344e6d09dab565e2cc6663eb9e72">https://github.com/SchedMD/slurm/commit/947bccd2c5c7344e6d09dab565e2cc6663eb9e72</a><br>
    <br>
    <div class="moz-cite-prefix">On 5/8/18 12:08 PM,
      <a class="moz-txt-link-abbreviated" href="mailto:a.vitalis@bioc.uzh.ch">a.vitalis@bioc.uzh.ch</a> wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:OF73A183AA.26C2E4A8-ONC1258287.00623F1B-C1258287.0063A082@lotus.uzh.ch"><font
        face="Default Sans Serif,Verdana,Arial,Helvetica,sans-serif"
        size="2">Dear all,<br>
        <br>
        I tried to debug this with some apparent success (for now).<br>
        <br>
        If anyone cares:<br>
        With the help of gdb inside an sbatch job, I tracked the
        immediate seg fault down to strcmp().<br>
        I then instrumented src/srun/srun.c with some info statements and
        isolated this function as the culprit:<br>
        <font face="Default Monospace,Courier New,Courier,monospace">static
          void _setup_env_working_cluster(void)<br>
          <br>
        </font>With my configuration, this routine ended up calling
        strcmp() on two NULL pointers, which seg-faults on our system
        (passing NULL to strcmp() is undefined behavior in C, as far as
        I can tell). My current understanding is that this is a slurm
        bug.<br>
        <br>
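        For illustration only, here is a minimal sketch (not the actual
        slurm code, and safe_strcmp is just a made-up name) of the kind
        of NULL guard that avoids the crash; strcmp() itself requires
        both arguments to be non-NULL:<br>
        <br>
        <font face="Default Monospace,Courier New,Courier,monospace">#include &lt;string.h&gt;<br>
          <br>
          /* Minimal sketch, not slurm source: a NULL-tolerant comparison. */<br>
          /* Plain strcmp(NULL, NULL) is undefined behavior. */<br>
          static int safe_strcmp(const char *a, const char *b)<br>
          {<br>
          &nbsp;&nbsp;&nbsp;&nbsp;if (!a &amp;&amp; !b)<br>
          &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return 0; /* treat two unset strings as equal */<br>
          &nbsp;&nbsp;&nbsp;&nbsp;if (!a || !b)<br>
          &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return a ? 1 : -1; /* only one side is set */<br>
          &nbsp;&nbsp;&nbsp;&nbsp;return strcmp(a, b); /* both non-NULL: safe to call strcmp */<br>
          }</font><br>
        <br>
        My guess is that setting ClusterName (the workaround below)
        simply makes the compared strings non-NULL, so the unguarded
        strcmp never sees two NULL pointers.<br>
        <br>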
        The issue can be worked around by simply giving the cluster a
        name in slurm.conf (e.g., ClusterName=bla). I am not using
        slurmdbd, by the way.<br>
        <br>
        Hope this helps,<br>
        Andreas<br>
        <br>
        <br>
        <font color="#990099">-----"slurm-users" <<a
            href="mailto:slurm-users-bounces@lists.schedmd.com"
            target="_blank" moz-do-not-send="true">slurm-users-bounces@lists.schedmd.com</a>>
          wrote: -----</font>
        <div class="iNotesHistory" style="padding-left:5px;">
          <div
            style="padding-right:0px;padding-left:5px;border-left:solid
            black 2px;">To: <a
              href="mailto:slurm-users@lists.schedmd.com"
              target="_blank" moz-do-not-send="true">slurm-users@lists.schedmd.com</a><br>
            From: <a href="mailto:a.vitalis@bioc.uzh.ch"
              target="_blank" moz-do-not-send="true">a.vitalis@bioc.uzh.ch</a><br>
            Sent by: "slurm-users" <slurm-users-bounces@lists.schedmd.com><br>
              Date: 05/08/2018 12:44AM<br>
              Subject: [slurm-users] srun seg faults immediately from
              within sbatch but not salloc<br>
              <br>
              <font face="Default Sans
                Serif,Verdana,Arial,Helvetica,sans-serif" size="2">Dear
                all,<br>
                <br>
                I am trying to set up a small cluster running slurm on
                Ubuntu 16.04.<br>
                I installed slurm-17.11.5 along with pmix-2.1.1 on an
                NFS-shared partition. Installation seems fine. Munge is
                taken from the system package.<br>
                The configure line was something like this:<br>
                <font face="Default Monospace,Courier
                  New,Courier,monospace">./configure
                  --prefix=/software/slurm/slurm-17.11.5
                  --exec-prefix=/software/slurm/Gnu
                  --with-pmix=/software/pmix --with-munge=/usr
                  --sysconfdir=/software/slurm/etc</font><br>
                <br>
                One of the nodes is also the control host and runs both
                slurmctld and slurmd (but the issue is present even when
                this is not the case). I start the daemons manually at
                the moment (slurmctld first).<br>
                My configuration file looks like this (I removed the
                node-specific parts):<br>
                <br>
                <font face="Default Monospace,Courier
                  New,Courier,monospace">SlurmdUser=root<br>
                  #<br>
                  AuthType=auth/munge<br>
                  # Epilog=/usr/local/slurm/etc/epilog<br>
                  FastSchedule=1<br>
                  JobCompLoc=/var/log/slurm/slurm.job.log<br>
                  JobCompType=jobcomp/filetxt<br>
                  JobCredentialPrivateKey=/usr/local/etc/slurm.key<br>
JobCredentialPublicCertificate=/usr/local/etc/slurm.cert<br>
                  #PluginDir=/usr/local/slurm/lib/slurm<br>
                  # Prolog=/usr/local/slurm/etc/prolog<br>
                  SchedulerType=sched/backfill<br>
                  SelectType=select/linear<br>
                  SlurmUser=cadmin # this user exists everywhere<br>
                  SlurmctldPort=7002<br>
                  SlurmctldTimeout=300<br>
                  SlurmdPort=7003<br>
                  SlurmdTimeout=300<br>
                  SwitchType=switch/none<br>
                  TreeWidth=50<br>
                  #<br>
                  # logging<br>
                  StateSaveLocation=/var/log/slurm/tmp<br>
                  SlurmdSpoolDir=/var/log/slurm/tmp/slurmd.%n.spool<br>
                  SlurmctldPidFile=/var/log/slurm/var/run/slurmctld.pid<br>
                  SlurmdPidFile=/var/log/slurm/var/run/slurmd.%n.pid<br>
                  SlurmctldLogFile=/var/log/slurm/slurmctld.log<br>
                  SlurmdLogFile=/var/log/slurm/slurmd.%n.log.%h<br>
                  #<br>
                  # job settings<br>
                  MaxTasksPerNode=64<br>
                  MpiDefault=pmix_v2<br>
                  <br>
                  # plugins<br>
                  TaskPlugin=task/cgroup</font><br>
                <br>
                <br>
                There are no prolog or epilog scripts.<br>
                After some fiddling with MPI, I got the system to work
                with interactive jobs through salloc (MPI behaves
                correctly for jobs occupying one or all of the nodes).
                sinfo produces expected results.<br>
                However, as soon as I submit through sbatch, I get an
                instantaneous seg fault regardless of the executable
                (even when none is specified, i.e., when the srun
                command itself does nothing).<br>
                <br>
                When I try to monitor slurmd in the foreground (-vvvv
                -D), I get something like this:<br>
                <br>
                <font face="Default Monospace,Courier
                  New,Courier,monospace">slurmd: debug:  Log file
                  re-opened<br>
                  slurmd: debug2: hwloc_topology_init<br>
                  slurmd: debug2: hwloc_topology_load<br>
                  slurmd: debug:  CPUs:64 Boards:1 Sockets:2
                  CoresPerSocket:16 ThreadsPerCore:2<br>
                  slurmd: Message aggregation disabled<br>
                  slurmd: topology NONE plugin loaded<br>
                  slurmd: route default plugin loaded<br>
                  slurmd: CPU frequency setting not configured for this
                  node<br>
                  slurmd: debug:  Resource spec: No specialized cores
                  configured by default on this node<br>
                  slurmd: debug:  Resource spec: Reserved system memory
                  limit not configured for this node<br>
                  slurmd: debug:  Reading cgroup.conf file
                  /software/slurm/etc/cgroup.conf<br>
                  slurmd: debug2: hwloc_topology_init<br>
                  slurmd: debug2: hwloc_topology_load<br>
                  slurmd: debug:  CPUs:64 Boards:1 Sockets:2
                  CoresPerSocket:16 ThreadsPerCore:2<br>
                  slurmd: debug:  Reading cgroup.conf file
                  /software/slurm/etc/cgroup.conf<br>
                  slurmd: debug:  task/cgroup: loaded<br>
                  slurmd: debug:  Munge authentication plugin loaded<br>
                  slurmd: debug:  spank: opening plugin stack
                  /software/slurm/etc/plugstack.conf<br>
                  slurmd: Munge cryptographic signature plugin loaded<br>
                  slurmd: slurmd version 17.11.5 started<br>
                  slurmd: debug:  Job accounting gather NOT_INVOKED
                  plugin loaded<br>
                  slurmd: debug:  job_container none plugin loaded<br>
                  slurmd: debug:  switch NONE plugin loaded<br>
                  slurmd: slurmd started on Mon, 07 May 2018 23:54:31
                  +0200<br>
                  slurmd: CPUs=64 Boards=1 Sockets=2 Cores=16 Threads=2
                  Memory=64062 TmpDisk=187611 Uptime=1827335
                  CPUSpecList=(null) FeaturesAvail=(null)
                  FeaturesActive=(null)<br>
                  slurmd: debug:  AcctGatherEnergy NONE plugin loaded<br>
                  slurmd: debug:  AcctGatherProfile NONE plugin loaded<br>
                  slurmd: debug:  AcctGatherInterconnect NONE plugin
                  loaded<br>
                  slurmd: debug:  AcctGatherFilesystem NONE plugin
                  loaded<br>
                  slurmd: debug2: No acct_gather.conf file
                  (/software/slurm/etc/acct_gather.conf)<br>
                  slurmd: debug2: got this type of message 4005<br>
                  slurmd: debug2: Processing RPC:
                  REQUEST_BATCH_JOB_LAUNCH<br>
                  slurmd: debug2: _group_cache_lookup_internal: no entry
                  found for andreas<br>
                  slurmd: _run_prolog: run job script took usec=5<br>
                  slurmd: _run_prolog: prolog with lock for job 100 ran
                  for 0 seconds<br>
                  slurmd: Launching batch job 100 for UID 1003<br>
                  slurmd: debug2: got this type of message 6011<br>
                  slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB<br>
                  slurmd: debug:  _rpc_terminate_job, uid = 1001<br>
                  slurmd: debug:  credential for job 100 revoked<br>
                  slurmd: debug2: No steps in jobid 100 to send signal
                  999<br>
                  slurmd: debug2: No steps in jobid 100 to send signal
                  18<br>
                  slurmd: debug2: No steps in jobid 100 to send signal
                  15<br>
                  slurmd: debug2: set revoke expiration for jobid 100 to
                  1525730207 UTS<br>
                  slurmd: debug2: got this type of message 1008<br>
                </font><br>
                Here, job 100 is a submission script containing
                something like this:<br>
                <br>
                <font face="Default Monospace,Courier
                  New,Courier,monospace">#!/bin/bash -l<br>
                  #SBATCH --job-name=FSPMXX<br>
                  #SBATCH --output=/storage/andreas/camp3.out<br>
                  #SBATCH --error=/storage/andreas/camp3.err<br>
                  #SBATCH --nodes=1<br>
                  #SBATCH --cpus-per-task=1 --tasks-per-node=32
                  --ntasks-per-core=1<br>
                  ######## #SBATCH -pccm<br>
                  <br>
                  srun</font><br>
                <br>
                This produces the following in camp3.err:<br>
                <font face="Default Monospace,Courier
                  New,Courier,monospace"><br>
/var/log/slurm/tmp/slurmd.stromboli001.spool/job00101/slurm_script: line
                  9: 144905 Segmentation fault      (core dumped) srun</font><br>
                <br>
                I tried to recompile pmix and slurm with debug options,
                but I cannot seem to get any more information than this.<br>
                <br>
                I don't think the MPI integration itself is broken,
                since jobs run through salloc+srun seem to work fine.<br>
                <br>
                My understanding of the inner workings of slurm is
                virtually nonexistent, so I'll be grateful for any clue
                you may offer.<br>
                <br>
                Andreas (UZH, Switzerland)<br>
              </font></div>
        </div>
      </font>
    </blockquote>
    <br>
  </body>
</html>