<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
I think this should already be fixed in the upcoming release. See:
<a class="moz-txt-link-freetext" href="https://github.com/SchedMD/slurm/commit/947bccd2c5c7344e6d09dab565e2cc6663eb9e72">https://github.com/SchedMD/slurm/commit/947bccd2c5c7344e6d09dab565e2cc6663eb9e72</a><br>
<br>
<div class="moz-cite-prefix">On 5/8/18 12:08 PM,
<a class="moz-txt-link-abbreviated" href="mailto:a.vitalis@bioc.uzh.ch">a.vitalis@bioc.uzh.ch</a> wrote:<br>
</div>
<blockquote type="cite"
cite="mid:OF73A183AA.26C2E4A8-ONC1258287.00623F1B-C1258287.0063A082@lotus.uzh.ch"><font
face="Default Sans Serif,Verdana,Arial,Helvetica,sans-serif"
size="2">Dear all,<br>
<br>
I tried to debug this with some apparent success (for now).<br>
<br>
If anyone cares:<br>
With the help of gdb attached to sbatch, I tracked the immediate
segfault down to strcmp.<br>
I then instrumented src/srun/srun.c with some info statements and
isolated this function as the culprit:<br>
<font face="Default Monospace,Courier New,Courier,monospace">static
void _setup_env_working_cluster(void)<br>
<br>
</font>With my configuration, this routine ended up calling
strcmp on two NULL pointers, which segfaults on our system
(passing NULL to strcmp is undefined behavior in C, so this is
not a compliant use of the function). My current understanding
is that this is a Slurm bug.<br>
<br>
The issue can be worked around by simply giving the cluster a
name in slurm.conf (e.g., ClusterName=bla). I am not using slurmdbd, by the way.<br>
<br>
Hope this helps,<br>
Andreas<br>
<br>
<br>
<font color="#990099">-----"slurm-users" &lt;<a
href="mailto:slurm-users-bounces@lists.schedmd.com"
target="_blank" moz-do-not-send="true">slurm-users-bounces@lists.schedmd.com</a>&gt;
wrote: -----</font>
<div class="iNotesHistory" style="padding-left:5px;">
<div
style="padding-right:0px;padding-left:5px;border-left:solid
black 2px;">To: <a
href="mailto:slurm-users@lists.schedmd.com"
target="_blank" moz-do-not-send="true">slurm-users@lists.schedmd.com</a><br>
From: <a href="mailto:a.vitalis@bioc.uzh.ch"
target="_blank" moz-do-not-send="true">a.vitalis@bioc.uzh.ch</a><br>
Sent by: "slurm-users" &lt;slurm-users-bounces@lists.schedmd.com&gt;<br>
Date: 05/08/2018 12:44AM<br>
Subject: [slurm-users] srun seg faults immediately from
within sbatch but not salloc<br>
<br>
<font face="Default Sans
Serif,Verdana,Arial,Helvetica,sans-serif" size="2">Dear
all,<br>
<br>
I am trying to set up a small cluster running slurm on
Ubuntu 16.04.<br>
I installed slurm-17.11.5 along with pmix-2.1.1 on an
NFS-shared partition. Installation seems fine. Munge is
taken from the system package.<br>
Something like this:<br>
<font face="Default Monospace,Courier
New,Courier,monospace">./configure
--prefix=/software/slurm/slurm-17.11.5
--exec-prefix=/software/slurm/Gnu
--with-pmix=/software/pmix --with-munge=/usr
--sysconfdir=/software/slurm/etc</font><br>
<br>
One of the nodes is also the control host and runs both
slurmctld and slurmd (but the issue is there also if
this is not the case). I start daemons manually at the
moment (slurmctld first).<br>
My configuration file looks like this (I removed the
node-specific parts):<br>
<br>
<font face="Default Monospace,Courier
New,Courier,monospace">SlurmdUser=root<br>
#<br>
AuthType=auth/munge<br>
# Epilog=/usr/local/slurm/etc/epilog<br>
FastSchedule=1<br>
JobCompLoc=/var/log/slurm/slurm.job.log<br>
JobCompType=jobcomp/filetxt<br>
JobCredentialPrivateKey=/usr/local/etc/slurm.key<br>
JobCredentialPublicCertificate=/usr/local/etc/slurm.cert<br>
#PluginDir=/usr/local/slurm/lib/slurm<br>
# Prolog=/usr/local/slurm/etc/prolog<br>
SchedulerType=sched/backfill<br>
SelectType=select/linear<br>
SlurmUser=cadmin # this user exists everywhere<br>
SlurmctldPort=7002<br>
SlurmctldTimeout=300<br>
SlurmdPort=7003<br>
SlurmdTimeout=300<br>
SwitchType=switch/none<br>
TreeWidth=50<br>
#<br>
# logging<br>
StateSaveLocation=/var/log/slurm/tmp<br>
SlurmdSpoolDir=/var/log/slurm/tmp/slurmd.%n.spool<br>
SlurmctldPidFile=/var/log/slurm/var/run/slurmctld.pid<br>
SlurmdPidFile=/var/log/slurm/var/run/slurmd.%n.pid<br>
SlurmctldLogFile=/var/log/slurm/slurmctld.log<br>
SlurmdLogFile=/var/log/slurm/slurmd.%n.log.%h<br>
#<br>
# job settings<br>
MaxTasksPerNode=64<br>
MpiDefault=pmix_v2<br>
<br>
# plugins<br>
TaskPlugin=task/cgroup</font><br>
<br>
<br>
There are no prolog or epilog scripts.<br>
After some fiddling with MPI, I got the system to work
with interactive jobs through salloc (MPI behaves
correctly for jobs occupying one or all of the nodes).
sinfo produces expected results.<br>
However, as soon as I try to submit through sbatch, I get
an instantaneous segfault regardless of the executable
(even when none is specified, i.e., when the srun
command does nothing).<br>
<br>
When I try to monitor slurmd in the foreground (-vvvv
-D), I get something like this:<br>
<br>
<font face="Default Monospace,Courier
New,Courier,monospace">slurmd: debug: Log file
re-opened<br>
slurmd: debug2: hwloc_topology_init<br>
slurmd: debug2: hwloc_topology_load<br>
slurmd: debug: CPUs:64 Boards:1 Sockets:2
CoresPerSocket:16 ThreadsPerCore:2<br>
slurmd: Message aggregation disabled<br>
slurmd: topology NONE plugin loaded<br>
slurmd: route default plugin loaded<br>
slurmd: CPU frequency setting not configured for this
node<br>
slurmd: debug: Resource spec: No specialized cores
configured by default on this node<br>
slurmd: debug: Resource spec: Reserved system memory
limit not configured for this node<br>
slurmd: debug: Reading cgroup.conf file
/software/slurm/etc/cgroup.conf<br>
slurmd: debug2: hwloc_topology_init<br>
slurmd: debug2: hwloc_topology_load<br>
slurmd: debug: CPUs:64 Boards:1 Sockets:2
CoresPerSocket:16 ThreadsPerCore:2<br>
slurmd: debug: Reading cgroup.conf file
/software/slurm/etc/cgroup.conf<br>
slurmd: debug: task/cgroup: loaded<br>
slurmd: debug: Munge authentication plugin loaded<br>
slurmd: debug: spank: opening plugin stack
/software/slurm/etc/plugstack.conf<br>
slurmd: Munge cryptographic signature plugin loaded<br>
slurmd: slurmd version 17.11.5 started<br>
slurmd: debug: Job accounting gather NOT_INVOKED
plugin loaded<br>
slurmd: debug: job_container none plugin loaded<br>
slurmd: debug: switch NONE plugin loaded<br>
slurmd: slurmd started on Mon, 07 May 2018 23:54:31
+0200<br>
slurmd: CPUs=64 Boards=1 Sockets=2 Cores=16 Threads=2
Memory=64062 TmpDisk=187611 Uptime=1827335
CPUSpecList=(null) FeaturesAvail=(null)
FeaturesActive=(null)<br>
slurmd: debug: AcctGatherEnergy NONE plugin loaded<br>
slurmd: debug: AcctGatherProfile NONE plugin loaded<br>
slurmd: debug: AcctGatherInterconnect NONE plugin
loaded<br>
slurmd: debug: AcctGatherFilesystem NONE plugin
loaded<br>
slurmd: debug2: No acct_gather.conf file
(/software/slurm/etc/acct_gather.conf)<br>
slurmd: debug2: got this type of message 4005<br>
slurmd: debug2: Processing RPC:
REQUEST_BATCH_JOB_LAUNCH<br>
slurmd: debug2: _group_cache_lookup_internal: no entry
found for andreas<br>
slurmd: _run_prolog: run job script took usec=5<br>
slurmd: _run_prolog: prolog with lock for job 100 ran
for 0 seconds<br>
slurmd: Launching batch job 100 for UID 1003<br>
slurmd: debug2: got this type of message 6011<br>
slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB<br>
slurmd: debug: _rpc_terminate_job, uid = 1001<br>
slurmd: debug: credential for job 100 revoked<br>
slurmd: debug2: No steps in jobid 100 to send signal
999<br>
slurmd: debug2: No steps in jobid 100 to send signal
18<br>
slurmd: debug2: No steps in jobid 100 to send signal
15<br>
slurmd: debug2: set revoke expiration for jobid 100 to
1525730207 UTS<br>
slurmd: debug2: got this type of message 1008<br>
</font><br>
Here, job 100 would be a submission script with
something like:<br>
<br>
<font face="Default Monospace,Courier
New,Courier,monospace">#!/bin/bash -l<br>
#SBATCH --job-name=FSPMXX<br>
#SBATCH --output=/storage/andreas/camp3.out<br>
#SBATCH --error=/storage/andreas/camp3.err<br>
#SBATCH --nodes=1<br>
#SBATCH --cpus-per-task=1 --tasks-per-node=32
--ntasks-per-core=1<br>
######## #SBATCH -pccm<br>
<br>
srun</font><br>
<br>
This produces in camp3.err:<br>
<font face="Default Monospace,Courier
New,Courier,monospace"><br>
/var/log/slurm/tmp/slurmd.stromboli001.spool/job00101/slurm_script: line
9: 144905 Segmentation fault (core dumped) srun</font><br>
<br>
I tried to recompile pmix and slurm with debug options,
but I cannot seem to get any more information than this.<br>
<br>
I don't think the MPI integration itself can be broken,
as jobs run through salloc+srun work fine.<br>
<br>
My understanding of the inner workings of slurm is
virtually nonexistent, so I'll be grateful for any clue
you may offer.<br>
<br>
Andreas (UZH, Switzerland)<br>
</font></div>
</div>
</font>
</blockquote>
<br>
</body>
</html>