[slurm-users] srun seg faults immediately from within sbatch but not salloc

Benjamin Matthews ben at kc2vjw.com
Tue May 8 17:09:48 MDT 2018


I think this should already be fixed in the upcoming release. See:
https://github.com/SchedMD/slurm/commit/947bccd2c5c7344e6d09dab565e2cc6663eb9e72

On 5/8/18 12:08 PM, a.vitalis at bioc.uzh.ch wrote:
> Dear all,
>
> I tried to debug this with some apparent success (for now).
>
> If anyone cares:
> With the help of gdb inside an sbatch job, I tracked the immediate
> seg fault down to a call to strcmp.
> I then instrumented src/srun/srun.c with some info statements and
> isolated this function as the culprit:
> static void _setup_env_working_cluster(void)
>
> With my configuration, this routine ended up calling strcmp on two
> NULL pointers, which seg-faults on our system (and is undefined
> behavior in C, as far as I can tell). My current understanding is
> that this is a slurm bug.
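>
> For illustration, here is a minimal, self-contained sketch of the
> failure mode (not Slurm's actual code; the function name and the way
> the two strings are obtained are made up):
>
> #include <stdio.h>
> #include <string.h>
>
> /* Hypothetical stand-in for the comparison in
>  * _setup_env_working_cluster(); with no ClusterName set in
>  * slurm.conf, both names can end up NULL. */
> static int cluster_name_match(const char *a, const char *b)
> {
>     /* Passing NULL to strcmp() is undefined behavior in C and
>      * crashes on most platforms, so guard first. */
>     if (a == NULL || b == NULL)
>         return a == b; /* treat two NULLs as equal */
>     return strcmp(a, b) == 0;
> }
>
> int main(void)
> {
>     const char *conf_name = NULL; /* ClusterName unset */
>     const char *env_name  = NULL; /* no working-cluster value */
>
>     /* A bare strcmp(conf_name, env_name) here would seg-fault;
>      * the guarded comparison returns "equal" instead. */
>     printf("match: %d\n", cluster_name_match(conf_name, env_name));
>     return 0;
> }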
>
> The issue can be worked around by simply giving the cluster a name
> in slurm.conf (e.g., ClusterName=bla). I am not using slurmdbd, by
> the way.
>
> Hope this helps,
> Andreas
>
>
> -----"slurm-users" <slurm-users-bounces at lists.schedmd.com
> <mailto:slurm-users-bounces at lists.schedmd.com>> wrote: -----
> To: slurm-users at lists.schedmd.com <mailto:slurm-users at lists.schedmd.com>
> From: a.vitalis at bioc.uzh.ch <mailto:a.vitalis at bioc.uzh.ch>
> Sent by: "slurm-users"
> Date: 05/08/2018 12:44AM
> Subject: [slurm-users] srun seg faults immediately from within sbatch
> but not salloc
>
> Dear all,
>
> I am trying to set up a small cluster running slurm on Ubuntu 16.04.
> I installed slurm-17.11.5 along with pmix-2.1.1 on an NFS-shared
> partition. The installation itself seems fine. Munge comes from the
> system package.
> The configure invocation was something like this:
> ./configure --prefix=/software/slurm/slurm-17.11.5
> --exec-prefix=/software/slurm/Gnu --with-pmix=/software/pmix
> --with-munge=/usr --sysconfdir=/software/slurm/etc
>
> One of the nodes is also the control host and runs both slurmctld
> and slurmd (but the issue persists even when this is not the case).
> For now I start the daemons manually (slurmctld first).
> My configuration file looks like this (I removed the node-specific parts):
>
> SlurmdUser=root
> #
> AuthType=auth/munge
> # Epilog=/usr/local/slurm/etc/epilog
> FastSchedule=1
> JobCompLoc=/var/log/slurm/slurm.job.log
> JobCompType=jobcomp/filetxt
> JobCredentialPrivateKey=/usr/local/etc/slurm.key
> JobCredentialPublicCertificate=/usr/local/etc/slurm.cert
> #PluginDir=/usr/local/slurm/lib/slurm
> # Prolog=/usr/local/slurm/etc/prolog
> SchedulerType=sched/backfill
> SelectType=select/linear
> SlurmUser=cadmin # this user exists everywhere
> SlurmctldPort=7002
> SlurmctldTimeout=300
> SlurmdPort=7003
> SlurmdTimeout=300
> SwitchType=switch/none
> TreeWidth=50
> #
> # logging
> StateSaveLocation=/var/log/slurm/tmp
> SlurmdSpoolDir=/var/log/slurm/tmp/slurmd.%n.spool
> SlurmctldPidFile=/var/log/slurm/var/run/slurmctld.pid
> SlurmdPidFile=/var/log/slurm/var/run/slurmd.%n.pid
> SlurmctldLogFile=/var/log/slurm/slurmctld.log
> SlurmdLogFile=/var/log/slurm/slurmd.%n.log.%h
> #
> # job settings
> MaxTasksPerNode=64
> MpiDefault=pmix_v2
>
> # plugins
> TaskPlugin=task/cgroup
>
>
> There are no prolog or epilog scripts.
> After some fiddling with MPI, I got the system to work with
> interactive jobs through salloc (MPI behaves correctly for jobs
> occupying one or all of the nodes). sinfo produces expected results.
> However, as soon as I submit through sbatch, I get an instantaneous
> seg fault regardless of the executable (even when none is specified,
> i.e., the srun command is meaningless).
>
> When I run slurmd in the foreground (-vvvv -D), I get
> something like this:
>
> slurmd: debug:  Log file re-opened
> slurmd: debug2: hwloc_topology_init
> slurmd: debug2: hwloc_topology_load
> slurmd: debug:  CPUs:64 Boards:1 Sockets:2 CoresPerSocket:16
> ThreadsPerCore:2
> slurmd: Message aggregation disabled
> slurmd: topology NONE plugin loaded
> slurmd: route default plugin loaded
> slurmd: CPU frequency setting not configured for this node
> slurmd: debug:  Resource spec: No specialized cores configured by
> default on this node
> slurmd: debug:  Resource spec: Reserved system memory limit not
> configured for this node
> slurmd: debug:  Reading cgroup.conf file /software/slurm/etc/cgroup.conf
> slurmd: debug2: hwloc_topology_init
> slurmd: debug2: hwloc_topology_load
> slurmd: debug:  CPUs:64 Boards:1 Sockets:2 CoresPerSocket:16
> ThreadsPerCore:2
> slurmd: debug:  Reading cgroup.conf file /software/slurm/etc/cgroup.conf
> slurmd: debug:  task/cgroup: loaded
> slurmd: debug:  Munge authentication plugin loaded
> slurmd: debug:  spank: opening plugin stack
> /software/slurm/etc/plugstack.conf
> slurmd: Munge cryptographic signature plugin loaded
> slurmd: slurmd version 17.11.5 started
> slurmd: debug:  Job accounting gather NOT_INVOKED plugin loaded
> slurmd: debug:  job_container none plugin loaded
> slurmd: debug:  switch NONE plugin loaded
> slurmd: slurmd started on Mon, 07 May 2018 23:54:31 +0200
> slurmd: CPUs=64 Boards=1 Sockets=2 Cores=16 Threads=2 Memory=64062
> TmpDisk=187611 Uptime=1827335 CPUSpecList=(null) FeaturesAvail=(null)
> FeaturesActive=(null)
> slurmd: debug:  AcctGatherEnergy NONE plugin loaded
> slurmd: debug:  AcctGatherProfile NONE plugin loaded
> slurmd: debug:  AcctGatherInterconnect NONE plugin loaded
> slurmd: debug:  AcctGatherFilesystem NONE plugin loaded
> slurmd: debug2: No acct_gather.conf file
> (/software/slurm/etc/acct_gather.conf)
> slurmd: debug2: got this type of message 4005
> slurmd: debug2: Processing RPC: REQUEST_BATCH_JOB_LAUNCH
> slurmd: debug2: _group_cache_lookup_internal: no entry found for andreas
> slurmd: _run_prolog: run job script took usec=5
> slurmd: _run_prolog: prolog with lock for job 100 ran for 0 seconds
> slurmd: Launching batch job 100 for UID 1003
> slurmd: debug2: got this type of message 6011
> slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB
> slurmd: debug:  _rpc_terminate_job, uid = 1001
> slurmd: debug:  credential for job 100 revoked
> slurmd: debug2: No steps in jobid 100 to send signal 999
> slurmd: debug2: No steps in jobid 100 to send signal 18
> slurmd: debug2: No steps in jobid 100 to send signal 15
> slurmd: debug2: set revoke expiration for jobid 100 to 1525730207 UTS
> slurmd: debug2: got this type of message 1008
>
> Here, job 100 is a submission script along these lines:
>
> #!/bin/bash -l
> #SBATCH --job-name=FSPMXX
> #SBATCH --output=/storage/andreas/camp3.out
> #SBATCH --error=/storage/andreas/camp3.err
> #SBATCH --nodes=1
> #SBATCH --cpus-per-task=1 --tasks-per-node=32 --ntasks-per-core=1
> ######## #SBATCH -pccm
>
> srun
>
> This produces in camp3.err:
>
> /var/log/slurm/tmp/slurmd.stromboli001.spool/job00101/slurm_script:
> line 9: 144905 Segmentation fault      (core dumped) srun
>
> I tried to recompile pmix and slurm with debug options, but I cannot
> seem to get any more information than this.
>
> I don't think the MPI integration can be broken per se, as jobs run
> through salloc+srun seem to work fine.
>
> My understanding of the inner workings of slurm is virtually
> nonexistent, so I'll be grateful for any clue you may offer.
>
> Andreas (UZH, Switzerland)
