[slurm-users] srun seg faults immediately from within sbatch but not salloc
a.vitalis at bioc.uzh.ch
Tue May 8 12:08:11 MDT 2018
Dear all,
I tried to debug this with some apparent success (for now).
If anyone cares:
With the help of gdb inside an sbatch job, I tracked the immediate seg fault down to a strcmp call.
I then instrumented src/srun/srun.c with some info statements and isolated this function as the culprit:
static void _setup_env_working_cluster(void)
With my configuration, this routine ended up calling strcmp() on two NULL pointers, which seg-faults on our system (and is not language-compliant, I would think: passing NULL to strcmp() is undefined behavior in C). My current understanding is that this is a Slurm bug.
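To make the failure mode concrete, here is a minimal, self-contained sketch (my own illustration, not the actual Slurm source; the variable names are hypothetical). On glibc, strcmp() on a NULL pointer typically dereferences it and seg-faults; a NULL-tolerant wrapper (Slurm's xstring utilities appear to provide one as xstrcmp(), if I read the source correctly) avoids the crash:

#include <stdio.h>
#include <string.h>

/* NULL-safe comparison: two NULLs compare equal, and NULL sorts
 * before any real string. */
static int null_safe_strcmp(const char *a, const char *b)
{
    if (a == b)         /* covers the NULL/NULL case */
        return 0;
    if (a == NULL)
        return -1;
    if (b == NULL)
        return 1;
    return strcmp(a, b);
}

int main(void)
{
    /* Both strings can end up NULL when ClusterName is unset in
     * slurm.conf (hypothetical names, for illustration only). */
    const char *conf_cluster = NULL;
    const char *env_cluster = NULL;

    /* A plain strcmp(conf_cluster, env_cluster) would be undefined
     * behavior here and seg-faults on glibc. */
    printf("%d\n", null_safe_strcmp(conf_cluster, env_cluster)); /* prints 0 */
    return 0;
}

Compiled with a plain gcc invocation this prints 0; swapping the wrapper for a bare strcmp() reproduces the kind of immediate seg fault described above.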
The issue can be worked around by simply giving the cluster a name in slurm.conf (e.g., ClusterName=bla). I am not using slurmdbd, by the way.
Hope this helps,
Andreas
-----"slurm-users" <slurm-users-bounces at lists.schedmd.com> wrote: -----
To: slurm-users at lists.schedmd.com
From: a.vitalis at bioc.uzh.ch
Sent by: "slurm-users"
Date: 05/08/2018 12:44AM
Subject: [slurm-users] srun seg faults immediately from within sbatch but not salloc
Dear all,
I am trying to set up a small cluster running Slurm on Ubuntu 16.04.
I installed slurm-17.11.5 along with pmix-2.1.1 on an NFS-shared partition. The installation itself seems fine. Munge comes from the system package.
The configure line was something like this:
./configure --prefix=/software/slurm/slurm-17.11.5 --exec-prefix=/software/slurm/Gnu --with-pmix=/software/pmix --with-munge=/usr --sysconfdir=/software/slurm/etc
One of the nodes also acts as the control host and runs both slurmctld and slurmd (but the issue is present even when this is not the case). I start the daemons manually at the moment (slurmctld first).
My configuration file looks like this (I removed the node-specific parts):
SlurmdUser=root
#
AuthType=auth/munge
# Epilog=/usr/local/slurm/etc/epilog
FastSchedule=1
JobCompLoc=/var/log/slurm/slurm.job.log
JobCompType=jobcomp/filetxt
JobCredentialPrivateKey=/usr/local/etc/slurm.key
JobCredentialPublicCertificate=/usr/local/etc/slurm.cert
#PluginDir=/usr/local/slurm/lib/slurm
# Prolog=/usr/local/slurm/etc/prolog
SchedulerType=sched/backfill
SelectType=select/linear
SlurmUser=cadmin # this user exists everywhere
SlurmctldPort=7002
SlurmctldTimeout=300
SlurmdPort=7003
SlurmdTimeout=300
SwitchType=switch/none
TreeWidth=50
#
# logging
StateSaveLocation=/var/log/slurm/tmp
SlurmdSpoolDir=/var/log/slurm/tmp/slurmd.%n.spool
SlurmctldPidFile=/var/log/slurm/var/run/slurmctld.pid
SlurmdPidFile=/var/log/slurm/var/run/slurmd.%n.pid
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.%n.log.%h
#
# job settings
MaxTasksPerNode=64
MpiDefault=pmix_v2
# plugins
TaskPlugin=task/cgroup
There are no prolog or epilog scripts.
After some fiddling with MPI, I got the system to work with interactive jobs through salloc (MPI behaves correctly for jobs occupying one or all of the nodes), and sinfo produces the expected results.
However, as soon as I submit through sbatch, I get an instantaneous seg fault regardless of the executable (even when none is specified, i.e., when the srun command is meaningless).
When I run slurmd in the foreground (-D -vvvv), I get something like this:
slurmd: debug: Log file re-opened
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug: CPUs:64 Boards:1 Sockets:2 CoresPerSocket:16 ThreadsPerCore:2
slurmd: Message aggregation disabled
slurmd: topology NONE plugin loaded
slurmd: route default plugin loaded
slurmd: CPU frequency setting not configured for this node
slurmd: debug: Resource spec: No specialized cores configured by default on this node
slurmd: debug: Resource spec: Reserved system memory limit not configured for this node
slurmd: debug: Reading cgroup.conf file /software/slurm/etc/cgroup.conf
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug: CPUs:64 Boards:1 Sockets:2 CoresPerSocket:16 ThreadsPerCore:2
slurmd: debug: Reading cgroup.conf file /software/slurm/etc/cgroup.conf
slurmd: debug: task/cgroup: loaded
slurmd: debug: Munge authentication plugin loaded
slurmd: debug: spank: opening plugin stack /software/slurm/etc/plugstack.conf
slurmd: Munge cryptographic signature plugin loaded
slurmd: slurmd version 17.11.5 started
slurmd: debug: Job accounting gather NOT_INVOKED plugin loaded
slurmd: debug: job_container none plugin loaded
slurmd: debug: switch NONE plugin loaded
slurmd: slurmd started on Mon, 07 May 2018 23:54:31 +0200
slurmd: CPUs=64 Boards=1 Sockets=2 Cores=16 Threads=2 Memory=64062 TmpDisk=187611 Uptime=1827335 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
slurmd: debug: AcctGatherEnergy NONE plugin loaded
slurmd: debug: AcctGatherProfile NONE plugin loaded
slurmd: debug: AcctGatherInterconnect NONE plugin loaded
slurmd: debug: AcctGatherFilesystem NONE plugin loaded
slurmd: debug2: No acct_gather.conf file (/software/slurm/etc/acct_gather.conf)
slurmd: debug2: got this type of message 4005
slurmd: debug2: Processing RPC: REQUEST_BATCH_JOB_LAUNCH
slurmd: debug2: _group_cache_lookup_internal: no entry found for andreas
slurmd: _run_prolog: run job script took usec=5
slurmd: _run_prolog: prolog with lock for job 100 ran for 0 seconds
slurmd: Launching batch job 100 for UID 1003
slurmd: debug2: got this type of message 6011
slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB
slurmd: debug: _rpc_terminate_job, uid = 1001
slurmd: debug: credential for job 100 revoked
slurmd: debug2: No steps in jobid 100 to send signal 999
slurmd: debug2: No steps in jobid 100 to send signal 18
slurmd: debug2: No steps in jobid 100 to send signal 15
slurmd: debug2: set revoke expiration for jobid 100 to 1525730207 UTS
slurmd: debug2: got this type of message 1008
Here, job 100 corresponds to a submission script along these lines:
#!/bin/bash -l
#SBATCH --job-name=FSPMXX
#SBATCH --output=/storage/andreas/camp3.out
#SBATCH --error=/storage/andreas/camp3.err
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1 --tasks-per-node=32 --ntasks-per-core=1
######## #SBATCH -pccm
srun
This produces in camp3.err:
/var/log/slurm/tmp/slurmd.stromboli001.spool/job00101/slurm_script: line 9: 144905 Segmentation fault (core dumped) srun
I tried to recompile pmix and slurm with debug options, but I cannot seem to get any more information than this.
I don't think the MPI integration itself can be broken, as jobs run through salloc+srun work fine.
My understanding of the inner workings of slurm is virtually nonexistent, so I'll be grateful for any clue you may offer.
Andreas (UZH, Switzerland)