[slurm-users] slurmstepd crash 18.03 when using pmi2 interface

Martijn Kruiten martijn.kruiten at surfsara.nl
Fri Nov 2 06:06:11 MDT 2018


We traced the crash to `ConstrainDevices=yes` in cgroup.conf. The
workaround was to add `/dev/*` to cgroup_allowed_devices_file.conf; we
did not have anything there before. We are now looking into which
specific device pmi2 needs.
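
For anyone hitting the same crash, here is a rough sketch of the two
files involved. Only `ConstrainDevices=yes` and the `/dev/*` entry come
from our setup. The `AllowedDevicesFile` line and its path are an
assumption: as far as I know that is the documented default location,
so the line can probably be left out, and your install may use a
different path.

cgroup.conf:

```
# Device confinement via the task/cgroup plugin (this is what triggered the crash for us)
ConstrainDevices=yes
# Assumed path: believed to be the default, adjust to your installation
AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices_file.conf
```

cgroup_allowed_devices_file.conf:

```
/dev/*
```

Note that `/dev/*` allows every device by default, which largely defeats
the purpose of `ConstrainDevices`. It is meant as a stopgap until the
list is narrowed down to the devices that are actually needed.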

Martijn Kruiten

On Thu, 2018-11-01 at 18:48 +0100, Bas van der Vlies wrote:
> OK, if we change:
>   * TaskPlugin=task/affinity,task/cgroup
> 
> to:
>   * TaskPlugin=task/affinity
> 
> the pmi2 interface works. We are investigating this further.
> 
> On 31/10/2018 08:26, Bas van der Vlies wrote:
> > I am busy migrating from Torque/Moab to SLURM.
> > 
> > I have installed slurm 18.03 and am trying to run an MPI program with
> > the pmi2 interface.
> > 
> > {{{
> > ~/mpitest> srun --mpi=list
> > srun: MPI types are...
> > srun: none
> > srun: openmpi
> > srun: pmi2
> > }}}
> > 
> > The none and openmpi interfaces work, but the pmi2 interface crashes
> > slurmstepd. Have I missed some setting, or is this a bug?
> > 
> > {{{
> > (gdb) thread apply all bt
> > 
> > Thread 6 (Thread 0x2b9ce9b8b700 (LWP 21945)):
> > #0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
> > #1  0x00002b9ce5c7862b in ?? () from /usr/lib/x86_64-linux-gnu/slurm/libslurmfull.so
> > #2  0x00002b9ce6c08494 in start_thread (arg=0x2b9ce9b8b700) at pthread_create.c:333
> > #3  0x00002b9ce6f06acf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
> > 
> > Thread 5 (Thread 0x2b9ce9c8c700 (LWP 21946)):
> > #0  0x00002b9ce6efd67d in poll () at ../sysdeps/unix/syscall-template.S:84
> > #1  0x00002b9ce5d16cfb in slurm_eio_handle_mainloop () from /usr/lib/x86_64-linux-gnu/slurm/libslurmfull.so
> > #2  0x00005631c29f69f6 in ?? ()
> > #3  0x00002b9ce6c08494 in start_thread (arg=0x2b9ce9c8c700) at pthread_create.c:333
> > #4  0x00002b9ce6f06acf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
> > 
> > Thread 4 (Thread 0x2b9ceaedb700 (LWP 21948)):
> > #0  0x00002b9ce6efd67d in poll () at ../sysdeps/unix/syscall-template.S:84
> > #1  0x00002b9cea2a8f52 in ?? () from /usr/lib/x86_64-linux-gnu/slurm//task_cgroup.so
> > #2  0x00002b9ce6c08494 in start_thread (arg=0x2b9ceaedb700) at pthread_create.c:333
> > #3  0x00002b9ce6f06acf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
> > 
> > Thread 3 (Thread 0x2b9ceadda700 (LWP 21947)):
> > #0  0x00002b9ce6efd67d in poll () at ../sysdeps/unix/syscall-template.S:84
> > #1  0x00002b9ce5d16cfb in slurm_eio_handle_mainloop () from /usr/lib/x86_64-linux-gnu/slurm/libslurmfull.so
> > #2  0x00002b9ceaac7355 in ?? () from /usr/lib/x86_64-linux-gnu/slurm//mpi_pmi2.so
> > #3  0x00002b9ce6c08494 in start_thread (arg=0x2b9ceadda700) at pthread_create.c:333
> > #4  0x00002b9ce6f06acf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
> > 
> > Thread 2 (Thread 0x2b9ce5ae0700 (LWP 21944)):
> > #0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> > #1  0x00002b9ce5c7e65d in ?? () from /usr/lib/x86_64-linux-gnu/slurm/libslurmfull.so
> > #2  0x00002b9ce6c08494 in start_thread (arg=0x2b9ce5ae0700) at pthread_create.c:333
> > #3  0x00002b9ce6f06acf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
> > 
> > Thread 1 (Thread 0x2b9ce59dd080 (LWP 21943)):
> > #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> > #1  0x00002b9ce6e5242a in __GI_abort () at abort.c:89
> > #2  0x00002b9ce6e8ec00 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x2b9ce6f83d98 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175
> > #3  0x00002b9ce6e94fc6 in malloc_printerr (action=3, str=0x2b9ce6f8094a "free(): invalid pointer", ptr=<optimized out>, ar_ptr=<optimized out>) at malloc.c:5049
> > #4  0x00002b9ce6e9580e in _int_free (av=0x2b9ce71b7b00 <main_arena>, p=0x2b9ce71bba60 <lock>, have_lock=0) at malloc.c:3905
> > #5  0x00002b9ce5d1084d in slurm_xfree () from /usr/lib/x86_64-linux-gnu/slurm/libslurmfull.so
> > #6  0x00002b9cea2ab0b0 in task_cgroup_devices_create () from /usr/lib/x86_64-linux-gnu/slurm//task_cgroup.so
> > #7  0x00002b9cea2a5977 in task_p_pre_setuid () from /usr/lib/x86_64-linux-gnu/slurm//task_cgroup.so
> > #8  0x00005631c2a04216 in task_g_pre_setuid ()
> > #9  0x00005631c29e713d in ?? ()
> > #10 0x00005631c29ec3f4 in job_manager ()
> > #11 0x00005631c29e9374 in main ()
> > }}}
> > 
> 
> -- 
> Bas van der Vlies
> Operations, Support & Development | SURFsara | Science Park 140 | 1098 XG  Amsterdam
> T +31 (0) 20 800 1300  | bas.vandervlies at surfsara.nl | www.surfsara.nl |
-- 
| System Programmer | SURFsara | Science Park 140 | 1098 XG Amsterdam |
| T +31 6 20043417  | martijn.kruiten at surfsara.nl | www.surfsara.nl |