[slurm-users] slurmstepd crash 18.03 when using pmi2 interface

Matthieu Hautreux matthieu.hautreux at gmail.com
Fri Nov 2 06:15:13 MDT 2018


This may be due to this commit:
https://github.com/SchedMD/slurm/commit/ee2813870fed48827aa0ec99e1b4baeaca710755

It seems that the behavior was changed from a fatal error to something
different when constraining devices is enabled in cgroup.conf without the
proper conf file.

If you do not really need to constrain devices, then remove
ConstrainDevices=yes from cgroup.conf.
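
To check quickly whether a given cgroup.conf enables device constraining,
a small Python sketch along these lines may help. It is only an
illustration: the function name constrain_devices_enabled and the sample
file contents are made up for this example, and it assumes that the last
uncommented ConstrainDevices line is the one that counts and that an
absent line means "no" (the documented default).

```python
import re


def constrain_devices_enabled(cgroup_conf_text: str) -> bool:
    """Return True if the text of a cgroup.conf sets ConstrainDevices=yes.

    Assumption: the last uncommented ConstrainDevices line wins, and a
    missing line means "no".
    """
    enabled = False
    for raw in cgroup_conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        match = re.fullmatch(
            r"ConstrainDevices\s*=\s*(\S+)", line, flags=re.IGNORECASE
        )
        if match:
            enabled = match.group(1).lower() == "yes"
    return enabled


# Hypothetical file contents, not taken from the cluster in this thread.
sample = """\
CgroupAutomount=yes
ConstrainCores=yes
ConstrainDevices=yes   # the setting discussed above
"""

print(constrain_devices_enabled(sample))
```

In practice you would pass in the contents of the cgroup.conf that
slurmd actually reads on the compute node; where that file lives depends
on how Slurm was packaged.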

Regards

On 1 Nov 2018 6:51 PM, "Bas van der Vlies" <bas.vandervlies at surfsara.nl>
wrote:

OK, if we change:
  * TaskPlugin=task/affinity,task/cgroup

to:
  * TaskPlugin=task/affinity

then the pmi2 interface works. I am investigating this further.


On 31/10/2018 08:26, Bas van der Vlies wrote:
> I am busy migrating from Torque/Moab to Slurm.
>
> I have installed Slurm 18.03 and am trying to run an MPI program with the
> pmi2 interface.
>
> {{{
> ~/mpitest> srun --mpi=list
> srun: MPI types are...
> srun: none
> srun: openmpi
> srun: pmi2
> }}}
>
> The none and openmpi interfaces work, but the pmi2 interface crashes
> slurmstepd. Have I missed some setting, or is this a bug?
>
> {{{
> (gdb) thread apply all bt
>
> Thread 6 (Thread 0x2b9ce9b8b700 (LWP 21945)):
> #0  pthread_cond_timedwait@@GLIBC_2.3.2 () at
> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
> #1  0x00002b9ce5c7862b in ?? () from
> /usr/lib/x86_64-linux-gnu/slurm/libslurmfull.so
> #2  0x00002b9ce6c08494 in start_thread (arg=0x2b9ce9b8b700) at
> pthread_create.c:333
> #3  0x00002b9ce6f06acf in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
>
> Thread 5 (Thread 0x2b9ce9c8c700 (LWP 21946)):
> #0  0x00002b9ce6efd67d in poll () at ../sysdeps/unix/syscall-template.S:84
> #1  0x00002b9ce5d16cfb in slurm_eio_handle_mainloop () from
> /usr/lib/x86_64-linux-gnu/slurm/libslurmfull.so
> #2  0x00005631c29f69f6 in ?? ()
> #3  0x00002b9ce6c08494 in start_thread (arg=0x2b9ce9c8c700) at
> pthread_create.c:333
> #4  0x00002b9ce6f06acf in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
>
> Thread 4 (Thread 0x2b9ceaedb700 (LWP 21948)):
> #0  0x00002b9ce6efd67d in poll () at ../sysdeps/unix/syscall-template.S:84
> #1  0x00002b9cea2a8f52 in ?? () from
> /usr/lib/x86_64-linux-gnu/slurm//task_cgroup.so
> #2  0x00002b9ce6c08494 in start_thread (arg=0x2b9ceaedb700) at
> pthread_create.c:333
> #3  0x00002b9ce6f06acf in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
>
> Thread 3 (Thread 0x2b9ceadda700 (LWP 21947)):
> #0  0x00002b9ce6efd67d in poll () at ../sysdeps/unix/syscall-template.S:84
> #1  0x00002b9ce5d16cfb in slurm_eio_handle_mainloop () from
> /usr/lib/x86_64-linux-gnu/slurm/libslurmfull.so
> #2  0x00002b9ceaac7355 in ?? () from
> /usr/lib/x86_64-linux-gnu/slurm//mpi_pmi2.so
> #3  0x00002b9ce6c08494 in start_thread (arg=0x2b9ceadda700) at
> pthread_create.c:333
> #4  0x00002b9ce6f06acf in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
>
> Thread 2 (Thread 0x2b9ce5ae0700 (LWP 21944)):
> #0  pthread_cond_wait@@GLIBC_2.3.2 () at
> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> #1  0x00002b9ce5c7e65d in ?? () from
> /usr/lib/x86_64-linux-gnu/slurm/libslurmfull.so
> #2  0x00002b9ce6c08494 in start_thread (arg=0x2b9ce5ae0700) at
> pthread_create.c:333
> #3  0x00002b9ce6f06acf in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
>
> Thread 1 (Thread 0x2b9ce59dd080 (LWP 21943)):
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> #1  0x00002b9ce6e5242a in __GI_abort () at abort.c:89
> #2  0x00002b9ce6e8ec00 in __libc_message (do_abort=do_abort@entry=2,
> fmt=fmt@entry=0x2b9ce6f83d98 "*** Error in `%s': %s: 0x%s ***\n")
>      at ../sysdeps/posix/libc_fatal.c:175
> #3  0x00002b9ce6e94fc6 in malloc_printerr (action=3, str=0x2b9ce6f8094a
> "free(): invalid pointer", ptr=<optimized out>,
>      ar_ptr=<optimized out>) at malloc.c:5049
> #4  0x00002b9ce6e9580e in _int_free (av=0x2b9ce71b7b00 <main_arena>,
> p=0x2b9ce71bba60 <lock>, have_lock=0) at malloc.c:3905
> #5  0x00002b9ce5d1084d in slurm_xfree () from
> /usr/lib/x86_64-linux-gnu/slurm/libslurmfull.so
> #6  0x00002b9cea2ab0b0 in task_cgroup_devices_create () from
> /usr/lib/x86_64-linux-gnu/slurm//task_cgroup.so
> #7  0x00002b9cea2a5977 in task_p_pre_setuid () from
> /usr/lib/x86_64-linux-gnu/slurm//task_cgroup.so
> #8  0x00005631c2a04216 in task_g_pre_setuid ()
> #9  0x00005631c29e713d in ?? ()
> #10 0x00005631c29ec3f4 in job_manager ()
> #11 0x00005631c29e9374 in main ()
> }}}
>
>
>

-- 
Bas van der Vlies
| Operations, Support & Development | SURFsara | Science Park 140 | 1098 XG  Amsterdam
| T +31 (0) 20 800 1300  | bas.vandervlies at surfsara.nl | www.surfsara.nl |