[slurm-users] GUI application crash on first allocation, but runs fine on second allocation

Brian Andrus toomuchit at gmail.com
Tue Jun 9 19:01:03 UTC 2020


Sounds like a race condition where slurmd is starting before the node is 
truly ready.

You can try adding systemd dependencies to the slurmd unit so that it does 
not start until the other services the node needs are running.
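
As a rough sketch of what that could look like: a systemd drop-in can order 
slurmd after another unit. I am assuming here that the unit is named 
slurmd.service; write_slurmd_dropin is a made-up helper name, and 
network-online.target is only a placeholder for whichever unit actually has 
to be up on your nodes.

```shell
#!/bin/bash
# Sketch: write a systemd drop-in that delays slurmd until another unit is up.
# ASSUMPTION: write_slurmd_dropin is a hypothetical helper, and the unit passed
# in (e.g. network-online.target) is a placeholder for the real dependency.
write_slurmd_dropin() {
    local dropin_dir="$1"   # normally /etc/systemd/system/slurmd.service.d
    local needed_unit="$2"  # the unit slurmd should wait for
    mkdir -p "$dropin_dir" || return 1
    # After= sets the start ordering; Wants= pulls the unit in if it is
    # not already part of the boot transaction.
    cat > "$dropin_dir/wait-for-node.conf" <<EOF
[Unit]
After=$needed_unit
Wants=$needed_unit
EOF
}
```

On the compute node image, as root, this would be used roughly as:

    write_slurmd_dropin /etc/systemd/system/slurmd.service.d network-online.target
    systemctl daemon-reload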


The benefits of systemd :)
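
On where to look for errors: the backtrace ends in QMessageLogger::fatal, 
and Qt normally writes that message to stderr, so capturing the stderr of 
the job step may show the actual reason. A small pre-flight check along 
these lines might also help narrow it down. check_x11 is a hypothetical 
helper, not something shipped with Slurm or Qt, and it only verifies that 
DISPLAY was exported into the step.

```shell
#!/bin/bash
# Sketch of a pre-flight check to run inside the job step before the GUI.
# ASSUMPTION: check_x11 is a hypothetical helper name.
check_x11() {
    # Without DISPLAY, an X11 Qt application has no X server to connect to.
    if [ -z "${DISPLAY:-}" ]; then
        echo "DISPLAY is not set" >&2
        return 1
    fi
    echo "DISPLAY=$DISPLAY"
}
```

For example, in a wrapper around run_lsf.sh (QT_DEBUG_PLUGINS=1 makes Qt 
print its platform plugin loading diagnostics; qt-startup.log is an 
arbitrary file name):

    check_x11 && QT_DEBUG_PLUGINS=1 run_lsf.sh 2> qt-startup.log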


Brian Andrus


On 6/9/2020 10:53 AM, Dumont, Joey wrote:
>
> Hi,
>
>
> I am encountering a weird issue, and I'm not sure where it is coming from.
>
>
> I have set up a Slurm-based cluster using AWS ParallelCluster. I have 
> tweaked the Slurm configuration to enable X forwarding by setting 
> PrologFlags=X11. The ParallelCluster portion is relevant, as basically 
> every time a user queues a job, a brand new compute node is 
> provisioned and added to the default queue. Users want to run a 
> Qt5-based GUI application. To run it, they issue something like:
>
>
>     salloc --nodes=1 --ntasks=1 --cpus-per-task=48 --x11=all srun run_lsf.sh
>
>
> However, if there are no nodes available, a new one is provisioned and 
> the job is run on the new node. Every time this job is the first job 
> on the compute node, the application crashes. If I issue the exact 
> same command a second time (it usually gets allocated to the same 
> node), then it runs without any issues. I was able to retrieve this 
> from the core dump:
>
>     (gdb) bt
>     #0  0x00007fffdfced337 in raise () from /lib64/libc.so.6
>     #1  0x00007fffdfceea28 in abort () from /lib64/libc.so.6
>     #2  0x00007fffe2e699db in QMessageLogger::fatal(char const*, ...) const ()
>         from /shared/opt/spack/opt/spack/linux-centos7-cascadelake/gcc-9.2.0/lumerical-2020a-r5-mt7ihfs2o3wfpxrn2ciw2oqfoqvo34dl/opt/lumerical/2020a/bin/../lib/libQt5Core.so.5
>     #3  0x00007fffe44ce28b in QGuiApplicationPrivate::createPlatformIntegration() ()
>         from /shared/opt/spack/opt/spack/linux-centos7-cascadelake/gcc-9.2.0/lumerical-2020a-r5-mt7ihfs2o3wfpxrn2ciw2oqfoqvo34dl/opt/lumerical/2020a/bin/../lib/libQt5Gui.so.5
>     #4  0x00007fffe44ce72d in QGuiApplicationPrivate::createEventDispatcher() ()
>         from /shared/opt/spack/opt/spack/linux-centos7-cascadelake/gcc-9.2.0/lumerical-2020a-r5-mt7ihfs2o3wfpxrn2ciw2oqfoqvo34dl/opt/lumerical/2020a/bin/../lib/libQt5Gui.so.5
>     #5  0x00007fffe30579f5 in QCoreApplicationPrivate::init() ()
>         from /shared/opt/spack/opt/spack/linux-centos7-cascadelake/gcc-9.2.0/lumerical-2020a-r5-mt7ihfs2o3wfpxrn2ciw2oqfoqvo34dl/opt/lumerical/2020a/bin/../lib/libQt5Core.so.5
>     #6  0x00007fffe44cfcec in QGuiApplicationPrivate::init() ()
>         from /shared/opt/spack/opt/spack/linux-centos7-cascadelake/gcc-9.2.0/lumerical-2020a-r5-mt7ihfs2o3wfpxrn2ciw2oqfoqvo34dl/opt/lumerical/2020a/bin/../lib/libQt5Gui.so.5
>     #7  0x00007fffe4cfcca9 in QApplicationPrivate::init() ()
>         from /shared/opt/spack/opt/spack/linux-centos7-cascadelake/gcc-9.2.0/lumerical-2020a-r5-mt7ihfs2o3wfpxrn2ciw2oqfoqvo34dl/opt/lumerical/2020a/bin/../lib/libQt5Widgets.so.5
>     #8  0x0000000001f17345 in ?? ()
>     #9  0x00000000005286bb in ?? ()
>     #10 0x00007fffdfcd9505 in __libc_start_main () from /lib64/libc.so.6
>     #11 0x0000000000522201 in ?? ()
>     #12 0x00007fffffff3928 in ?? ()
>     #13 0x000000000000001c in ?? ()
>     #14 0x0000000000000004 in ?? ()
>     #15 0x00007fffffff3c5e in ?? ()
>     #16 0x00007fffffff3cfd in ?? ()
>     #17 0x00007fffffff3d01 in ?? ()
>     #18 0x00007fffffff3d06 in ?? ()
>     #19 0x0000000000000000 in ?? ()
>
>
> So it seems that the Qt5 application cannot initialize, possibly due 
> to the X server not being ready? I tried adding a delay before 
> starting the GUI application, but that didn't seem to help.
>
> Do you have any idea of where to look for relevant errors? 
> /var/log/messages indicates that the app crashed, without any 
> additional information.
>
> The nodes are running on CentOS 7.
>
> Let me know if additional info is needed.
>
> Cheers,
>
> Joey Dumont
>
> Technical Advisor, Knowledge, Information, and Technology Services
> National Research Council Canada / Government of Canada
> joey.dumont at nrc-cnrc.gc.ca / Tel: 613-990-8152 / Cell: 438-340-7436