[slurm-users] GUI application crash on first allocation, but runs fine on second allocation
Dumont, Joey
Joey.Dumont at nrc-cnrc.gc.ca
Tue Jun 9 17:53:09 UTC 2020
Hi,
I am encountering a weird issue, and I'm not sure where it is coming from.
I have set up a Slurm-based cluster using AWS ParallelCluster, and I have tweaked the Slurm configuration to enable X forwarding by setting PrologFlags=X11. The ParallelCluster part is relevant because essentially every time a user queues a job, a brand-new compute node is provisioned and added to the default queue. Users want to run a Qt5-based GUI application. To run it, they issue something like:
salloc --nodes=1 --ntasks=1 --cpus-per-task=48 --x11=all srun run_lsf.sh
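(For reference, the X forwarding tweak is a single line added to the slurm.conf that ParallelCluster generates; nothing else was changed for X11. The scontrol line below is just a quick way to confirm the setting is active on a node:

# relevant line added to slurm.conf
PrologFlags=X11

# sanity check on a running node
scontrol show config | grep -i PrologFlags
)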
However, if there are no nodes available, a new one is provisioned and the job runs on the new node. Every time this job is the first job on a freshly provisioned compute node, the application crashes. If I issue the exact same command a second time (it usually gets allocated to the same node), it runs without any issues. I was able to retrieve this from the core dump:
(gdb) bt
#0 0x00007fffdfced337 in raise () from /lib64/libc.so.6
#1 0x00007fffdfceea28 in abort () from /lib64/libc.so.6
#2 0x00007fffe2e699db in QMessageLogger::fatal(char const*, ...) const ()
from /shared/opt/spack/opt/spack/linux-centos7-cascadelake/gcc-9.2.0/lumerical-2020a-r5-mt7ihfs2o3wfpxrn2ciw2oqfoqvo34dl/opt/lumerical/2020a/bin/../lib/libQt5Core.so.5
#3 0x00007fffe44ce28b in QGuiApplicationPrivate::createPlatformIntegration() ()
from /shared/opt/spack/opt/spack/linux-centos7-cascadelake/gcc-9.2.0/lumerical-2020a-r5-mt7ihfs2o3wfpxrn2ciw2oqfoqvo34dl/opt/lumerical/2020a/bin/../lib/libQt5Gui.so.5
#4 0x00007fffe44ce72d in QGuiApplicationPrivate::createEventDispatcher() ()
from /shared/opt/spack/opt/spack/linux-centos7-cascadelake/gcc-9.2.0/lumerical-2020a-r5-mt7ihfs2o3wfpxrn2ciw2oqfoqvo34dl/opt/lumerical/2020a/bin/../lib/libQt5Gui.so.5
#5 0x00007fffe30579f5 in QCoreApplicationPrivate::init() ()
from /shared/opt/spack/opt/spack/linux-centos7-cascadelake/gcc-9.2.0/lumerical-2020a-r5-mt7ihfs2o3wfpxrn2ciw2oqfoqvo34dl/opt/lumerical/2020a/bin/../lib/libQt5Core.so.5
#6 0x00007fffe44cfcec in QGuiApplicationPrivate::init() ()
from /shared/opt/spack/opt/spack/linux-centos7-cascadelake/gcc-9.2.0/lumerical-2020a-r5-mt7ihfs2o3wfpxrn2ciw2oqfoqvo34dl/opt/lumerical/2020a/bin/../lib/libQt5Gui.so.5
#7 0x00007fffe4cfcca9 in QApplicationPrivate::init() ()
from /shared/opt/spack/opt/spack/linux-centos7-cascadelake/gcc-9.2.0/lumerical-2020a-r5-mt7ihfs2o3wfpxrn2ciw2oqfoqvo34dl/opt/lumerical/2020a/bin/../lib/libQt5Widgets.so.5
#8 0x0000000001f17345 in ?? ()
#9 0x00000000005286bb in ?? ()
#10 0x00007fffdfcd9505 in __libc_start_main () from /lib64/libc.so.6
#11 0x0000000000522201 in ?? ()
#12 0x00007fffffff3928 in ?? ()
#13 0x000000000000001c in ?? ()
#14 0x0000000000000004 in ?? ()
#15 0x00007fffffff3c5e in ?? ()
#16 0x00007fffffff3cfd in ?? ()
#17 0x00007fffffff3d01 in ?? ()
#18 0x00007fffffff3d06 in ?? ()
#19 0x0000000000000000 in ?? ()
So it seems that the Qt5 application cannot initialize, possibly because the X server is not ready? I tried adding a delay before starting the GUI application, but that didn't seem to help.
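Concretely, the delay test amounted to a wrapper along these lines (a sketch: the sleep length is arbitrary, and exec "$@" stands in for the actual Lumerical launch line in run_lsf.sh):

#!/bin/bash
# run_lsf.sh (sketch): pause in case the X11 tunnel / xauth cookie is not ready on a fresh node
sleep 30
# the real script launches the Qt5 GUI here; exec "$@" is only a stand-in
exec "$@"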
Do you have any idea where to look for relevant errors? /var/log/messages indicates that the app crashed, but gives no additional information.
The nodes are running on CentOS 7.
Let me know if additional info is needed.
Cheers,
Joey Dumont
Technical Advisor, Knowledge, Information, and Technology Services
National Research Council Canada / Government of Canada
joey.dumont at nrc-cnrc.gc.ca / Tel: 613-990-8152 / Cell: 438-340-7436