[slurm-users] About x11 support
Tina Friedrich
tina.friedrich at it.ox.ac.uk
Mon Nov 19 03:01:32 MST 2018
Hello,
Two things. First, you don't actually seem to have the '--x11' flag on your
srun command; i.e., does 'srun --x11 --nodelist=compute-0-5 -n 1 -c 6
--mem=8G -A y8 -p RUBY xclock' get you any further?
I had some trouble getting the built-in X forwarding to work, which had
to do with hostnames & xauth magic cookies.
If you do something like
srun --x11 --pty /bin/bash
to just get an interactive session, and then run
xauth list | grep $(hostname)
(note: $(hostname) - not $HOSTNAME - you want the local hostname)
does that find a cookie for your session, i.e. does it print anything?
If it does, you should be good; try running 'xclock' or something from
that session. Needless to say, if you haven't got a magic cookie, it
won't work.
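The cookie check above can be sketched as a small shell test. Note that `has_cookie` and the sample `xauth list` output below are illustrative stand-ins for this sketch, not real commands or real cookies on your cluster; on a node you would pipe the actual `xauth list` output in instead:

```shell
# has_cookie HOST - reads `xauth list`-style output on stdin and succeeds
# if any cookie entry mentions HOST. (Helper name is ours, not xauth's.)
has_cookie() {
    grep -q "$1"
}

# Hypothetical sample of `xauth list` output (display  protocol  cookie).
# On a real node: xauth list | has_cookie "$(hostname)"
sample='compute-0-5/unix:42  MIT-MAGIC-COOKIE-1  d0e1adbeef
rocks7/unix:0  MIT-MAGIC-COOKIE-1  9f3c21cafe'

# The check: is there a cookie for the local hostname?
if printf '%s\n' "$sample" | has_cookie compute-0-5; then
    echo "cookie found - X forwarding should work"
else
    echo "no cookie - X clients will fail to open the display"
fi
```

The point of using `$(hostname)` rather than `$HOSTNAME` is that the environment variable can be inherited from the submission host across srun, while the command returns the name of the node you are actually on.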
Tina
On 17/11/2018 17:24, Mahmood Naderan wrote:
> >What does this command say?
> >scontrol show config | fgrep PrologFlags
>
> [root at rocks7 ~]# scontrol show config | fgrep PrologFlags
> PrologFlags = Alloc,Contain,X11
>
> That means X11 support has been compiled in (when Werner created the
> roll).
>
>>Check your slurmd logs on the compute node. What errors are there?
>
> In one terminal, I run the following command
>
> [mahmood at rocks7 ~]$ srun --nodelist=compute-0-5 -n 1 -c 6 --mem=8G -A y8
> -p RUBY xclock
> Error: Can't open display :1
> srun: error: compute-0-5: task 0: Exited with exit code 1
>
> At the same time, in another terminal I see this
>
> [root at compute-0-5 ~]# tail -f /var/log/slurm/slurmd.log
> [2018-11-17T20:47:23.017] _run_prolog: run job script took usec=4
> [2018-11-17T20:47:23.017] _run_prolog: prolog with lock for job 1580 ran
> for 0 seconds
> [2018-11-17T20:47:23.131] launch task 1580.0 request from UID:1000
> GID:1000 HOST:10.1.1.1 PORT:54950
> [2018-11-17T20:47:23.131] lllp_distribution jobid [1580] implicit auto
> binding: sockets,one_thread, dist 1
> [2018-11-17T20:47:23.131] _task_layout_lllp_cyclic
> [2018-11-17T20:47:23.131] _lllp_generate_cpu_bind jobid [1580]:
> mask_cpu,one_thread, 0x00000070000007
> [2018-11-17T20:47:23.204] [1580.0] task_p_pre_launch: Using
> sched_affinity for tasks
> [2018-11-17T20:47:23.231] [1580.0] done with job
> [2018-11-17T20:47:23.263] [1580.extern] done with job
> ^C
>
>
>
> Also, at the same time, I see this in the frontend log
>
> [root at rocks7 ~]# tail -f /var/log/slurm/slurmctld.log
> [2018-11-17T20:52:10.908] Fairhare priority of job 1582 for user mahmood
> in acct y8 is 0.242424
> [2018-11-17T20:52:10.908] Weighted Age priority is 0.000000 * 10 = 0.00
> [2018-11-17T20:52:10.908] Weighted Fairshare priority is 0.242424 *
> 10000 = 2424.24
> [2018-11-17T20:52:10.908] Weighted JobSize priority is 0.097756 * 100 = 9.78
> [2018-11-17T20:52:10.908] Weighted Partition priority is 0.001000 *
> 10000 = 10.00
> [2018-11-17T20:52:10.908] Weighted QOS priority is 0.000000 * 0 = 0.00
> [2018-11-17T20:52:10.908] Weighted TRES:cpu is 0.041667 * 2000.00 = 83.33
> [2018-11-17T20:52:10.908] Weighted TRES:mem is 0.031884 * 1.00 = 0.03
> [2018-11-17T20:52:10.908] Job 1582 priority: 0.00 + 2424.24 + 9.78 +
> 10.00 + 0.00 + 83 - 0 = 2527.38
> [2018-11-17T20:52:10.909] BillingWeight: JobId=1582 is either new or it
> was resized
> [2018-11-17T20:52:10.909] sched: _slurm_rpc_allocate_resources
> JobId=1582 NodeList=compute-0-5 usec=977
> [2018-11-17T20:52:11.123] _job_complete: JobId=1582 WEXITSTATUS 1
> [2018-11-17T20:52:11.123] priority_p_job_end: called for job 1582
> [2018-11-17T20:52:11.123] job 1582 ran for 1 seconds with TRES counts of
> [2018-11-17T20:52:11.123] TRES cpu: 6
> [2018-11-17T20:52:11.123] TRES mem: 8192
> [2018-11-17T20:52:11.123] TRES node: 1
> [2018-11-17T20:52:11.123] TRES billing: 6
> [2018-11-17T20:52:11.123] _handle_qos_tres_run_secs: job 1582: Removed
> 15552000 unused seconds from QOS normal TRES cpu grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_qos_tres_run_secs: job 1582: Removed
> 21233664000 unused seconds from QOS normal TRES mem
> grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_qos_tres_run_secs: job 1582: Removed
> 2592000 unused seconds from QOS normal TRES node grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_qos_tres_run_secs: job 1582: Removed
> 15552000 unused seconds from QOS normal TRES billing
> grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_qos_tres_run_secs: job 1582: Removed 0
> unused seconds from QOS normal TRES fs/disk grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_qos_tres_run_secs: job 1582: Removed 0
> unused seconds from QOS normal TRES vmem grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_qos_tres_run_secs: job 1582: Removed 0
> unused seconds from QOS normal TRES pages grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_qos_tres_run_secs: job 1582: Removed 0
> unused seconds from QOS normal TRES gres/gpu grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] Adding 5.999997 new usage to assoc 42
> (y8/mahmood/ruby) raw usage is now 437603.824918. Group wall added
> 0.999999 making it 72831.944878.
> [2018-11-17T20:52:11.123] _handle_assoc_tres_run_secs: job 1582: Removed
> 15552000 unused seconds from assoc 42 TRES cpu grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_assoc_tres_run_secs: job 1582: Removed
> 21233664000 unused seconds from assoc 42 TRES mem grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_assoc_tres_run_secs: job 1582: Removed
> 2592000 unused seconds from assoc 42 TRES node grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_assoc_tres_run_secs: job 1582: Removed
> 15552000 unused seconds from assoc 42 TRES billing
> grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_assoc_tres_run_secs: job 1582: Removed
> 0 unused seconds from assoc 42 TRES fs/disk grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_assoc_tres_run_secs: job 1582: Removed
> 0 unused seconds from assoc 42 TRES vmem grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_assoc_tres_run_secs: job 1582: Removed
> 0 unused seconds from assoc 42 TRES pages grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_assoc_tres_run_secs: job 1582: Removed
> 0 unused seconds from assoc 42 TRES gres/gpu grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] Adding 5.999997 new usage to assoc 41
> (y8/(null)/(null)) raw usage is now 28311279.361228. Group wall added
> 0.999999 making it 1466496.669595.
> [2018-11-17T20:52:11.123] _handle_assoc_tres_run_secs: job 1582: Removed
> 15552000 unused seconds from assoc 41 TRES cpu grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_assoc_tres_run_secs: job 1582: Removed
> 21233664000 unused seconds from assoc 41 TRES mem grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_assoc_tres_run_secs: job 1582: Removed
> 2592000 unused seconds from assoc 41 TRES node grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_assoc_tres_run_secs: job 1582: Removed
> 15552000 unused seconds from assoc 41 TRES billing
> grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.124] _handle_assoc_tres_run_secs: job 1582: Removed
> 0 unused seconds from assoc 41 TRES fs/disk grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.124] _handle_assoc_tres_run_secs: job 1582: Removed
> 0 unused seconds from assoc 41 TRES vmem grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.124] _handle_assoc_tres_run_secs: job 1582: Removed
> 0 unused seconds from assoc 41 TRES pages grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.124] _handle_assoc_tres_run_secs: job 1582: Removed
> 0 unused seconds from assoc 41 TRES gres/gpu grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.124] Adding 5.999997 new usage to assoc 1
> (root/(null)/(null)) raw usage is now 107651994.109022. Group wall
> added 0.999999 making it 4989938.597661.
> [2018-11-17T20:52:11.124] _handle_assoc_tres_run_secs: job 1582: Removed
> 15552000 unused seconds from assoc 1 TRES cpu grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.124] _handle_assoc_tres_run_secs: job 1582: Removed
> 21233664000 unused seconds from assoc 1 TRES mem grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.124] _handle_assoc_tres_run_secs: job 1582: Removed
> 2592000 unused seconds from assoc 1 TRES node grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.124] _handle_assoc_tres_run_secs: job 1582: Removed
> 15552000 unused seconds from assoc 1 TRES billing grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.124] _handle_assoc_tres_run_secs: job 1582: Removed
> 0 unused seconds from assoc 1 TRES fs/disk grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.124] _handle_assoc_tres_run_secs: job 1582: Removed
> 0 unused seconds from assoc 1 TRES vmem grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.124] _handle_assoc_tres_run_secs: job 1582: Removed
> 0 unused seconds from assoc 1 TRES pages grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.124] _handle_assoc_tres_run_secs: job 1582: Removed
> 0 unused seconds from assoc 1 TRES gres/gpu grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.124] _job_complete: JobId=1582 done
>
> All of that happened with the following two entries in slurm.conf:
>
> PrologFlags=x11
> X11Parameters=local_xauthority
>
>
>
> Regards,
> Mahmood
>
>
>