[slurm-users] About x11 support

Tina Friedrich tina.friedrich at it.ox.ac.uk
Mon Nov 19 03:01:32 MST 2018


Hello,

Two things. First, you don't actually seem to have the '--x11' flag on 
your srun command. Does 'srun --x11 --nodelist=compute-0-5 -n 1 -c 6 
--mem=8G -A y8 -p RUBY xclock' get you any further?

Second, I had some trouble getting the built-in X forwarding to work, 
which turned out to be down to hostnames and xauth magic cookies.

If you do something like

srun --x11 --pty /bin/bash

to just get an interactive session, and then run


  xauth list | grep $(hostname)


(note: $(hostname), not $HOSTNAME - you want the compute node's 
hostname, and $HOSTNAME may still be inherited from the submission 
host's environment)

does that find a cookie for your session, i.e. does it print anything?

If it does, you should be good; try running 'xclock' or something from 
that session. Needless to say, if you haven't got a magic cookie, it 
won't work.
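
To summarise the whole check in one go (a sketch - add your usual 
partition/account flags to the srun line):

  srun --x11 --pty /bin/bash        # interactive session with X11 forwarding
  xauth list | grep "$(hostname)"   # should print a cookie for the compute node
  xclock                            # if a cookie was printed, this should appear on your display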

Tina


On 17/11/2018 17:24, Mahmood Naderan wrote:
>  >What does this command say?
>  >scontrol show config | fgrep PrologFlags
> 
> [root@rocks7 ~]#  scontrol show config | fgrep PrologFlags
> PrologFlags             = Alloc,Contain,X11
> 
> That means X11 support has been compiled into the code (while Werner 
> created the roll).
> 
> 
> 
> 
>>Check your slurmd logs on the compute node.  What errors are there?
> 
> In one terminal, I run the following command
> 
> [mahmood@rocks7 ~]$ srun --nodelist=compute-0-5 -n 1 -c 6 --mem=8G -A y8 
> -p RUBY xclock
> Error: Can't open display :1
> srun: error: compute-0-5: task 0: Exited with exit code 1
> 
> At the same time, in another terminal I see this
> 
> [root@compute-0-5 ~]# tail -f /var/log/slurm/slurmd.log
> [2018-11-17T20:47:23.017] _run_prolog: run job script took usec=4
> [2018-11-17T20:47:23.017] _run_prolog: prolog with lock for job 1580 ran 
> for 0 seconds
> [2018-11-17T20:47:23.131] launch task 1580.0 request from UID:1000 
> GID:1000 HOST:10.1.1.1 PORT:54950
> [2018-11-17T20:47:23.131] lllp_distribution jobid [1580] implicit auto 
> binding: sockets,one_thread, dist 1
> [2018-11-17T20:47:23.131] _task_layout_lllp_cyclic
> [2018-11-17T20:47:23.131] _lllp_generate_cpu_bind jobid [1580]: 
> mask_cpu,one_thread, 0x00000070000007
> [2018-11-17T20:47:23.204] [1580.0] task_p_pre_launch: Using 
> sched_affinity for tasks
> [2018-11-17T20:47:23.231] [1580.0] done with job
> [2018-11-17T20:47:23.263] [1580.extern] done with job
> ^C
> 
> 
> 
> Also, at the same time, I see this in the frontend log
> 
> [root@rocks7 ~]# tail -f /var/log/slurm/slurmctld.log
> [2018-11-17T20:52:10.908] Fairhare priority of job 1582 for user mahmood 
> in acct y8 is 0.242424
> [2018-11-17T20:52:10.908] Weighted Age priority is 0.000000 * 10 = 0.00
> [2018-11-17T20:52:10.908] Weighted Fairshare priority is 0.242424 * 
> 10000 = 2424.24
> [2018-11-17T20:52:10.908] Weighted JobSize priority is 0.097756 * 100 = 9.78
> [2018-11-17T20:52:10.908] Weighted Partition priority is 0.001000 * 
> 10000 = 10.00
> [2018-11-17T20:52:10.908] Weighted QOS priority is 0.000000 * 0 = 0.00
> [2018-11-17T20:52:10.908] Weighted TRES:cpu is 0.041667 * 2000.00 = 83.33
> [2018-11-17T20:52:10.908] Weighted TRES:mem is 0.031884 * 1.00 = 0.03
> [2018-11-17T20:52:10.908] Job 1582 priority: 0.00 + 2424.24 + 9.78 + 
> 10.00 + 0.00 + 83 - 0 = 2527.38
> [2018-11-17T20:52:10.909] BillingWeight: JobId=1582 is either new or it 
> was resized
> [2018-11-17T20:52:10.909] sched: _slurm_rpc_allocate_resources 
> JobId=1582 NodeList=compute-0-5 usec=977
> [2018-11-17T20:52:11.123] _job_complete: JobId=1582 WEXITSTATUS 1
> [2018-11-17T20:52:11.123] priority_p_job_end: called for job 1582
> [2018-11-17T20:52:11.123] job 1582 ran for 1 seconds with TRES counts of
> [2018-11-17T20:52:11.123] TRES cpu: 6
> [2018-11-17T20:52:11.123] TRES mem: 8192
> [2018-11-17T20:52:11.123] TRES node: 1
> [2018-11-17T20:52:11.123] TRES billing: 6
> [2018-11-17T20:52:11.123] _handle_qos_tres_run_secs: job 1582: Removed 
> 15552000 unused seconds from QOS normal TRES cpu grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_qos_tres_run_secs: job 1582: Removed 
> 21233664000 unused seconds from QOS normal TRES mem 
> grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_qos_tres_run_secs: job 1582: Removed 
> 2592000 unused seconds from QOS normal TRES node grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_qos_tres_run_secs: job 1582: Removed 
> 15552000 unused seconds from QOS normal TRES billing 
> grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_qos_tres_run_secs: job 1582: Removed 0 
> unused seconds from QOS normal TRES fs/disk grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_qos_tres_run_secs: job 1582: Removed 0 
> unused seconds from QOS normal TRES vmem grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_qos_tres_run_secs: job 1582: Removed 0 
> unused seconds from QOS normal TRES pages grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_qos_tres_run_secs: job 1582: Removed 0 
> unused seconds from QOS normal TRES gres/gpu grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] Adding 5.999997 new usage to assoc 42 
> (y8/mahmood/ruby) raw usage is now 437603.824918.  Group wall added 
> 0.999999 making it 72831.944878.
> [2018-11-17T20:52:11.123] _handle_assoc_tres_run_secs: job 1582: Removed 
> 15552000 unused seconds from assoc 42 TRES cpu grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_assoc_tres_run_secs: job 1582: Removed 
> 21233664000 unused seconds from assoc 42 TRES mem grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_assoc_tres_run_secs: job 1582: Removed 
> 2592000 unused seconds from assoc 42 TRES node grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_assoc_tres_run_secs: job 1582: Removed 
> 15552000 unused seconds from assoc 42 TRES billing 
> grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_assoc_tres_run_secs: job 1582: Removed 
> 0 unused seconds from assoc 42 TRES fs/disk grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_assoc_tres_run_secs: job 1582: Removed 
> 0 unused seconds from assoc 42 TRES vmem grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_assoc_tres_run_secs: job 1582: Removed 
> 0 unused seconds from assoc 42 TRES pages grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_assoc_tres_run_secs: job 1582: Removed 
> 0 unused seconds from assoc 42 TRES gres/gpu grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] Adding 5.999997 new usage to assoc 41 
> (y8/(null)/(null)) raw usage is now 28311279.361228.  Group wall added 
> 0.999999 making it 1466496.669595.
> [2018-11-17T20:52:11.123] _handle_assoc_tres_run_secs: job 1582: Removed 
> 15552000 unused seconds from assoc 41 TRES cpu grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_assoc_tres_run_secs: job 1582: Removed 
> 21233664000 unused seconds from assoc 41 TRES mem grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_assoc_tres_run_secs: job 1582: Removed 
> 2592000 unused seconds from assoc 41 TRES node grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.123] _handle_assoc_tres_run_secs: job 1582: Removed 
> 15552000 unused seconds from assoc 41 TRES billing 
> grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.124] _handle_assoc_tres_run_secs: job 1582: Removed 
> 0 unused seconds from assoc 41 TRES fs/disk grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.124] _handle_assoc_tres_run_secs: job 1582: Removed 
> 0 unused seconds from assoc 41 TRES vmem grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.124] _handle_assoc_tres_run_secs: job 1582: Removed 
> 0 unused seconds from assoc 41 TRES pages grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.124] _handle_assoc_tres_run_secs: job 1582: Removed 
> 0 unused seconds from assoc 41 TRES gres/gpu grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.124] Adding 5.999997 new usage to assoc 1 
> (root/(null)/(null)) raw usage is now 107651994.109022.  Group wall 
> added 0.999999 making it 4989938.597661.
> [2018-11-17T20:52:11.124] _handle_assoc_tres_run_secs: job 1582: Removed 
> 15552000 unused seconds from assoc 1 TRES cpu grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.124] _handle_assoc_tres_run_secs: job 1582: Removed 
> 21233664000 unused seconds from assoc 1 TRES mem grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.124] _handle_assoc_tres_run_secs: job 1582: Removed 
> 2592000 unused seconds from assoc 1 TRES node grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.124] _handle_assoc_tres_run_secs: job 1582: Removed 
> 15552000 unused seconds from assoc 1 TRES billing grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.124] _handle_assoc_tres_run_secs: job 1582: Removed 
> 0 unused seconds from assoc 1 TRES fs/disk grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.124] _handle_assoc_tres_run_secs: job 1582: Removed 
> 0 unused seconds from assoc 1 TRES vmem grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.124] _handle_assoc_tres_run_secs: job 1582: Removed 
> 0 unused seconds from assoc 1 TRES pages grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.124] _handle_assoc_tres_run_secs: job 1582: Removed 
> 0 unused seconds from assoc 1 TRES gres/gpu grp_used_tres_run_secs = 0
> [2018-11-17T20:52:11.124] _job_complete: JobId=1582 done
> 
> 
> 
> 
> 
> All of the above happened with the following two entries in slurm.conf:
> 
> PrologFlags=x11
> X11Parameters=local_xauthority
> 
> 
> 
> Regards,
> Mahmood
> 
> 
> 
