[slurm-users] srun --x11 connection rejected because of wrong authentication

Mon Jun 11 12:05:46 MDT 2018

Hi Hadrian,

Thank you, but unfortunately that is not the issue. We can connect to the nodes outside of Slurm, and X11 works properly there.

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167

On 6/7/18, 6:49 PM, "slurm-users on behalf of Hadrian Djohari" <slurm-users-bounces at lists.schedmd.com on behalf of hxd58 at case.edu> wrote:

    Hi,

    I do not remember whether we had the same error message.
    But if the user's known_hosts file has an old entry for the node they are trying to connect to, X11 won't connect properly.
    Once the known_hosts entry has been deleted, X11 connects just fine.

    Hadrian

    On Thu, Jun 7, 2018 at 6:26 PM, Christopher Benjamin Coffey
    <Chris.Coffey at nau.edu> wrote:

    Hi,

    I've compiled Slurm 17.11.7 with X11 support. We can ssh to a node from the login node and get xeyes to work, etc. However, srun --x11 xeyes results in:

    [cbc@wind ~ ]$ srun --x11 --reservation=root_58 xeyes
    X11 connection rejected because of wrong authentication.
    Error: Can't open display: localhost:60.0
    srun: error: cn100: task 0: Exited with exit code 1

    On the node in slurmd.log it says:

    [2018-06-07T15:04:29.932] _run_prolog: run job script took usec=1
    [2018-06-07T15:04:29.932] _run_prolog: prolog with lock for job 11806306 ran for 0 seconds
    [2018-06-07T15:04:29.957] [11806306.extern] task/cgroup: /slurm/uid_3301/job_11806306: alloc=1000MB mem.limit=1000MB memsw.limit=1000MB
    [2018-06-07T15:04:29.957] [11806306.extern] task/cgroup: /slurm/uid_3301/job_11806306/step_extern: alloc=1000MB mem.limit=1000MB memsw.limit=1000MB
    [2018-06-07T15:04:30.138] [11806306.extern] X11 forwarding established on DISPLAY=cn100:60.0
    [2018-06-07T15:04:30.239] launch task 11806306.0 request from 3301.3302@172.16.3.21 (port 32453)
    [2018-06-07T15:04:30.240] lllp_distribution jobid [11806306] implicit auto binding: cores,one_thread, dist 1
    [2018-06-07T15:04:30.240] _task_layout_lllp_cyclic 
    [2018-06-07T15:04:30.240] _lllp_generate_cpu_bind jobid [11806306]: mask_cpu,one_thread, 0x0000001
    [2018-06-07T15:04:30.268] [11806306.0] task/cgroup: /slurm/uid_3301/job_11806306: alloc=1000MB mem.limit=1000MB memsw.limit=1000MB
    [2018-06-07T15:04:30.268] [11806306.0] task/cgroup: /slurm/uid_3301/job_11806306/step_0: alloc=1000MB mem.limit=1000MB memsw.limit=1000MB
    [2018-06-07T15:04:30.303] [11806306.0] task_p_pre_launch: Using sched_affinity for tasks
    [2018-06-07T15:04:30.310] [11806306.extern] error: _handle_channel: remote disconnected
    [2018-06-07T15:04:30.310] [11806306.extern] error: _handle_channel: exiting thread
    [2018-06-07T15:04:30.376] [11806306.0] done with job
    [2018-06-07T15:04:30.413] [11806306.extern] x11 forwarding shutdown complete
    [2018-06-07T15:04:30.443] [11806306.extern] _oom_event_monitor: oom-kill event count: 1
    [2018-06-07T15:04:30.508] [11806306.extern] done with job
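
When reading a log like the one above, it can help to isolate just the extern-step lines that concern X11 forwarding. The following is a minimal sketch; x11_events is a hypothetical helper name, not a Slurm tool, and it only filters text passed on stdin.

```shell
# Hypothetical helper (not part of Slurm): read slurmd.log text on stdin and
# print only the extern-step lines for one job that mention X11 forwarding
# or the forwarding channel handler.
x11_events() {
    jobid=$1
    # First keep lines tagged with this job's extern step, then keep those
    # that mention x11 (any case) or _handle_channel.
    grep -F "[${jobid}.extern]" | grep -Ei 'x11|_handle_channel'
}
```

Assuming the log lives at the path set by SlurmdLogFile in slurm.conf (the path below is only an example), it could be used as: x11_events 11806306 < /var/log/slurmd.log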

    It seems like it's close: srun and the node agree on the display number (60), but slurmd reports the node name (cn100:60.0) while srun tries to connect via localhost (localhost:60.0). Maybe I have an ssh setting wrong somewhere? I believe I've tried all combinations in ssh_config and sshd_config. There are no issues with /home either; it's a shared filesystem that each node mounts, and we even tried no_root_squash so root can write to the .Xauthority file, as some folks have suggested.
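
To make the comparison of the two DISPLAY strings concrete: by the usual X11 convention, display N over TCP corresponds to port 6000+N. The sketch below splits a DISPLAY value into its host part and that port. display_endpoint is a hypothetical helper name, and whether the host difference is actually the cause of the failure is not established in this thread.

```shell
# Hypothetical helper: split an X11 DISPLAY string (host:display.screen)
# into its host part and the conventional TCP port (6000 + display number).
# An empty host part (e.g. ":0") normally means a local socket, not TCP.
display_endpoint() {
    disp=$1
    host=${disp%%:*}     # everything before the first colon
    num=${disp##*:}      # "display.screen" after the last colon
    num=${num%%.*}       # drop the optional ".screen" suffix
    printf '%s %s\n' "$host" "$((6000 + num))"
}

display_endpoint localhost:60.0   # value from the srun error message
display_endpoint cn100:60.0       # value from slurmd.log
```

Both values name the same display number and differ only in the host part.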

    Also, xauth list shows that there was no magic cookie written for host cn100:

    [cbc@wind ~ ]$ xauth list
    wind.hpc.nau.edu/unix:14  MIT-MAGIC-COOKIE-1  ac4a0f1bfe9589806f81dd45306ee33d
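
The check above (is there any cookie entry for a given host?) can be scripted. This is a minimal sketch; has_cookie_for is a hypothetical helper name, and it assumes entries in the form printed by xauth list, where the first field is host/unix:N or host:N.

```shell
# Hypothetical helper: read `xauth list` output on stdin and succeed (exit 0)
# if any entry's display name belongs to the given host, matching either the
# exact name or a short name that prefixes a fully qualified one.
has_cookie_for() {
    awk -v host="$1" '
        { name = $1; sub(/[\/:].*$/, "", name) }   # keep only the host part
        name == host || index(name, host ".") == 1 { found = 1 }
        END { exit !found }
    '
}
```

For example: xauth list | has_cookie_for cn100 || echo "no cookie for cn100"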

    Is something preventing root from writing the magic cookie? The file is definitely writable:

    [root@cn100 ~]# touch /home/cbc/.Xauthority
    [root@cn100 ~]#

    Anyone have any ideas? Thanks!

    Best,
    Chris

    —
    Christopher Coffey
    High-Performance Computing
    Northern Arizona University
    928-523-1167

    -- 
    Hadrian Djohari
    Manager of Research Computing Services, [U]Tech
    Case Western Reserve University
    (W): 216-368-0395
    (M): 216-798-7490