We switched /tmp over from a systemd tmp.mount unit to zram, roughly:

modprobe zram
echo 20GB > /sys/block/zram0/disksize
mkfs.xfs /dev/zram0
mount -o discard /dev/zram0 /tmp
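One thing I'm not sure we've ruled out: a freshly created XFS filesystem mounts with a root-owned 0755 root directory, not the 1777 mode /tmp normally has, and slurmstepd's X11 setup presumably creates its temporary XAUTHORITY file under /tmp as the job user. If that's the issue, something like the sketch below would need to run after the mount (the chmod is my assumption; we are not currently doing it):

# Check what the zram-backed /tmp actually looks like; a normal /tmp is 1777 root:root.
stat -c '%a %U:%G %n' /tmp

# A fresh XFS root directory comes up 0755, so restore the sticky, world-writable mode.
# (Assumption: this step has NOT been run after the remount.)
chmod 1777 /tmp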
srun with --x11 was working before this change. We're on RHEL 9.
slurmctld logs show this whenever --x11 is used with srun:

[2024-02-23T20:22:43.442] [529.extern] error: setup_x11_forward: failed to create temporary XAUTHORITY file: Permission denied
[2024-02-23T20:22:43.442] [529.extern] error: x11 port forwarding setup failed
[2024-02-23T20:22:43.442] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable
[2024-02-23T20:22:43.443] Could not launch job 529 and not able to requeue it, cancelling job
[2024-02-23T20:26:15.881] [530.extern] error: setup_x11_forward: failed to create temporary XAUTHORITY file: Permission denied
[2024-02-23T20:26:15.881] [530.extern] error: x11 port forwarding setup failed
[2024-02-23T20:26:15.882] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable
[2024-02-23T20:26:15.883] Could not launch job 530 and not able to requeue it, cancelling job
slurmd log entries from a node:

[2024-02-23T20:26:15.859] sched: _slurm_rpc_allocate_resources JobId=530 NodeList=2402-node005 usec=1800
[2024-02-23T20:26:15.882] _slurm_rpc_requeue: Requeue of JobId=530 returned an error: Only batch jobs are accepted or processed
[2024-02-23T20:26:15.883] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=530 uid 0
[2024-02-23T20:26:15.962] _slurm_rpc_complete_job_allocation: JobId=530 error Job/step already completing or completed
What the user sees:

srun -v --pty -t 0-4:00 --x11 --mem=10g

srun: defined options
srun: -------------------- --------------------
srun: account             : me
srun: mem                 : 10G
srun: nodelist            : our-node
srun: pty                 :
srun: time                : 04:00:00
srun: verbose             : 1
srun: x11                 : all
srun: -------------------- --------------------
srun: end of defined options
srun: Waiting for resource configuration
srun: error: Nodes our-node are still not ready
srun: error: Something is wrong with the boot of the nodes.
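In case it's useful, this is roughly how we've been checking node state whenever that "still not ready" error shows up (node name is from our setup):

# Show any drain/down reasons Slurm has recorded for nodes.
sinfo -R

# Full node record, including State= and Reason= for the node in question.
scontrol show node our-node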
slurm.conf has PrologFlags=x11 set. /usr/bin/xauth is installed on each compute node.
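For reference, this is how we verified both on a compute node; nothing here is site-specific beyond the hostname:

# Confirm the running daemons actually picked up the flag from slurm.conf.
scontrol show config | grep -i PrologFlags

# Confirm xauth is present and runnable on the node.
command -v xauth && xauth -V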
Is this a known issue with zram, or is that a red herring and something else is wrong?