We switched from using systemctl for tmp.mount to zram, e.g.:
modprobe zram
echo 20GB > /sys/block/zram0/disksize
mkfs.xfs /dev/zram0
mount -o discard /dev/zram0 /tmp
srun with --x11 was working before changing this. We're on RHEL 9.
slurmctld logs show this whenever --x11 is used with srun:
[2024-02-23T20:22:43.442] [529.extern] error: setup_x11_forward: failed to create temporary XAUTHORITY file: Permission denied
[2024-02-23T20:22:43.442] [529.extern] error: x11 port forwarding setup failed
[2024-02-23T20:22:43.442] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable
[2024-02-23T20:22:43.443] Could not launch job 529 and not able to requeue it, cancelling job
[2024-02-23T20:26:15.881] [530.extern] error: setup_x11_forward: failed to create temporary XAUTHORITY file: Permission denied
[2024-02-23T20:26:15.881] [530.extern] error: x11 port forwarding setup failed
[2024-02-23T20:26:15.882] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable
[2024-02-23T20:26:15.883] Could not launch job 530 and not able to requeue it, cancelling job
slurmd log entries from a node:
[2024-02-23T20:26:15.859] sched: _slurm_rpc_allocate_resources JobId=530 NodeList=2402-node005 usec=1800
[2024-02-23T20:26:15.882] _slurm_rpc_requeue: Requeue of JobId=530 returned an error: Only batch jobs are accepted or processed
[2024-02-23T20:26:15.883] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=530 uid 0
[2024-02-23T20:26:15.962] _slurm_rpc_complete_job_allocation: JobId=530 error Job/step already completing or completed
srun -v --pty -t 0-4:00 --x11 --mem=10g
srun: defined options
srun: -------------------- --------------------
srun: account : me
srun: mem : 10G
srun: nodelist : our-node
srun: pty :
srun: time : 04:00:00
srun: verbose : 1
srun: x11 : all
srun: -------------------- --------------------
srun: end of defined options
srun: Waiting for resource configuration
srun: error: Nodes our-node are still not ready
srun: error: Something is wrong with the boot of the nodes.
slurm.conf has PrologFlags=x11 set. /usr/bin/xauth is installed on each compute node.
Is this a known issue with zram or is that just a red herring and there's something else wrong?
Hi Robert,
On 2/23/24 17:38, Robert Kudyba via slurm-users wrote:
We switched from using systemctl for tmp.mount to zram, e.g.:
modprobe zram
echo 20GB > /sys/block/zram0/disksize
mkfs.xfs /dev/zram0
mount -o discard /dev/zram0 /tmp
[...]
[2024-02-23T20:26:15.881] [530.extern] error: setup_x11_forward:
failed to create temporary XAUTHORITY file: Permission denied
Where do you set the permissions on /tmp ? What do you set them to?
All the best, Chris
<<<Where do you set the permissions on /tmp ? What do you set them to?<<<
For now I just set it to chmod 777 on /tmp and that fixed the errors. Is there a better option?
-- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-leave@lists.schedmd.com
On 24/2/24 06:14, Robert Kudyba via slurm-users wrote:
For now I just set it to chmod 777 on /tmp and that fixed the errors. Is there a better option?
Traditionally /tmp and /var/tmp have been 1777. The leading "1" is the sticky bit: it was originally invented to tell the OS to try to keep a frequently used binary in memory, and was later adopted to mark a world-writable directory for special handling, so that users can only unlink objects they own and not those of others.
Hope that helps!
All the best, Chris
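A minimal sketch of how the 1777 advice might be applied once the new filesystem is mounted. A freshly created filesystem's root directory is typically owned by root and not world-writable, which would be consistent with the "Permission denied" error above, so the mode has to be set again after each mount. The helper name ensure_sticky_tmp is hypothetical (it is not part of Slurm or zram), and the use of GNU coreutils stat is an assumption that should hold on RHEL 9.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): make sure a directory has the
# traditional /tmp mode 1777 (world-writable plus sticky bit).
# Assumes GNU coreutils `stat`, which understands -c '%a'.
ensure_sticky_tmp() {
    dir=$1
    # %a prints the octal permission bits, e.g. 755 or 1777.
    mode=$(stat -c '%a' "$dir") || return 1
    if [ "$mode" != "1777" ]; then
        chmod 1777 "$dir" || return 1
    fi
    # Print the resulting mode so the caller can log or verify it.
    stat -c '%a' "$dir"
}
```

One possible use is to call it as root right after the mount step from the original post, i.e. after "mount -o discard /dev/zram0 /tmp", as "ensure_sticky_tmp /tmp". Unlike chmod 777, this keeps users from removing each other's files.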
<<<Traditionally /tmp and /var/tmp have been 1777<<<
Ah yes, thanks for pointing that out. Hope this helps someone down the line. Perhaps the error detection could be more explicit in slurmctld?
Now what would be causing this? The srun just hangs and these are the only logs from slurmctld:
[2024-02-24T23:23:26.003] error: Orphan StepId=463.extern reported on node node007
[2024-02-24T23:23:26.003] error: Orphan StepId=463.extern reported on node node006
[2024-02-24T23:23:26.003] error: Orphan StepId=463.extern reported on node node005
[2024-02-24T23:23:26.003] error: Orphan StepId=463.extern reported on node node009
[2024-02-24T23:23:26.003] error: Orphan StepId=463.extern reported on node node008
[2024-02-24T23:43:21.183] _slurm_rpc_complete_job_allocation: JobId=563 error Job/step already completing or completed
[465.extern] error: common_file_write_content: unable to open '/sys/fs/cgroup/system.slice/slurmstepd.scope/job_463/step_extern/user/cgroup.freeze' for writing: Permission denied