slurm-23.11.3-1 with X11 and zram causing permission errors: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable; Requeue of JobId=530 returned an error: Only batch jobs are accepted or processed
We switched over from using systemctl for tmp.mount and changed to zram, e.g.:

    modprobe zram
    echo 20GB > /sys/block/zram0/disksize
    mkfs.xfs /dev/zram0
    mount -o discard /dev/zram0 /tmp

srun with --x11 was working before this change. We're on RHEL 9. slurmctld logs show this whenever --x11 is used with srun:

    [2024-02-23T20:22:43.442] [529.extern] error: setup_x11_forward: failed to create temporary XAUTHORITY file: Permission denied
    [2024-02-23T20:22:43.442] [529.extern] error: x11 port forwarding setup failed
    [2024-02-23T20:22:43.442] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable
    [2024-02-23T20:22:43.443] Could not launch job 529 and not able to requeue it, cancelling job
    [2024-02-23T20:26:15.881] [530.extern] error: setup_x11_forward: failed to create temporary XAUTHORITY file: Permission denied
    [2024-02-23T20:26:15.881] [530.extern] error: x11 port forwarding setup failed
    [2024-02-23T20:26:15.882] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable
    [2024-02-23T20:26:15.883] Could not launch job 530 and not able to requeue it, cancelling job

slurmd log entries from a node:

    [2024-02-23T20:26:15.859] sched: _slurm_rpc_allocate_resources JobId=530 NodeList=2402-node005 usec=1800
    [2024-02-23T20:26:15.882] _slurm_rpc_requeue: Requeue of JobId=530 returned an error: Only batch jobs are accepted or processed
    [2024-02-23T20:26:15.883] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=530 uid 0
    [2024-02-23T20:26:15.962] _slurm_rpc_complete_job_allocation: JobId=530 error Job/step already completing or completed

And this is what srun shows:

    srun -v --pty -t 0-4:00 --x11 --mem=10g
    srun: defined options
    srun: -------------------- --------------------
    srun: account     : me
    srun: mem         : 10G
    srun: nodelist    : our-node
    srun: pty         :
    srun: time        : 04:00:00
    srun: verbose     : 1
    srun: x11         : all
    srun: -------------------- --------------------
    srun: end of defined options
    srun: Waiting for resource configuration
    srun: error: Nodes our-node are still not ready
    srun: error: Something is wrong with the boot of the nodes.

slurm.conf has PrologFlags=x11 set. /usr/bin/xauth is installed on each compute node. Is this a known issue with zram, or is that just a red herring and something else is wrong?
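One thing worth checking here (an assumption at this point, not something the logs prove by themselves): mkfs.xfs gives the new filesystem a root-owned root directory with mode 0755, so once /dev/zram0 is mounted on /tmp, unprivileged processes — such as the job's extern step trying to create its temporary XAUTHORITY file — can no longer create files there. A small demonstration of the two modes, using a scratch directory to stand in for the freshly mounted /tmp:

```shell
#!/bin/sh
# Demonstrate the difference between a freshly made filesystem root (0755)
# and the conventional world-writable /tmp mode (1777), without needing
# root or a real zram device: a scratch directory stands in for /tmp.
scratch=$(mktemp -d)

chmod 755 "$scratch"            # what a new mkfs.xfs root directory looks like
stat -c '%a' "$scratch"         # prints: 755  (only the owner may create files)

chmod 1777 "$scratch"           # sticky bit + world-writable
stat -c '%a' "$scratch"         # prints: 1777 (anyone may create files)

rm -rf "$scratch"
```

If `stat -c '%a' /tmp` on a compute node prints 755 after the zram mount, that would explain the "Permission denied" from setup_x11_forward.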
Hi Robert,

On 2/23/24 17:38, Robert Kudyba via slurm-users wrote:

> We switched over from using systemctl for tmp.mount and change to zram, e.g.:
>     modprobe zram
>     echo 20GB > /sys/block/zram0/disksize
>     mkfs.xfs /dev/zram0
>     mount -o discard /dev/zram0 /tmp
> [...]
> [2024-02-23T20:26:15.881] [530.extern] error: setup_x11_forward: failed to create temporary XAUTHORITY file: Permission denied

Where do you set the permissions on /tmp? What do you set them to?

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
> Where do you set the permissions on /tmp? What do you set them to?

For now I just set /tmp to chmod 777 and that fixed the errors. Is there a better option?
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-leave@lists.schedmd.com
On 24/2/24 06:14, Robert Kudyba via slurm-users wrote:
> For now I just set it to chmod 777 on /tmp and that fixed the errors. Is there a better option?

Traditionally /tmp and /var/tmp have been mode 1777 (that "1" being the sticky bit, originally invented to indicate that the OS should attempt to keep a frequently used binary in memory, but later adopted to indicate special handling of a world-writable directory so that users can only unlink objects they own and not others').

Hope that helps!

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
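Applied to the zram setup from the start of the thread, that means one extra step after the mount — a sketch only, reusing the device, size, and filesystem exactly as in Robert's original commands (this is a root-only provisioning fragment, not something to run as a regular user):

```shell
# Recreate the zram-backed /tmp, then restore the conventional permissions
# that a freshly made XFS root directory (root-owned, mode 0755) lacks.
modprobe zram
echo 20GB > /sys/block/zram0/disksize
mkfs.xfs /dev/zram0
mount -o discard /dev/zram0 /tmp
chmod 1777 /tmp    # sticky bit + world-writable: the traditional /tmp mode
```

Mode 1777 is preferable to the 777 workaround above because the sticky bit stops users from deleting or renaming each other's files in the shared directory.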
> Traditionally /tmp and /var/tmp have been 1777

Ah yes, thanks for pointing that out. Hope this helps someone down the line... perhaps the error detection could be more explicit in slurmctld?
Now what would be causing this? The srun just hangs, and these are the only logs from slurmctld:

    [2024-02-24T23:23:26.003] error: Orphan StepId=463.extern reported on node node007
    [2024-02-24T23:23:26.003] error: Orphan StepId=463.extern reported on node node006
    [2024-02-24T23:23:26.003] error: Orphan StepId=463.extern reported on node node005
    [2024-02-24T23:23:26.003] error: Orphan StepId=463.extern reported on node node009
    [2024-02-24T23:23:26.003] error: Orphan StepId=463.extern reported on node node008
    [2024-02-24T23:43:21.183] _slurm_rpc_complete_job_allocation: JobId=563 error Job/step already completing or completed
    [465.extern] error: common_file_write_content: unable to open '/sys/fs/cgroup/system.slice/slurmstepd.scope/job_463/step_extern/user/cgroup.freeze' for writing: Permission denied
participants (3):
- Chris Samuel
- Christopher Samuel
- Robert Kudyba