[slurm-users] Cannot run interactive jobs
Manalo, Kevin L
kevinlee at gatech.edu
Tue Apr 6 13:55:14 UTC 2021
For those who may have run into this: I found a reason why srun cannot run interactive jobs, and it is not necessarily related to RHEL/CentOS 7.
If you strace slurmd, you may see the following (note the third argument, the GID):
chown("/dev/pts/1", 1326, 7) = -1 EPERM (Operation not permitted)
In my case I had something similar:
chown("/dev/pts/1", 1326, 0) = -1 EPERM (Operation not permitted)
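The failing chown comes from glibc's grantpt(), which slurmd triggers via openpty() and which chowns the pty slave to the group named "tty". A quick sanity check (my suggestion, not part of the original report) is to confirm what GID the tty group actually resolves to on the node and compare it with the group ownership of existing pty devices:

```shell
# What GID does the tty group resolve to? It should be 5 on a
# standard Linux system; anything else suggests a bad group lookup.
getent group tty

# Compare with the group ownership of existing pty slaves
# (may be empty if no ptys are allocated; ignore errors).
ls -l /dev/pts/ 2>/dev/null || true
```

If `getent group tty` reports a GID other than 5, slurmd's chown of the pty slave can fail with EPERM exactly as in the strace output above.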
For our site, this report was also helpful
The tty group was mapped to GID 7 in Sajesh's case; it should always be GID 5. At our site, the problem was that /etc/group was large and the tty group was not being read in properly.
The fix for us was to re-sort the group file by GID, so that the tty entry fell on line 5.
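A minimal sketch of that resort (my own commands, not from the original post): sort a copy of /etc/group numerically on the GID field (field 3, colon-separated), then inspect it before replacing the real file. Keep a backup; a corrupted /etc/group can lock you out.

```shell
# Sort /etc/group numerically by the GID column (field 3).
# Writes to a scratch file so the live file is untouched.
sort -t: -k3,3 -n /etc/group > /tmp/group.sorted

# Inspect the first entries: with GIDs 0-4 present, the tty
# line (GID 5) should now sit on line 5 (ignoring line 1's gid 0... depends on your entries).
head -n 6 /tmp/group.sorted
```

Only copy the sorted file over /etc/group after verifying every entry survived intact (e.g. `diff <(sort /etc/group) <(sort /tmp/group.sorted)` should be empty).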
Hope this helps,
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Sajesh Singh <ssingh at amnh.org>
Date: Wednesday, March 25, 2020 at 2:23 AM
To: slurm-users at schedmd.com <slurm-users at schedmd.com>
Subject: [slurm-users] Cannot run interactive jobs
When trying to run an interactive job I am getting the following error:
srun: error: task 0 launch failed: Slurmd could not connect IO
Checking the log file on the compute node I see the following error:
[2020-03-25T01:42:08.262] launch task 13.0 request from UID:1326 GID:50000 HOST:192.168.229.254 PORT:14980
[2020-03-25T01:42:08.262] lllp_distribution jobid  implicit auto binding: cores,one_thread, dist 8192
[2020-03-25T01:42:08.262] _lllp_generate_cpu_bind jobid : mask_cpu,one_thread, 0x0000000000000001
[2020-03-25T01:42:08.262] _run_prolog: run job script took usec=5
[2020-03-25T01:42:08.262] _run_prolog: prolog with lock for job 13 ran for 0 seconds
[2020-03-25T01:42:08.272] [13.0] Considering each NUMA node as a socket
[2020-03-25T01:42:08.310] [13.0] error: stdin openpty: Operation not permitted
[2020-03-25T01:42:08.311] [13.0] error: IO setup failed: Operation not permitted
[2020-03-25T01:42:08.311] [13.0] error: job_manager exiting abnormally, rc = 4021
[2020-03-25T01:42:08.315] [13.0] done with job
When doing the same on a CentOS 7.3 / Slurm 18.08.4 cluster, the interactive job runs as expected.
Any advice on how to remedy this would be appreciated.