[slurm-users] how can users start their worker daemons using srun?

Priedhorsky, Reid reidpr at lanl.gov
Mon Aug 27 16:15:55 MDT 2018


Folks,

I am trying to figure out how to advise users on starting worker daemons in their allocations using srun. That is, I want to be able to run “srun foo”, where foo starts some child process and then exits, and the child process(es) persist and wait for work.

Use cases for this include Apache Spark and FUSE mounts. In general, it seems that there are a number of newer computing frameworks that have this model, in particular for the data science space.

We are on Slurm 17.02.10 with the proctrack/cgroup plugin.

I’m using a Python script foo.py to test this (included at end of e-mail). After forking, the parent exits immediately, and the child writes the numbers 0 through 9 at one-second intervals to /tmp/foo, then the word “done”, and then exits.

Desired behavior in a one-node allocation:

$ srun ./foo.py && sleep 12 && cat /tmp/foo
starting cn001.localdomain 79615
0
1
2
3
4
5
6
7
8
9
done

Actual behavior:

$ srun ./foo.py && sleep 12 && cat /tmp/foo
starting cn001.localdomain 79615
0

As far as I can tell, what is going on is that when foo.py exits, Slurm concludes that the job step is over and kills the child (note the “killing process 62158 (inherited_task) with signal 9” lines); see the debug log at the end of this e-mail.

I have considered the following:

(1) Various command-line options, none of which has any effect on this: --kill-on-bad-exit=0, --no-kill, --mpi=none, --overcommit, --oversubscribe, --wait=0.
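For concreteness, invocations along these lines (options shown together here only for brevity) produce the same truncated output as the plain srun above:

$ srun --kill-on-bad-exit=0 --no-kill --wait=0 ./foo.py && sleep 12 && cat /tmp/foo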

(2) srun --task-prolog=./foo.py true

Instead of killing foo.py’s child, this invocation waits for it to exit. Also, this seems to require a single executable rather than a command line.

One can work around the wait by putting the entire command in the background, but then subsequent sruns block until the child completes anyway (with the warning “Job step creation temporarily disabled, retrying”). --overcommit on the first, second, or both sruns has no effect.
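Concretely, the sequence looks roughly like this, with the second srun blocking and printing that warning until foo.py’s child has exited:

$ srun --task-prolog=./foo.py true &
$ srun hostname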

Recall that for real-world tasks, the child will run indefinitely waiting for work, so we can’t wait for it to finish.

(3) srun sh -c './foo.py && sleep 15': same behavior as item (2).

(4) Teach Slurm how to deal with the worker daemons somehow.

This doesn’t generalize. We want users to be able to bring whatever compute framework they want, without waiting for Slurm support, so they can innovate faster.

(5) Put the worker daemons in their own job. For example, one could start the Spark worker daemons in one job, with the Spark coordinator daemon and user work submission in a second one-node job.
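In terms of job submission, that would look something like the following (script names and node counts are placeholders):

$ sbatch -N 16 spark-workers.sh   # job 1: start the Spark worker daemons
$ sbatch -N 1 spark-driver.sh     # job 2: coordinator daemon and user work submission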

This doesn’t solve the general use case. For example, in the case of Spark, I have a large test suite in which starting and stopping a Spark cluster is only one of many tests. For FUSE, which depends on a worker daemon to implement filesystem operations, the mount exists to serve the rest of the job script.

(6) Change the software to not daemonize. For example, one can start Spark by invoking the .jar files directly, bypassing the daemonizing start script, or in newer versions by setting SPARK_NO_DAEMONIZE=1.
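With the latter approach the worker stays in the foreground, so srun keeps the step alive as long as the worker runs; roughly (master URL and paths are placeholders):

$ SPARK_NO_DAEMONIZE=1 srun $SPARK_HOME/sbin/start-slave.sh spark://master:7077 &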

This again doesn’t generalize. I need to be able to support imperfect scientific software as it arrives, without hacking or framework-specific workarounds.

(7) Don’t launch with srun. For example, pdsh can interpret Slurm environment variables and uses SSH to launch tasks on my allocated nodes.
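For example, something along these lines (exact modules and options depend on the local pdsh build):

$ pdsh -R ssh -w "$SLURM_JOB_NODELIST" ./foo.py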

This works, and it’s what I’m doing currently, but it doesn’t scale. One or two dozen SSH processes on the first node of my allocation are fine, but 1,000 or 10,000 are not. It’s also a kludge, since srun is specifically provided and optimized for launching tasks in a Slurm cluster.

My question: Is there any way I can convince Slurm to let a job step’s children keep running beyond the end of the step, and kill them at the end of the job if needed? Or, less preferably, to overlap job steps?

Much appreciated,
Reid


Appendix 1: foo.py

#!/usr/bin/env python3

# Try to find a way to run daemons under srun.

import os
import socket
import sys
import time

print("starting %s %d" % (socket.gethostname(), os.getpid()))

# One fork is enough to get killed by Slurm; the parent exits immediately.
if os.fork() > 0:
    sys.exit(0)

# Child: write 0 through 9 to /tmp/foo at one-second intervals, then "done".
fp = open("/tmp/foo", "w")
for i in range(10):
    fp.write("%d\n" % i)
    fp.flush()
    time.sleep(1)

fp.write("done\n")
fp.close()

Appendix 2: debug log showing that job step cleanup kills the worker daemon

slurmstepd: debug level = 6
slurmstepd: debug:  IO handler started pid=62147
slurmstepd: debug2: mpi/pmi2: _tree_listen_readable
slurmstepd: debug2: mpi/pmi2: _task_readable
slurmstepd: debug2: Using gid list sent by slurmd
slurmstepd: debug2: mpi/pmi2: _tree_listen_readable
slurmstepd: debug2: mpi/pmi2: _task_readable
slurmstepd: debug2: mpi/pmi2: _tree_listen_readable
slurmstepd: debug2: mpi/pmi2: _task_readable
slurmstepd: starting 1 tasks
slurmstepd: task 0 (62153) started 2018-08-27T11:03:33
slurmstepd: debug2: Using gid list sent by slurmd
slurmstepd: debug2: mpi/pmi2: _tree_listen_readable
slurmstepd: debug2: mpi/pmi2: _task_readable
slurmstepd: debug2: mpi/pmi2: _tree_listen_readable
slurmstepd: debug2: mpi/pmi2: _task_readable
slurmstepd: debug2: adding task 0 pid 62153 on node 0 to jobacct
slurmstepd: debug:  jobacct_gather_cgroup_cpuacct_attach_task: jobid 206670 stepid 62 taskid 0 max_task_id 0
slurmstepd: debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuacct/slurm' already exists
slurmstepd: debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuacct/slurm/uid_1001' already exists
slurmstepd: debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuacct/slurm/uid_1001/job_206670' already exists
slurmstepd: debug:  jobacct_gather_cgroup_memory_attach_task: jobid 206670 stepid 62 taskid 0 max_task_id 0
slurmstepd: debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm' already exists
slurmstepd: debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm/uid_1001' already exists
slurmstepd: debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm/uid_1001/job_206670' already exists
slurmstepd: debug2: jag_common_poll_data: 62153 mem size 0 290852 time 0.000000(0+0)
slurmstepd: debug2: _get_sys_interface_freq_line: filename = /sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_cur_freq
slurmstepd: debug2:  cpu 1 freq= 2101000
slurmstepd: debug:  jag_common_poll_data: Task average frequency = 2101000 pid 62153 mem size 0 290852 time 0.000000(0+0)
slurmstepd: debug2: energycounted = 0
slurmstepd: debug2: getjoules_task energy = 0
slurmstepd: debug:  Step 206670.62 memory used:0 limit:251658240 KB
slurmstepd: debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
slurmstepd: debug2: xcgroup_load: unable to get cgroup '/sys/fs/cgroup/cpuset' entry '/sys/fs/cgroup/cpuset/slurm/system' properties: No such file or directory
slurmstepd: debug2: xcgroup_load: unable to get cgroup '/sys/fs/cgroup/memory' entry '/sys/fs/cgroup/memory/slurm/system' properties: No such file or directory
slurmstepd: debug:  Sending launch resp rc=0
slurmstepd: debug:  mpi type = (null)
slurmstepd: debug:  [job 206670] attempting to run slurm task_prolog [/opt/slurm/task_prolog]
slurmstepd: debug:  Handling REQUEST_STEP_UID
slurmstepd: debug:  Handling REQUEST_SIGNAL_CONTAINER
slurmstepd: debug:  _handle_signal_container for step=206670.62 uid=0 signal=995
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_CPU no change in value: 18446744073709551615
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_FSIZE no change in value: 18446744073709551615
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_DATA no change in value: 18446744073709551615
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_STACK no change in value: 18446744073709551615
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_CORE no change in value: 16384
slurmstepd: debug2: _set_limit: RLIMIT_RSS    : max:inf cur:inf req:257698037760
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_RSS succeeded
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_NPROC no change in value: 8192
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_NOFILE no change in value: 65536
slurmstepd: debug:  Couldn't find SLURM_RLIMIT_MEMLOCK in environment
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_AS no change in value: 18446744073709551615
slurmstepd: debug2: Set task rss(245760 MB)
starting fg001.localdomain 62153
slurmstepd: debug:  Step 206670.62 memory used:0 limit:251658240 KB
slurmstepd: debug2: removing task 0 pid 62153 from jobacct
slurmstepd: task 0 (62153) exited with exit code 0.
slurmstepd: debug:  [job 206670] attempting to run slurm task_epilog [/opt/slurm/task_epilog]
slurmstepd: debug2: Using gid list sent by slurmd
slurmstepd: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuacct/slurm/uid_1001/job_206670/step_62/task_0): Device or resource busy
slurmstepd: debug2: jobacct_gather_cgroup_cpuacct_fini: failed to delete /sys/fs/cgroup/cpuacct/slurm/uid_1001/job_206670/step_62/task_0 Device or resource busy
slurmstepd: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuacct/slurm/uid_1001/job_206670/step_62): Device or resource busy
slurmstepd: debug2: jobacct_gather_cgroup_cpuacct_fini: failed to delete /sys/fs/cgroup/cpuacct Device or resource busy
slurmstepd: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuacct/slurm/uid_1001/job_206670): Device or resource busy
slurmstepd: debug2: jobacct_gather_cgroup_cpuacct_fini: failed to delete /sys/fs/cgroup/cpuacct/slurm/uid_1001/job_206670 Device or resource busy
slurmstepd: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuacct/slurm/uid_1001): Device or resource busy
slurmstepd: debug2: jobacct_gather_cgroup_cpuacct_fini: failed to delete /sys/fs/cgroup/cpuacct/slurm/uid_1001 Device or resource busy
slurmstepd: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm/uid_1001/job_206670/step_62/task_0): Device or resource busy
slurmstepd: debug2: jobacct_gather_cgroup_memory_fini: failed to delete /sys/fs/cgroup/memory/slurm/uid_1001/job_206670/step_62/task_0 Device or resource busy
slurmstepd: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm/uid_1001/job_206670/step_62): Device or resource busy
slurmstepd: debug2: jobacct_gather_cgroup_memory_fini: failed to delete /sys/fs/cgroup/memory/slurm/uid_1001/job_206670/step_62 Device or resource busy
slurmstepd: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm/uid_1001/job_206670): Device or resource busy
slurmstepd: debug2: jobacct_gather_cgroup_memory_fini: failed to delete /sys/fs/cgroup/memory/slurm/uid_1001/job_206670 Device or resource busy
slurmstepd: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm/uid_1001): Device or resource busy
slurmstepd: debug2: jobacct_gather_cgroup_memory_fini: failed to delete /sys/fs/cgroup/memory/slurm/uid_1001 Device or resource busy
slurmstepd: debug2: step_terminate_monitor will run for 60 secs
slurmstepd: debug2: killing process 62158 (inherited_task) with signal 9
slurmstepd: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/freezer/slurm/uid_1001/job_206670/step_62): Device or resource busy
slurmstepd: debug:  _slurm_cgroup_destroy: problem deleting step cgroup path /sys/fs/cgroup/freezer/slurm/uid_1001/job_206670/step_62: Device or resource busy
slurmstepd: debug2: killing process 62158 (inherited_task) with signal 9
slurmstepd: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/freezer/slurm/uid_1001/job_206670): Device or resource busy
slurmstepd: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/freezer/slurm/uid_1001): Device or resource busy
slurmstepd: debug:  step_terminate_monitor_stop signalling condition
slurmstepd: debug2: step_terminate_monitor is stopping
slurmstepd: debug2: Sending SIGKILL to pgid 62147
slurmstepd: debug:  Waiting for IO
slurmstepd: debug:  Closing debug channel
