[slurm-users] Using "srun" on compute nodes -- Ray cluster

Reed Dier reed.dier at focusvq.com
Fri Jul 15 19:23:38 UTC 2022


I have some users that are using ray on slurm.
I will preface by saying we are new slurm users, so may not be doing everything exactly correct.

The only issue we’ve come across so far was somewhat ray specific.
Pardon my lack of specificity (the ray user I worked on this with is on vacation at the moment), but the gist is that an environment variable needed to be unset so that ray wouldn’t kneecap itself when it hit a cpuset corner case in cgroup fencing.

Specifically, in this workload the user spawns a “ray head,” and it is important to mention that this head may not have the same resources allocated to it as the “ray workers.”
TL;DR: the ray head would be given fewer cpus than the worker(s), and in some corner cases a spawned worker pid would inherit the head’s smaller cpuset through an environment variable passed along when the head spawned the workers via srun.
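
For context, the worker side looks roughly like the loop in Ray’s slurm template, with each worker started as its own job step via srun from inside the allocation. Below is a minimal sketch of that pattern and of where we scrub the environment before the workers start; the variable names ($worker_nodes, $head_ip, $head_port) and the placeholder SOME_INHERITED_CPUSET_VAR are mine for illustration, not the real one (see below):

> # sketch only -- inside the batch script, after "ray start --head" is
> # running on the head node; $worker_nodes / $head_ip / $head_port are
> # assumed to have been set earlier in the script
> for node in $worker_nodes; do
>     # each worker becomes its own job step; env -u strips the offending
>     # variable (placeholder name, real one TBD) before ray inherits it
>     srun --nodes=1 --ntasks=1 -w "$node" \
>         env -u SOME_INHERITED_CPUSET_VAR \
>         ray start --address="$head_ip:$head_port" --block &
> done
> wait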

The user noticed that some workers could reach 100% utilization of their allocated cpu resources, while other workers running identical workloads ended up at partial usage, which we traced to the cpuset being inherited in a way we didn’t intend.
I’ll follow up with the exact environment variable we had to unset when that user is back.
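
For what it’s worth, the symptom is also visible from inside a running step, without going to the cgroup tree on the node. Something along these lines, run in a worker’s shell, compares what slurm allocated on the node with the shell’s actual affinity:

> # run from inside the suspect job step
> echo "Slurm allocated on this node: $SLURM_CPUS_ON_NODE cpu(s)"
> echo -n "Kernel Cpus_allowed_list for this shell: "
> grep Cpus_allowed_list /proc/self/status | awk '{print $2}'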

Here is my quick and dirty bash script that shows the cpus allocated to each job’s cgroup and the cpus allowed for each pid inside that cgroup; the two should match, but didn’t always, which is how we found the problem.
Just pass the uid of the user submitting the jobs (invocation example after the script).

> #!/bin/bash
> # note: UID is a read-only variable in bash, so take the argument under
> # a different name
> JOB_UID=$1
> CGROOT=/sys/fs/cgroup/cpuset/slurm/uid_${JOB_UID}
> 
> for JOB in $(ls "$CGROOT" | grep '^job_' | awk -F'_' '{print $2}')
>     do
>         echo "Slurm JobID: $JOB"
>         echo -n "Cgroup CPU set: "
>         cat "$CGROOT/job_$JOB/cpuset.cpus"
> 
>         # every pid in step_0 should report the same Cpus_allowed_list
>         for PID in $(cat "$CGROOT/job_$JOB/step_0/cgroup.procs")
>             do
>                 echo -n "CPUs allocated for PID $PID: "
>                 grep Cpus_allowed_list "/proc/$PID/status" | awk '{print $2}'
>             done
>         echo ""
>     done

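Invocation is just the script plus the numeric uid, run as root on each compute node (root so other users’ /proc/<pid>/status entries are readable). The file name and “someuser” below are placeholders:

> # run on every compute node, e.g. via pdsh/clush
> sudo ./check_cpusets.sh "$(id -u someuser)"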

> slurmd3:
>     Slurm Job: 409
>     Cgroup CPU set: 0-7
>     CPUs allocated for PID 7907: 0-7
>     CPUs allocated for PID 7912: 0-3
>     CPUs allocated for PID 7931: 0-3
> slurmd1:
>     Slurm Job: 406
>     Cgroup CPU set: 0-3
>     CPUs allocated for PID 7409: 0-3
>     CPUs allocated for PID 7414: 0-3
>     CPUs allocated for PID 7425: 0-3
> slurmd2:
>     Slurm Job: 408
>     Cgroup CPU set: 0-7
>     CPUs allocated for PID 7491: 0-7
>     CPUs allocated for PID 7496: 0-3
>     CPUs allocated for PID 7515: 0-3

Otherwise I’ve not had issues with users spawning jobs from within jobs, but I’m not a seasoned slurm admin, so that may not hold up for everyone.

Reed

> On Jul 15, 2022, at 4:17 AM, Kamil Wilczek <kmwil at mimuw.edu.pl> wrote:
> 
> Dear Slurm Users,
> 
> one of my cluster users would like to run a Ray cluster on Slurm.
> I noticed that the batch script example requires running the "srun"
> command on a compute node that is already allocated:
> https://docs.ray.io/en/latest/cluster/examples/slurm-template.html#slurm-template
> 
> This is the first time I have seen or heard of this type of usage,
> and I am having trouble wrapping my head around it.
> Is there anything wrong or unusual about this? I understand that
> this would allocate some resources on other nodes. Would
> Slurm enforce limits properly ("qos" or "partition" limits)?
> 
> Kind Regards
> -- 
> Kamil Wilczek  [https://keys.openpgp.org/]
> [D415917E84B8DA5A60E853B6E676ED061316B69B]


