On a single Rocky 8 workstation with one GPU, we wanted interactive ssh logins to get a small portion of its resources (shell, compiling, simple data manipulation, console desktop, etc.) and the rest to go to SLURM. We did this:
- Set it to use cgroupv2
  * Modify /etc/default/grub to add systemd.unified_cgroup_hierarchy=1 to
    GRUB_CMDLINE_LINUX. Remake the grub config with grub2-mkconfig
  * Create file /usr/etc/cgroup_cpuset_init (and make it executable) with the lines

      #!/bin/bash
      echo "+cpuset" >> /sys/fs/cgroup/cgroup.subtree_control
      echo "+cpuset" >> /sys/fs/cgroup/system.slice/cgroup.subtree_control

  * Modify/create /etc/systemd/system/slurmd.service.d/override.conf so it has:

      [Service]
      ExecStartPre=-/usr/etc/cgroup_cpuset_init
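
One way to confirm the init script took effect, once slurmd has been started, is to look for cpuset in the two subtree_control files it writes to. The sketch below is only a suggestion; the function names has_controller and check_cpuset are hypothetical and not part of any tool.

```shell
#!/bin/bash
# Hypothetical helper: succeed if controller $2 is listed in the
# cgroup.subtree_control file $1 (a space-separated list of names).
has_controller() {
    local file=$1 want=$2 name
    [ -r "$file" ] || return 1
    for name in $(cat "$file"); do
        [ "$name" = "$want" ] && return 0
    done
    return 1
}

# Report on the two files the init script appends "+cpuset" to.
check_cpuset() {
    local f
    for f in /sys/fs/cgroup/cgroup.subtree_control \
             /sys/fs/cgroup/system.slice/cgroup.subtree_control; do
        if has_controller "$f" cpuset; then
            echo "cpuset enabled: $f"
        else
            echo "cpuset MISSING: $f"
        fi
    done
}
```

Running check_cpuset after "systemctl restart slurmd" should report both files as enabled. Because the ExecStartPre line starts with "-", a failing init script does not stop slurmd, so a check like this is the easiest way to notice it.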
- Figure out the exact cores to use for "free user" use and the cores for SLURM.
  Also use GPU sharding in SLURM so the GPU can be shared.
  * Install hwloc-ls
  * Run 'hwloc-ls' to translate physical cores 0-9 to logical cores.
    For me P 0-9 was Logical 0,2,4,6,8,10,12,14,16,18
  * In /etc/slurm.conf the NodeName definition has

      CPUs=128 Boards=1 SocketsPerBoard=1 CoresPerSocket=64 ThreadsPerCore=2 \
      RealMemory=257267 MemSpecLimit=20480 \
      CpuSpecList=0,2,4,6,8,10,12,14,16,18 \
      TmpDisk=6000000 Gres=gpu:nvidia_a2:1,shard:nvidia_a2:32

    reserving those 10 cores and 20GB of RAM for "free user"
  * gres.conf has the lines:

      AutoDetect=nvml
      Name=shard Count=32
  * Need to add shard to GresTypes= too (e.g. GresTypes=gpu,shard). Job
    submissions use the option --gres=shard:N where N is less than 32
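
To illustrate the job submission side, here is a small sketch of a wrapper that builds the --gres option and rejects obviously bad values. The names shard_opt and TOTAL_SHARDS are made up for this example, and treating the Count=32 from gres.conf as an inclusive upper bound is my assumption.

```shell
#!/bin/bash
# Total shards configured in gres.conf (Count=32 above).
TOTAL_SHARDS=32

# Hypothetical helper: print the --gres option for N shards, or fail
# if N is not a whole number between 1 and TOTAL_SHARDS.
shard_opt() {
    local n=$1
    case $n in
        ''|*[!0-9]*)
            echo "shard count must be a positive integer" >&2
            return 1 ;;
    esac
    if [ "$n" -lt 1 ] || [ "$n" -gt "$TOTAL_SHARDS" ]; then
        echo "shard count must be between 1 and $TOTAL_SHARDS" >&2
        return 1
    fi
    echo "--gres=shard:$n"
}
```

A job asking for a quarter of the GPU could then be submitted with something like "sbatch $(shard_opt 8) job.sh", where job.sh stands in for your own batch script.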
- Set up systemd to restrict "free users" to cores 0-9 and the 20GB
  * Run:

      systemctl set-property user.slice MemoryHigh=20480M

  * Run for every individual user on the system

      systemctl set-property user-$uid.slice AllowedCPUs=0-9

    where $uid is that user's user ID. We do this in a script that also runs
    sacctmgr to add them to the SLURM system

  I could not just set this one for user.slice itself, which is what I first tried, because it then restricted the root user too and that caused weird behavior with a lot of system tools. So far the root/daemon processes work fine within the 20GB limit though, so that MemoryHigh=20480M is one and done
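
The per-user step could be scripted roughly as below. This is not our actual script (which also calls sacctmgr); it is a minimal sketch that assumes regular login accounts have UIDs from 1000 to 59999, and it only prints the commands unless DRY_RUN is set to 0. The function names and the DRY_RUN, UID_MIN and UID_MAX variables are made up for illustration.

```shell
#!/bin/bash
# Cores left to interactive users (the AllowedCPUs value used above).
FREE_CPUS="0-9"
# Assumption: regular login accounts fall in this UID range.
UID_MIN=1000
UID_MAX=59999

# Read passwd-format lines on stdin, print the UIDs in the assumed range.
login_uids() {
    awk -F: -v lo="$UID_MIN" -v hi="$UID_MAX" \
        '$3 >= lo && $3 <= hi { print $3 }'
}

# Print the systemctl command for one UID.
restrict_cmd() {
    echo "systemctl set-property user-$1.slice AllowedCPUs=$FREE_CPUS"
}

# Print (DRY_RUN unset or 1) or run (DRY_RUN=0) the command for every
# login UID known to the system.
restrict_all() {
    local uid
    getent passwd | login_uids | while read -r uid; do
        if [ "${DRY_RUN:-1}" = "0" ]; then
            systemctl set-property "user-$uid.slice" AllowedCPUs="$FREE_CPUS"
        else
            restrict_cmd "$uid"
        fi
    done
}
```

Calling restrict_all with no settings just lists what would be run; "DRY_RUN=0 restrict_all" as root applies it. Root (UID 0) and system accounts fall outside the range, which avoids the problem described above with restricting root.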
Then reboot.
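
After the reboot it is worth checking that the machine really came up with the unified hierarchy. The sketch below classifies the filesystem type that "stat -fc %T /sys/fs/cgroup" reports; the function names are hypothetical, and the labels it prints are just ones I picked.

```shell
#!/bin/bash
# Map the filesystem type reported for /sys/fs/cgroup to a cgroup mode.
# cgroup2fs means the unified (v2) hierarchy; tmpfs is what a v1 or
# hybrid layout shows at that mount point.
cgroup_mode() {
    case $1 in
        cgroup2fs) echo "v2" ;;
        tmpfs)     echo "v1-or-hybrid" ;;
        *)         echo "unknown" ;;
    esac
}

# Print the mode for the running system.
current_cgroup_mode() {
    cgroup_mode "$(stat -fc %T /sys/fs/cgroup 2>/dev/null)"
}
```

If current_cgroup_mode prints anything other than v2, the kernel command line change probably did not make it into the grub config. The per-user limit can be checked the same way with "systemctl show -p AllowedCPUs user-$uid.slice".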
-- Paul Raines (http://help.nmr.mgh.harvard.edu)