Hi, Christopher. Thank you for your reply.
I have already modified Slurm's cgroup.conf configuration file as follows:
vim /etc/slurm/cgroup.conf
#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
Then I edited slurm.conf:
vim /etc/slurm/slurm.conf
PrologFlags=CONTAIN
TaskPlugin=task/cgroup
ProctrackType=proctrack/cgroup
I restarted both the slurmctld service on the head node and the slurmd service on the compute nodes.
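For reference, assuming the standard systemd units, the restarts were done roughly like this (adjust if your site uses a different init system):

systemctl restart slurmctld    # on the head node
systemctl restart slurmd       # on each compute node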
I also set resource limits for the user:
[root@head1 ~]# sacctmgr show assoc format=cluster,account%35,user%35,partition,maxtres%35,GrpCPUs,GrpMem
Cluster   Account   User   Partition   MaxTRES   GrpCPUs   GrpMem
-------   -------   ----   ---------   -------   -------   ------
cluster   lyz
cluster   lyz       lyz                gpus=2    80
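For completeness, the limits were set with a command along these lines; treat it as a sketch rather than the exact command, since the TRES name (gpus vs. gres/gpu) and GrpCPUs vs. GrpTRES=cpu= depend on the Slurm version:

sacctmgr modify user where name=lyz set MaxTRES=gres/gpu=2 GrpCPUs=80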
However, when I specify CUDA device numbers in my .py script, for example:
import os
import time

# Set the visible GPUs before CUDA is initialized, overriding whatever
# the scheduler exported for this job
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

import torch


def test_gpu():
    if torch.cuda.is_available():
        torch.cuda.set_device(4)
        print("CUDA is available. PyTorch can use GPU.")
        num_gpus = torch.cuda.device_count()
        print(f"Number of GPUs available: {num_gpus}")
        current_device = torch.cuda.current_device()
        print(f"Current GPU device: {current_device}")
        device_name = torch.cuda.get_device_name(current_device)
        print(f"Name of the current GPU device: {device_name}")
        x = torch.rand(5, 5).cuda()
        print("Random tensor on GPU:")
        print(x)
    else:
        print("CUDA is not available. PyTorch will use CPU.")
    # Keep the process alive so GPU usage can be checked with nvidia-smi
    time.sleep(1000)


if __name__ == "__main__":
    test_gpu()
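For context, the script is submitted through Slurm roughly like this (the actual partition and GPU options in my job may differ, and test_gpu.py is just a placeholder name):

srun --gres=gpu:1 python test_gpu.py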
When I run this script, it still bypasses the resource restrictions enforced by the cgroup settings above: the job is not limited to the GPUs it was allocated.
Are there any other ways to solve this problem?