[slurm-users] detectCores() mess

Mike Cammilleri mikec at stat.wisc.edu
Fri Dec 8 10:54:06 MST 2017


Hi,

We have allowed some courses to use our slurm cluster for teaching purposes, which of course leads to all kinds of exciting experiments - not always the most clever programming, but it certainly teaches me where we need to tighten up our configuration.

The default instinct for many students just starting out is to grab as much CPU as possible, without fully understanding cluster computing or batch scheduling. One example I see often is students using the R parallel package and calling detectCores(), which of course returns every core Linux reports on the node. They also don't specify --ntasks, so Slurm assigns 1 CPU - but there is no check on the ballooning number of R processes spawned from whatever they do with that core count. Now we have overloaded nodes.
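For illustration, the pattern usually looks something like this (a made-up sketch, not an actual student script):

    library(parallel)

    # detectCores() reports every core Linux sees on the node,
    # not the single CPU that Slurm actually allocated to the job
    n <- detectCores()

    # mclapply() then forks n workers, all landing on a node
    # where the job was only given 1 CPU
    results <- mclapply(1:1000, function(i) sum(rnorm(1e6)), mc.cores = n)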

I see that availableCores() is suggested as a friendlier method for shared resources like this, since it returns the number of cores that were actually assigned by Slurm. A student using the parallel package would then have to explicitly request the right number of cores in their submit file. This would be nice IF students voluntarily used availableCores() instead of detectCores(), but we know that's not really enforceable.
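Something along these lines is what I mean, assuming availableCores() from the future package (just sketching the idea, I haven't rolled this out):

    library(parallel)
    library(future)

    # As I understand it, availableCores() consults Slurm's environment
    # (e.g. SLURM_CPUS_PER_TASK) before falling back to detectCores(),
    # so it should match the job's actual allocation
    n <- availableCores()

    results <- mclapply(1:1000, function(i) sum(rnorm(1e6)), mc.cores = n)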

I thought cgroups (which we are using, constraining CPU and RAM) would prevent some of this behavior on the nodes - I'd like to avoid I/O wait if at all possible. I would like either Linux or Slurm to confine a job to the cores it was assigned at submit time. Is there something else I should be configuring to safeguard against this behavior? If Slurm assigns 1 CPU to the task, then no matter what craziness is in the code, 1 CPU is all they should get. Possible?
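For reference, the kind of constraint settings I have in mind are roughly the following (paraphrased, our exact slurm.conf/cgroup.conf may differ slightly):

    # slurm.conf
    TaskPlugin=task/affinity,task/cgroup
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core_Memory

    # cgroup.conf
    CgroupAutomount=yes
    ConstrainCores=yes
    ConstrainRAMSpace=yes

My understanding is that ConstrainCores=yes should pin the job's processes to the allocated cores, so the extra R workers would just time-slice on the one assigned CPU rather than spill onto the rest of the node - but I'd like to confirm that's the expected behavior and that nothing else is missing.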

Thanks for any insight!

--mike




