[slurm-users] Specify a gpu ID

Fuzzy Rogers fuz at ucsb.edu
Fri Jun 4 18:43:13 UTC 2021


My only thought here, and it is a little off-kilter, would be to get a do-nothing job assigned to the failing GPU for 100,000 hours. It might take a bit of work, and some to and fro, but once you “fake occupy” the failing GPU, every other job will maneuver around it.

Again, it’s not a great solution, but I think it would work.
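
For the to-and-fro part, here is a minimal Python sketch of the bookkeeping, not a tested recipe. It assumes you have submitted several one-GPU placeholder jobs to the node and captured the output of `scontrol -d show job <jobid>` for each one, and that this output contains a fragment like `GRES=gpu:1(IDX:2)`. The exact layout differs between Slurm versions, so check what yours prints. The function names are made up for illustration.

```python
import re

# Assumption: the detailed job record contains something like
# "GRES=gpu:1(IDX:2)". Adjust the pattern to match your Slurm version.
IDX_PATTERN = re.compile(r"IDX:(\d+)")


def assigned_gpu_index(scontrol_text):
    """Return the first GPU index found in the job record, or None."""
    match = IDX_PATTERN.search(scontrol_text)
    return int(match.group(1)) if match else None


def split_placeholders(job_details, failing_gpu):
    """Split placeholder jobs into (keep, cancel) lists of job IDs.

    job_details maps job ID -> text of that job's detailed record.
    Only the first job found on the failing GPU is kept.
    """
    keep, cancel = [], []
    for job_id, text in sorted(job_details.items()):
        if assigned_gpu_index(text) == failing_gpu and not keep:
            keep.append(job_id)
        else:
            cancel.append(job_id)
    return keep, cancel
```

The job in the keep list stays parked on the bad GPU, and everything in the cancel list can be handed to scancel. If no placeholder landed on the failing GPU, the keep list comes back empty and you try another round.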

Take care,

Fuzzy Rogers
(he, his)
Research Computing Administrator
Materials Research Laboratory
Santa Barbara, CA  93106-5121




> On Jun 4, 2021, at 11:35 AM, Jason Simms <simmsj at lafayette.edu> wrote:
> 
> You don't need to chide me for offering what is, to me, a reasonable solution. *You* may not be able to make hardware changes, but leaving failing GPUs in a system is anathema to my approach to cluster management, and I don't see why the people who can make those changes would want them there. In other words, I do not recommend looking for a workaround to a problem that, in my opinion, is best solved by eliminating the faulty hardware. I understand the impulse, and if there is a simple way to specify a particular GPU, then fine, do that. But again, it goes against treating such resources as generic: nodes and hardware should be thought of as cattle, not pets, and managed accordingly. Again, I believe you are trying to solve a problem that should not be yours to solve. Sorry if this irritates you.
> 
> JLS
> 
> On Fri, Jun 4, 2021 at 2:17 PM Ahmad Khalifa <underoath006 at gmail.com> wrote:
> I can't make hardware changes, but I still want to make use of the cluster. Let's keep the discussion on how to get slurm to do it, if that's possible. 
> 
> On Fri, Jun 4, 2021 at 11:13 AM Jason Simms <simmsj at lafayette.edu> wrote:
> Unpopular opinion: remove the failing GPU.
> 
> JLS
> 
> On Fri, Jun 4, 2021 at 2:07 PM Ahmad Khalifa <underoath006 at gmail.com> wrote:
> Because there are failing GPUs that I'm trying to avoid. 
> 
> On Fri, Jun 4, 2021 at 5:04 AM Stephan Roth <stephan.roth at ee.ethz.ch> wrote:
> On 03.06.21 07:11, Ahmad Khalifa wrote:
> > How to send a job to a particular gpu card using its ID (0,1,2...etc)?
> 
> Why do you need to access a GPU based on its ID?
> 
> If it's to select a certain GPU type, there are other methods you can use.
> 
> You could create partitions for the same GPU types or add features.
> Because our heterogeneous nodes have mixed GPU types, we do the latter: we
> added a feature for the GPU architecture and one for the GPU type to
> each node.
> 
> Cheers,
> Stephan
> 
> 
> 
> -- 
> Jason L. Simms, Ph.D., M.P.H.
> Manager of Research and High-Performance Computing
> XSEDE Campus Champion
> Lafayette College
> Information Technology Services
> 710 Sullivan Rd | Easton, PA 18042
> Office: 112 Skillman Library
> p: (610) 330-5632
> 
