We have an 8-GPU server on which one GPU has gone into an error state that will require a reboot to clear. I have jobs running on the good GPUs of that server that will take another 3 days to complete. In the meantime, I would like short jobs to run on the good GPUs that are free until I reboot.
I set a reservation on the whole node for the time window in which I plan to reboot, with:
scontrol create reservation reservationName=rtx-01_reboot users=root starttime=2024-11-25T06:00:00 duration=720 Nodes=rtx-01 flags=maint,ignore_jobs
But I would also like to set a reservation on just the bad GPU (gpu_id=7) from now until 2024-11-25T06:00:00, so that no job that would use it gets scheduled.
Is that possible?
---------------------------------------------------------------
Paul Raines                     http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street     Charlestown, MA 02129	    USA