G'day all,
I've been waiting for nodes to become idle before upgrading them; however, some jobs take a long time. If I try to remove all the packages, I assume that kills the slurmstepd program and with it the job.
Sid
On 8/1/24 02:02, Sid Young via slurm-users wrote:
> I've been waiting for node to become idle before upgrading them however some jobs take a long time. If I try to remove all the packages I assume that kills the slurmstep program and with it the job.
Can you be more specific about what you mean by "upgrade"? Which Slurm version are you running? Why would you want to remove all the packages?
For slurmd and slurmstepd the quick and usually OK procedure would be to simply update the RPMs while jobs are running!
There is also a safer procedure where the nodes are first drained before upgrading slurmd, see the Wiki page https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrade-slurmd...
IHTH, Ole
Dear sender,
Thank you for your message. I am currently unavailable. E-mails will not be forwarded automatically during my absence.
For questions about the NHR@KIT projects, please write to the list project-review@nhr.kit.edu. For matters concerning scientific software, please write to scitech-software@scc.kit.edu
For technical questions or problems, please use either the NHR support portal [1] or the bwSupportPortal [2], depending on the HPC system or service.
[1] https://support.nhr.kit.edu
[2] https://bw-support.scc.kit.edu/
Kind regards, Brigitte May
Generally speaking, as a best practice I’d perform such things with no jobs running, but some upgrades can be done while jobs are running. Upgrading a package, even one which is currently in use by a running job, does not necessarily kill the job. For example, upgrading a shared library won’t kill existing tasks, since they already have an open file handle on the old library version, so they will continue to use it. New processes will pick up the replacement version. Obviously that has some risks, depending on what the job is, especially if the behaviour is different and this isn’t just a bug-fix release.
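The open-file-handle behaviour described above can be illustrated with a minimal sketch. It assumes a POSIX system (Linux) and assumes the package manager installs the new file by writing it under a temporary name and renaming it into place, which is the usual approach; the file name libexample.so and its contents are made up for illustration, and a plain text file stands in for a real shared library.

```python
import os
import tempfile


def demo_replace_while_open():
    """Return (content seen via the old handle, content seen via a new open)."""
    with tempfile.TemporaryDirectory() as d:
        # Hypothetical file standing in for an installed shared library.
        path = os.path.join(d, "libexample.so")
        with open(path, "w") as f:
            f.write("version 1")

        # Stands in for a running job that already has the library open.
        old_handle = open(path)

        # Assumed upgrade behaviour: write the new file beside the old one,
        # then rename it over the old name. The old inode is not modified.
        new_path = path + ".new"
        with open(new_path, "w") as f:
            f.write("version 2")
        os.replace(new_path, path)

        try:
            seen_by_old = old_handle.read()  # still reads the old inode
        finally:
            old_handle.close()

        # A process that opens the path after the upgrade gets the new file.
        with open(path) as f:
            seen_by_new = f.read()

        return seen_by_old, seen_by_new


print(demo_replace_while_open())
```

On Linux the old handle still returns "version 1" while a fresh open returns "version 2". The same reasoning is why a running task keeps using the library version it started with.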
I’ve certainly applied some security patches on live systems in the past, for example upgrading openssh. You need to take a risk-based approach to it. The lowest-risk approach is to submit an exclusive job as root to drain the node, run the update and then reboot it. But you might be waiting a long time, which is unacceptable for high-severity security patches. The higher-risk approach is to use some other mechanism to run the update anyway: ansible, dsh, whatever your process is.
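As a rough sketch of the low-risk route, the snippet below builds the Slurm commands for draining a node, checking whether it still has jobs, and returning it to service. Note that it swaps in an explicit scontrol drain rather than the exclusive root job described above, and it only prints the commands instead of running them. The node name node001 and the reason string are hypothetical, and the update and reboot step is site-specific, so it is left out.

```python
import shlex


def drain_cmd(node, reason):
    # DRAIN lets running jobs finish but stops new jobs from starting.
    return ["scontrol", "update", f"NodeName={node}",
            "State=DRAIN", f"Reason={reason}"]


def running_jobs_cmd(node):
    # squeue: -h suppresses the header, -w selects the node,
    # -o %i prints only the job ID of each job on that node.
    return ["squeue", "-h", "-w", node, "-o", "%i"]


def resume_cmd(node):
    # Return the node to service once the update is finished.
    return ["scontrol", "update", f"NodeName={node}", "State=RESUME"]


def node_is_empty(squeue_output):
    """True when squeue printed no job IDs for the node."""
    return squeue_output.strip() == ""


node = "node001"  # hypothetical node name
for cmd in (drain_cmd(node, "os-update"),
            running_jobs_cmd(node),
            resume_cmd(node)):
    print(shlex.join(cmd))
```

The intended order is: drain, poll the squeue command until node_is_empty() is true for its output, run your own update and reboot, then resume.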
Can you cope with the users turning up at your desk with flaming torches and pitchforks if it goes wrong? 😊
Regards,
Tim
--
Tim Cutts
Scientific Computing Platform Lead
AstraZeneca
Find out more about R&D IT Data, Analytics & AI and how we can support you by visiting our Service Catalogue https://azcollaboration.sharepoint.com/sites/CMU993
From: Sid Young via slurm-users slurm-users@lists.schedmd.com
Date: Thursday, 1 August 2024 at 1:04 AM
To: Slurm User Community List slurm-users@lists.schedmd.com
Subject: [slurm-users] Upgrade node while jobs running
> G'day all,
> I've been waiting for node to become idle before upgrading them however some jobs take a long time. If I try to remove all the packages I assume that kills the slurmstep program and with it the job.
> Sid
AstraZeneca UK Limited is a company incorporated in England and Wales with registered number:03674842 and its registered office at 1 Francis Crick Avenue, Cambridge Biomedical Campus, Cambridge, CB2 0AA.
This e-mail and its attachments are intended for the above named recipient only and may contain confidential and privileged information. If they have come to you in error, you must not copy or show them to anyone; instead, please reply to this e-mail, highlighting the error to the sender and then immediately delete the message. For information about how AstraZeneca UK Limited and its affiliates may process information, personal data and monitor communications, please see our privacy notice at www.astrazeneca.com
Thanks Tim, that fits with my observations. I will be back on it on the 13th and will see what effect upgrading the required RPMs has.
Sid
On Sat, 3 Aug 2024, 01:41 Cutts, Tim, tim.cutts@astrazeneca.com wrote:
> Generally speaking, as a best practice I’d perform such things with no jobs running, but some upgrades can be done while jobs are running. Upgrading a package, even one which is currently in use by a running job, does not necessarily kill the job. For example, upgrading a shared library won’t kill existing tasks, since they already have an open file handle on the old library version, so they will continue to use it. New processes will pick up the replacement version. Obviously that has some risks, depending on what the job is, especially if the behaviour is different and this isn’t just a bug-fix release.
> I’ve certainly applied some security patches on live systems in the past, for example upgrading openssh. You need to take a risk-based approach to it. The lowest-risk approach is to submit an exclusive job as root to drain the node, run the update and then reboot it. But you might be waiting a long time, which is unacceptable for high-severity security patches. The higher-risk approach is to use some other mechanism to run the update anyway: ansible, dsh, whatever your process is.
> Can you cope with the users turning up at your desk with flaming torches and pitchforks if it goes wrong? 😊
> Regards,
> Tim
G'day Sid,
On 7/31/24 5:02 pm, Sid Young via slurm-users wrote:
> I've been waiting for node to become idle before upgrading them however some jobs take a long time. If I try to remove all the packages I assume that kills the slurmstep program and with it the job.
Are you looking to do a Slurm upgrade, an OS upgrade, or both?
All the best, Chris