Hi, I apologise if I’ve failed to find this in the documentation (and am happy to be told to RTFM) but a recent issue for one of my users resulted in a question I couldn’t answer.
LSF has a feature called a Pre-Exec where a script executes to check whether a node is ready to run a task. So, you can run arbitrary checks and go back to the queue if they fail.
For example, if I have some automounted filesystems and want to check for failure of the automounter, in an LSF world I can do:
bsub -E "test -f /nfs/someplace/file_I_know_exists" my_job.sh
What’s the equivalent in SLURM?
Thanks,
Tim
-- Tim Cutts Scientific Computing Platform Lead AstraZeneca
Find out more about R&D IT Data, Analytics & AI and how we can support you by visiting our Service Catalogue: https://azcollaboration.sharepoint.com/sites/CMU993
________________________________
AstraZeneca UK Limited is a company incorporated in England and Wales with registered number:03674842 and its registered office at 1 Francis Crick Avenue, Cambridge Biomedical Campus, Cambridge, CB2 0AA.
This e-mail and its attachments are intended for the above named recipient only and may contain confidential and privileged information. If they have come to you in error, you must not copy or show them to anyone; instead, please reply to this e-mail, highlighting the error to the sender and then immediately delete the message. For information about how AstraZeneca UK Limited and its affiliates may process information, personal data and monitor communications, please see our privacy notice at https://www.astrazeneca.com
You probably want the Prolog option: https://slurm.schedmd.com/slurm.conf.html#OPT_Prolog along with the ForceRequeueOnFail flag of PrologFlags: https://slurm.schedmd.com/slurm.conf.html#OPT_ForceRequeueOnFail
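As a rough illustration only, a Prolog that mirrors the LSF check from the question might look like the sketch below. The script location and the SENTINEL variable are assumptions made for the example; the slurm.conf lines in the comments use the two options linked above, and the exact behaviour on failure depends on site configuration.

```shell
#!/bin/sh
# Hypothetical Prolog sketch (e.g. installed as /etc/slurm/prolog.sh).
# Assumed slurm.conf entries:
#   Prolog=/etc/slurm/prolog.sh
#   PrologFlags=ForceRequeueOnFail

# Sentinel path from the original question; override per site.
SENTINEL="${SENTINEL:-/nfs/someplace/file_I_know_exists}"

check_sentinel() {
    # Accessing the path also prompts the automounter to mount it.
    if [ -f "$1" ]; then
        return 0
    fi
    echo "prolog: sentinel $1 not found" >&2
    return 1
}

# Only act when Slurm is running this as a Prolog (SLURM_JOB_ID is set).
if [ -n "${SLURM_JOB_ID:-}" ]; then
    check_sentinel "$SENTINEL" || exit 1
fi
```

Unlike bsub -E, this check is chosen by the administrator and applies to every job on the node, not to one submission.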
-Paul Edmon-
(In reply to Cutts, Tim via slurm-users, 2/14/2024 8:38 AM)
The Prolog runs with every job, not just when the user asks for it. It also runs as root or the slurm user, not as the user who submitted the job. To run as the submitting user one would use TaskProlog, but at that point I think there is no way to abort or requeue the job.
The Prolog script could check for an environment variable set by the user, such as SLURM_USER_PROLOG, and if it exists, use 'su' to run the named script as the submitting user. If that script returns a non-zero value, the Prolog would exit and return that value. Even with 'su' there are security issues one has to think through here.
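A minimal sketch of that idea, with several assumptions stated up front: SLURM_USER_PROLOG is the hypothetical variable named above and may not be visible to the Prolog on a stock installation; run_as_user and user_prolog are made-up helper names; and running a user-supplied command string from a root Prolog is exactly the kind of security exposure that needs careful review.

```shell
#!/bin/sh
# Sketch only: opt-in per-job check, run as the submitting user.

run_as_user() {
    # $1 = user name, $2 = command string
    if [ "$(id -un)" = "$1" ]; then
        # Already that user (e.g. when trying the script by hand).
        sh -c "$2"
    else
        # Requires root, which is how a Prolog normally runs.
        su -s /bin/sh "$1" -c "$2"
    fi
}

user_prolog() {
    # $1 = user name, $2 = check command (empty means no check requested)
    [ -n "$2" ] || return 0
    run_as_user "$1" "$2"
}

# Only act when Slurm is running this as a Prolog (SLURM_JOB_ID is set).
# SLURM_USER_PROLOG is hypothetical and assumed to reach this environment.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    user_prolog "$SLURM_JOB_USER" "${SLURM_USER_PROLOG:-}"
    exit $?
fi
```

The exit status of the user's check is passed straight through, so a non-zero result makes the Prolog itself fail.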
The requeuing is a bit tricky. I would not necessarily set ForceRequeueOnFail, as some Prolog scripts probably really want certain jobs just cancelled. Also, a failing Prolog puts the node in a drain state, which is not necessarily what an admin wants when it is a user's prolog script that fails.
Not sure there is any good way to do this with safe requeuing.
-- Paul Raines (http://help.nmr.mgh.harvard.edu)
(In reply to Paul Edmon via slurm-users, Wed, 14 Feb 2024 9:32am)
The information in this e-mail is intended only for the person to whom it is addressed. If you believe this e-mail was sent to you in error and the e-mail contains patient information, please contact the Mass General Brigham Compliance HelpLine at https://www.massgeneralbrigham.org/complianceline. Please note that this e-mail is not secure (encrypted). If you do not wish to continue communication over unencrypted e-mail, please notify the sender of this message immediately. Continuing to send or respond to e-mail after receiving this message means you understand and accept this risk and wish to continue to communicate over unencrypted e-mail.