FAQ 21 describes how to place a job on hold but ignores the scontrol subcommands uhold/hold/release/requeuehold, and it never explains why the method it describes is the best one; it just states that it is. Does anyone know why the FAQ method is better than using the subcommands? Is it because the PRIORITY and/or NICE values are not altered (maybe)? The question also asks about "running", but the answer only covers preventing a job from starting, not suspending one that is already running, so it is not as clear as it could be. I think "running" should be "starting", and/or the FAQ should describe how to suspend as well.
If nobody has a clear answer, I might file this in the Slurm bugzilla as a documentation change request, but first I wanted to see whether this is already clear to someone and I am just missing something.
From FAQ:
21. How can I temporarily prevent a job from running (e.g. place it into a hold state)?
The easiest way to do this is to change a job's earliest begin time (optionally set at job submit time using the --begin option). The example below places a job into hold state (preventing its initiation for 30 days) and later permitting it to start now.
<METHOD I>
    $ scontrol update JobId=1234 StartTime=now+30days
    ... later ...
    $ scontrol update JobId=1234 StartTime=now
Note: Empirically, the JobId in METHOD I can be a <job_list>; I had initially thought it required a single job ID.
No explanation is given for why METHOD I is best, and there are other methods that seem more intuitive. I wonder what is undesirable about the following method, which I have been using: the scontrol(1) subcommands hold/uhold/release/requeuehold.
<METHOD II>
    $ scontrol hold <job_list>      # advantage to administrator as user cannot change
    $ scontrol uhold <job_list>
    $ scontrol release <job_list>
Examples:
    $ scontrol uhold jobname=JOB_NAME
    $ scontrol uhold '[100-200],300,500'
With uhold, the "Reason" changes to something that easily identifies the job as being held: "Reason=None" became "Reason=JobHeldUser", which seems better than METHOD I in that regard.
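As a rough sketch of why the Reason field is convenient, held jobs can be picked out of a job listing with a one-line filter. This assumes input lines of the form "<jobid> <reason>", which I would expect from something like `squeue -h -t PD -o "%i %r"`. The helper name `held_jobs` and the job IDs below are made up for illustration, and the sample input stands in for real squeue output.

```shell
#!/bin/sh
# Print the IDs of held jobs.
# Expects lines of "<jobid> <reason>" on stdin, as produced by e.g.
#   squeue -h -t PD -o "%i %r"
held_jobs() {
    awk '$2 == "JobHeldUser" || $2 == "JobHeldAdmin" { print $1 }'
}

# Illustrative input with made-up job IDs, standing in for real squeue output.
printf '%s\n' '100 Priority' '101 JobHeldUser' '102 BeginTime' '103 JobHeldAdmin' | held_jobs
```

A job delayed with METHOD I would presumably show up with a reason such as BeginTime, which does not distinguish it from a job that was submitted with --begin on purpose.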
The downside of METHOD II might be that PRIORITY was changed to zero and then went to a very large value when the job was released?
Another method: setting PRIORITY to zero also appears to place a job on hold.
<METHOD III>
    $ scontrol update jobid=373 Priority=0
    $ scontrol release jobid=373                  # sets to a very high value
    $ scontrol update jobid=373 Priority=11111    # put back to lower desired value
Once PRIORITY has been lowered, is there an optional setting that prevents a user from raising it again? The manual says:
Only the Slurm administrator or root can increase job's priority.
At least on my machine, "release" puts the priority at a very high value, and a regular user can then lower it back to the (probably) lower original value.
I did not see it happening but there are some statements in the documentation that make me think not only PRIORITY but perhaps the NICE value might be changed by METHOD II and METHOD III, although I could not get the NICE value to be inadvertently changed.
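One way to check this more systematically would be to record Priority and Nice before a hold and compare them after the release. The sketch below assumes `scontrol show job` prints space-separated KEY=VALUE fields such as Priority= and Nice=. The helper name `job_field` and the sample line (including the Account and QOS values) are made up for illustration, not captured output.

```shell
#!/bin/sh
# Print the value of one KEY=VALUE field from `scontrol show job` text on stdin.
# $1 is the field name, e.g. Priority or Nice.
job_field() {
    tr ' ' '\n' | sed -n "s/^$1=//p" | head -n 1
}

# Illustrative stand-in for one line of `scontrol show job 373` output.
sample='   Priority=11111 Nice=0 Account=demo QOS=normal'

echo "$sample" | job_field Priority
echo "$sample" | job_field Nice

# Against a live system the same helper might be used as (not run here):
#   before=$(scontrol show job 373 | job_field Priority)
#   scontrol uhold 373; scontrol release 373
#   scontrol update jobid=373 Priority="$before"
```

Restoring the saved value this way only works for a regular user if the saved priority is lower than the one set by release, which matches what I saw.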
Sent with [Proton Mail](https://proton.me/) secure email.
IMO the recommended method does not work well for jobs that already have a start time in the future, and it does not change the Reason to something that explicitly tells you the start time was changed in order to hold the job. That makes it problematic to identify such jobs and release them, as the start time might have been set for other reasons. A "magic number" start time that is easy to identify and unlikely to be an actual value would be useful instead of something like "now+duration"; additionally setting a comment field indicating that the job is being held would also help.
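A minimal sketch of that idea, which prints the commands instead of running them. The marker string, the sentinel date, and the function names are all hypothetical. It assumes that the Comment field can be set through `scontrol update` by whoever runs it, and that the site's Slurm accepts the sentinel date; I have not verified either on other installations.

```shell
#!/bin/sh
# Hypothetical values: choose a marker and sentinel your site will recognize.
HOLD_TAG="held-via-starttime"
SENTINEL_START="2037-01-01T00:00:00"

# Print (rather than run) the command that would hold job $1 and tag it.
hold_cmd() {
    echo "scontrol update JobId=$1 StartTime=$SENTINEL_START Comment=$HOLD_TAG"
}

# Print the command that would let job $1 start again.
release_cmd() {
    echo "scontrol update JobId=$1 StartTime=now"
}

hold_cmd 1234
release_cmd 1234
```

Note that this still discards whatever start time the job had originally and overwrites any existing comment, so it only addresses the identification problem.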
I have not used the Priority attribute all that much yet. Is it a bug that releasing a job makes the Priority very high? Do other installations see that behavior? I see several mentions of users only being able to reduce the Priority of their jobs.
On Saturday, February 24th, 2024 at 9:44 PM, urbanjost via slurm-users slurm-users@lists.schedmd.com wrote: