[slurm-users] Drain a single user's jobs

Mark Dixon mark.c.dixon at durham.ac.uk
Wed Apr 1 19:10:25 UTC 2020


Ah-ha! Figured out what I did wrong:

   "sacctmgr modify user foo set qos=drain"

   This set the list of qos available to the user. The user inherited a
   default qos of "normal" for their jobs, which was no longer in that
   list - hence the InvalidQOS.
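
A rough sketch of how this kind of mismatch could be spotted, for anyone
hitting the same thing. It assumes input lines of the form
"user|allowed_qos_list|default_qos"; the helper name check_default_qos is
made up for this example and is not part of Slurm. Whether an inherited
default qos actually shows up in the sacctmgr output is an assumption to
check on your own version.

```shell
# Sketch only: read "user|allowed_qos_list|default_qos" lines on stdin and
# report users whose default qos is missing from their allowed list.
# "check_default_qos" is a hypothetical helper name.
check_default_qos() {
    while IFS='|' read -r user allowed default; do
        [ -n "$user" ] || continue      # skip blank lines
        [ -n "$default" ] || continue   # no default recorded: nothing to compare
        case ",$allowed," in
            *",$default,"*) ;;          # default is in the allowed list
            *) printf '%s: default qos "%s" not in allowed list "%s"\n' \
                   "$user" "$default" "$allowed" ;;
        esac
    done
}

# Possible use on a cluster (assumes sacctmgr is on PATH):
#   sacctmgr -n -P show assoc where user=foo format=user,qos,defaultqos \
#       | check_default_qos
```

Any line it prints is an association whose jobs could end up pending with
InvalidQOS in the way described above.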

I needed to override the default qos for foo's jobs:

   "sacctmgr modify user foo set qos=drain defaultqos=drain"

   And then update the qos on all of foo's waiting jobs.
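
One possible way to do that last step, as a sketch rather than a tested
recipe: generate one "scontrol update" line per pending job and review
the list before running anything. The helper name print_qos_updates is
made up for this example, and it has not been checked against job arrays.

```shell
# Sketch only: print (not run) one "scontrol update" per job ID read from stdin.
# "print_qos_updates" is a hypothetical helper name.
print_qos_updates() {
    qos="$1"
    while read -r jobid; do
        [ -n "$jobid" ] || continue     # skip blank lines
        printf 'scontrol update jobid=%s qos=%s\n' "$jobid" "$qos"
    done
}

# Possible use on a cluster (assumes squeue is on PATH):
#   squeue -h -u foo -t PENDING -o %i | print_qos_updates drain
# Review the output, then pipe it to "sh" to apply it.
```

Printing the commands instead of running them keeps the step reversible
until the list has been looked over.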

I'll be using David's GrpSubmitJobs=0 suggestion instead.
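
For the record, a small sketch of setting and later clearing that limit.
The helper name submit_limit_cmd is made up for this example; it only
prints the sacctmgr command. Using -1 to clear a previously set limit is
the convention described in the sacctmgr man page, but check it against
your own version.

```shell
# Sketch only: print the sacctmgr command that blocks or re-enables submission.
# "submit_limit_cmd" is a hypothetical helper name, not part of Slurm.
submit_limit_cmd() {
    user="$1"
    action="$2"
    case "$action" in
        block)   limit=0 ;;    # no new submissions for this user
        unblock) limit=-1 ;;   # -1 clears a previously set limit
        *) echo "usage: submit_limit_cmd USER block|unblock" >&2; return 1 ;;
    esac
    printf 'sacctmgr modify user %s set GrpSubmitJobs=%s\n' "$user" "$limit"
}

# Example:
#   submit_limit_cmd foo block
#   submit_limit_cmd foo unblock
```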

Thanks for everyone's help,

Mark

On Wed, 1 Apr 2020, Mark Dixon wrote:

> Hi Ahmet,
>
> Another way to do it! Many thanks - very useful :)
>
> But does anyone know why a user association with my qos stopped jobs from
> running, with reason InvalidQOS?
>
> I can imagine using a user qos to override a partition qos being useful for 
> other things, so would be nice to know what I've done wrong.
>
> Best,
>
> Mark
>
> On Wed, 1 Apr 2020, mercan wrote:
>
>>  Hi;
>>
>>  If you have a working job_submit.lua script, you can use it to block new
>>  jobs from a specific user:
>>
>>  if job_desc.user_name == "baduser" then
>>      return 2045
>>  end
>>
>>  That's all!
>>
>>  Regards;
>>
>>  Ahmet M.
>> 
>>
>>  On 1.04.2020 at 16:22, Mark Dixon wrote:
>>>   Hi David,
>>>
>>>   Thanks for this, it sounds like I've not been trying crazy methods - but
>>>   they don't work for me:
>>>
>>>   - "sacctmgr modify user foo set qos=drain" did set up the association
>>>     ("sacctmgr show associations" showed that QoS changed from "normal" to
>>>     "drain"), but this is when foo's jobs refused to start because of
>>>     reason "InvalidQOS".
>>>
>>>   - "sacctmgr update user foo set maxsubmitjobs=0" was ignored because qos
>>>     were already set on the partitions.
>>>
>>>   But... good news!
>>>
>>>   We hadn't used GrpSubmitJobs in any of our qos, so "sacctmgr modify user
>>>   foo set GrpSubmitJobs=0" isn't overridden anywhere, and the effect is
>>>   exactly what I wanted - thanks!
>>>
>>>   But if anyone knows why my attempt at using a "drain" qos stopped foo's
>>>   previously submitted jobs from running, I'd be very interested to hear
>>>   about it.
>>>
>>>   Thanks again,
>>>
>>>   Mark
>>>
>>>   On Wed, 1 Apr 2020, David Rhey wrote:
>>>
>>>>   Hi Mark,
>>>>
>>>>   I *think* you might need to update the user account to have access to
>>>>   that QoS (as part of their association). Using sacctmgr modify user
>>>>   <foo> + some additional args (they escape me at the moment).
>>>>
>>>>   Also, you *might* have been able to set the MaxSubmitJobs at their
>>>>   account level to 0 and have them run without having to do the QoS
>>>>   approach - but that's just a guess on my end based on how we've done
>>>>   some things here. We had a "free period" for our clusters and once it
>>>>   was over we set GrpSubmitJobs on an account to 0, which allowed
>>>>   in-flight jobs to continue but no new work to be submitted.
>>>>
>>>>   HTH,
>>>>
>>>>   David
>>>>
>>>>   On Wed, Apr 1, 2020 at 5:57 AM Mark Dixon <mark.c.dixon at durham.ac.uk>
>>>>   wrote:
>>>>
>>>>>   Hi all,
>>>>>
>>>>>   I'm a slurm newbie who has inherited a working slurm 16.05.10 cluster.
>>>>>
>>>>>   I'd like to stop user foo from submitting new jobs but allow their
>>>>>   existing jobs to run.
>>>>>
>>>>>   We have several partitions, each with its own qos and MaxSubmitJobs
>>>>>   typically set to some value. These qos are stopping a "sacctmgr update
>>>>>   user foo set maxsubmitjobs=0" from doing anything useful, as per the
>>>>>   documentation.
>>>>>
>>>>>   I've tried setting up a competing qos:
>>>>>
>>>>>      sacctmgr add qos drain
>>>>>      sacctmgr modify qos drain set MaxSubmitJobs=0
>>>>>      sacctmgr modify qos drain set flags=OverPartQOS
>>>>>      sacctmgr modify user foo set qos=drain
>>>>>
>>>>>   This has successfully prevented the user from submitting new jobs, but
>>>>>   their existing jobs aren't running. I'm seeing the reason code
>>>>>   "InvalidQOS".
>>>>>
>>>>>   Any ideas what I should be looking at, please?
>>>>>
>>>>>   Thanks,
>>>>>
>>>>>   Mark
>>>>>
>>>>> 
>>>>
>>>>   --
>>>>   David Rhey
>>>>   ---------------
>>>>   Advanced Research Computing - Technology Services
>>>>   University of Michigan
>>>>
>>> 
>>

