[slurm-users] PrivateData does not filter the billing info "scontrol show assoc_mgr flags=qos"

Fri Aug 20 14:41:55 UTC 2021

Hi Juergen,

 Thanks for the guidance.

 >> is PrivateData also set in your slurmdbd.conf?

No, it is not set in slurmdbd.conf. I will set it and verify.
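
A minimal sketch of that check, assuming both files live under /etc/slurm (the directory used for slurm.conf in the quoted message below; the slurmdbd.conf location is an assumption and may differ per site). The helper name `show_private_data` is hypothetical. slurmdbd.conf is often readable only by the Slurm user, so the loop reports unreadable files instead of failing.

```shell
#!/bin/sh
# Hypothetical helper: print the active (non-commented) PrivateData line
# from a Slurm config file. Returns non-zero when the parameter is not set.
show_private_data() {
    grep -i '^[[:space:]]*PrivateData[[:space:]]*=' "$1"
}

# Assumed config directory; adjust for your site.
CONF_DIR=/etc/slurm

for f in slurm.conf slurmdbd.conf; do
    if [ -r "$CONF_DIR/$f" ]; then
        show_private_data "$CONF_DIR/$f" || echo "$f: PrivateData not set"
    else
        echo "$f: not readable"
    fi
done
```

If slurmdbd.conf turns out to have no PrivateData line, one would be added there and the check repeated.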

Thanks
Hemanta

On Fri, Aug 20, 2021 at 2:02 PM <slurm-users-request at lists.schedmd.com>
wrote:

> Today's Topics:
>
>    1. Re: PrivateData does not filter the billing info "scontrol
>       show assoc_mgr flags=qos" (Juergen Salk)
>    2. Preemption not working for jobs in higher priority partition
>       (Russell Jones)
>    3. GPU jobs not running correctly (Andrey Malyutin)
>    4. Re: GPU jobs not running correctly (Fulcomer, Samuel)
>    5. jobs stuck in "CG" state (Durai Arasan)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 19 Aug 2021 22:51:57 +0200
> From: Juergen Salk <juergen.salk at uni-ulm.de>
> To: Slurm User Community List <slurm-users at lists.schedmd.com>
> Subject: Re: [slurm-users] PrivateData does not filter the billing
>         info "scontrol show assoc_mgr flags=qos"
> Message-ID: <20210819205157.GA1331266 at qualle.rz.uni-ulm.de>
> Content-Type: text/plain; charset=us-ascii
>
> Hi Hemanta,
>
> is PrivateData also set in your slurmdbd.conf?
>
> Best regards
> Juergen
>
>
>
> * Hemanta Sahu <hemantaku.sahu at gmail.com> [210818 15:04]:
> > I am still searching for a solution for this.
> >
> > On Fri, Aug 7, 2020 at 1:15 PM Hemanta Sahu <hemantaku.sahu at gmail.com>
> > wrote:
> >
> > > Hi All,
> > >
> > >   I have configured the "PrivateData" parameter in "slurm.conf" on our
> > > test cluster as below.
> > >
> > > >>
> > > [testuser1@centos7vm01 ~]$  cat /etc/slurm/slurm.conf|less
> > >
> > > PrivateData=accounts,jobs,reservations,usage,users,events,partitions,nodes
> > > MCSPlugin=mcs/user
> > > MCSParameters=enforced,select,privatedata
> > > >>
> > >
> > > The command "scontrol show assoc_mgr flags=Association" filters the
> > > relevant information for the user.
> > > But "scontrol show assoc_mgr flags=qos" does not filter anything; rather,
> > > it shows the information about all QOS to normal users who don't even
> > > have Slurm Operator/Slurm Administrator privileges. Basically, I want to
> > > hide the billing details from users who are not coordinators for a
> > > particular account.
> > >
> > >   Appreciate any help or guidance.
> > >
> > > >>
> > > [testuser1@centos7vm01 ~]$ scontrol show assoc_mgr flags=qos|egrep "QOS|GrpTRESMins"
> > > QOS Records
> > > QOS=normal(1)
> > > GrpTRESMins=cpu=N(0),mem=N(78),energy=N(0),node=N(0),billing=N(0),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
> > > QOS=testfac1(7)
> > > GrpTRESMins=cpu=N(0),mem=N(143),energy=N(0),node=N(0),billing=6000000(0),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
> > > QOS=cdac_fac1(10)
> > > GrpTRESMins=cpu=N(10),mem=N(163830),energy=N(0),node=N(4),billing=10000000(11),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
> > > QOS=iitkgp_fac1(11)
> > > GrpTRESMins=cpu=N(0),mem=N(20899),energy=N(0),node=N(0),billing=10000000(0),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
> > > QOS=iitkgp_faculty(13)
> > > GrpTRESMins=cpu=N(92),mem=N(379873),energy=N(0),node=N(35),billing=N(175),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
> > >
> > > [testuser1@centos7vm01 ~]$ scontrol show assoc_mgr flags=Association|grep GrpTRESMins
> > > GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),billing=N(0),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
> > > [testuser1@centos7vm01 ~]$
> > > >>
> > >
> > > Regards,
> > > Hemanta
> > >
> > > Hemanta Kumar Sahu
> > > Senior System Engineer
> > > CCDS,JC Bose Annexe
> > > Phone:03222-304604/Ext:84604
> > > I I T Kharagpur-721302
> > > E-Mail: hksahu at iitkgp.ac.in
> > >             hemantaku.sahu at gmail.com
> > >
>
>
>
>
> ------------------------------
>
> Message: 2
> Date: Thu, 19 Aug 2021 16:49:05 -0500
> From: Russell Jones <arjones85 at gmail.com>
> To: Slurm User Community List <slurm-users at lists.schedmd.com>
> Subject: [slurm-users] Preemption not working for jobs in higher
>         priority partition
> Message-ID:
>         <CABb1d=hx54=
> jb9UC+zpf3JAe+V5f0wdPMAQD1KU0UEKPDNkRfA at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hi all,
>
> I could use some help to understand why preemption is not working for me
> properly. I have a job blocking other jobs that doesn't make sense to me.
> Any assistance is appreciated, thank you!
>
>
> I have two partitions defined in slurm, a day time and a night time
> partition:
>
> Day partition - PriorityTier of 5, always Up. Limited resources under this
> QOS.
> Night partition - PriorityTier of 5 during night time, during day time set
> to Down and PriorityTier changed to 1. Jobs can be submitted to night queue
> for an unlimited QOS as long as resources are available.
>
> The thought here is jobs can continue to run in the night partition, even
> during the day time, until resources are requested from the day partition.
> Jobs would then be requeued/canceled in the night partition to
> satisfy those requirements.
>
>
>
> Current output of "scontrol show part" :
>
> PartitionName=day
>    AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
>    AllocNodes=ALL Default=NO QoS=part_day
>    DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
>    MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
>    Nodes=cluster-r1n[01-13],cluster-r2n[01-08]
>    PriorityJobFactor=1 PriorityTier=5 RootOnly=NO ReqResv=NO OverSubscribe=NO
>    OverTimeLimit=NONE PreemptMode=REQUEUE
>    State=UP TotalCPUs=336 TotalNodes=21 SelectTypeParameters=NONE
>    JobDefaults=(null)
>    DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
>
>
> PartitionName=night
>    AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
>    AllocNodes=ALL Default=NO QoS=part_night
>    DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
>    MaxNodes=22 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
>    Nodes=cluster-r1n[01-13],cluster-r2n[01-08]
>    PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
>    OverTimeLimit=NONE PreemptMode=REQUEUE
>    State=DOWN TotalCPUs=336 TotalNodes=21 SelectTypeParameters=NONE
>    JobDefaults=(null)
>    DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
>
>
>
>
> I currently have a job in the night partition that is blocking jobs in the
> day partition, even though the day partition has a PriorityTier of 5, and
> night partition is Down with a PriorityTier of 1.
>
> My current slurm.conf preemption settings are:
>
> PreemptMode=REQUEUE
> PreemptType=preempt/partition_prio
>
>
>
> The blocking job's scontrol show job output is:
>
> JobId=105713 JobName=jobname
>    Priority=1986 Nice=0 Account=xxx QOS=normal
>    JobState=RUNNING Reason=None Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>    RunTime=17:49:39 TimeLimit=7-00:00:00 TimeMin=N/A
>    SubmitTime=2021-08-18T22:36:36 EligibleTime=2021-08-18T22:36:36
>    AccrueTime=2021-08-18T22:36:36
>    StartTime=2021-08-18T22:36:39 EndTime=2021-08-25T22:36:39 Deadline=N/A
>    PreemptEligibleTime=2021-08-18T22:36:39 PreemptTime=None
>    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-08-18T22:36:39
>    Partition=night AllocNode:Sid=cluster-1:1341505
>    ReqNodeList=(null) ExcNodeList=(null)
>    NodeList=cluster-r1n[12-13],cluster-r2n[04-06]
>    BatchHost=cluster-r1n12
>    NumNodes=5 NumCPUs=80 NumTasks=5 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>    TRES=cpu=80,node=5,billing=80,gres/gpu=20
>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>    MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>    Features=(null) DelayBoot=00:00:00
>    OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
>
>
>
> The job that is being blocked:
>
> JobId=105876 JobName=bash
>    Priority=2103 Nice=0 Account=xxx QOS=normal
>    JobState=PENDING Reason=Nodes_required_for_job_are_DOWN,_DRAINED_or_reserved_for_jobs_in_higher_priority_partitions Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
>    RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
>    SubmitTime=2021-08-19T16:19:23 EligibleTime=2021-08-19T16:19:23
>    AccrueTime=2021-08-19T16:19:23
>    StartTime=Unknown EndTime=Unknown Deadline=N/A
>    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-08-19T16:26:43
>    Partition=day AllocNode:Sid=cluster-1:2776451
>    ReqNodeList=(null) ExcNodeList=(null)
>    NodeList=(null)
>    NumNodes=3 NumCPUs=40 NumTasks=40 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>    TRES=cpu=40,node=1,billing=40
>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>    MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>    Features=(null) DelayBoot=00:00:00
>    OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
>
>
>
> Why is the day job not preempting the night job?
>
> ------------------------------
>
> Message: 3
> Date: Thu, 19 Aug 2021 17:35:29 -0700
> From: Andrey Malyutin <malyutinag at gmail.com>
> To: slurm-users at lists.schedmd.com
> Subject: [slurm-users] GPU jobs not running correctly
> Message-ID:
>         <CAGiFTXK6cT=
> MRV2FUEwCCpbvuwTfeoRsmjcJao9ULtfVtuefKA at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hello,
>
> We are in the process of finishing up the setup of a cluster with 3 nodes,
> 4 GPUs each. One node has RTX3090s and the other 2 have RTX6000s. Any job
> asking for 1 GPU in the submission script will wait to run on the 3090
> node, no matter resource availability. Same job requesting 2 or more GPUs
> will run on any node. I don't even know where to begin troubleshooting this
> issue; entries for the 3 nodes are effectively identical in slurm.conf. Any
> help would be appreciated. (If helpful - this cluster is used for
> structural biology, with cryosparc and relion packages).
>
> Thank you,
> Andrey
>
> ------------------------------
>
> Message: 4
> Date: Thu, 19 Aug 2021 21:05:28 -0400
> From: "Fulcomer, Samuel" <samuel_fulcomer at brown.edu>
> To: Slurm User Community List <slurm-users at lists.schedmd.com>
> Subject: Re: [slurm-users] GPU jobs not running correctly
> Message-ID:
>         <CAOORAuFa+ahMxY--8=a1dVu4cPGUuVSojEDv=Sxg6kfaJLi=
> Zw at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> What SLURM version are you running?
>
> What are the #SBATCH directives in the batch script? (or the sbatch
> arguments)
>
> When the single GPU jobs are pending, what's the output of 'scontrol show
> job JOBID'?
>
> What are the node definitions in slurm.conf, and the lines in gres.conf?
>
> Are the nodes all the same host platform (motherboard)?
>
> We have P100s, TitanVs, Titan RTXs, Quadro RTX 6000s, 3090s, V100s, DGX 1s,
> A6000s, and A40s, with a mix of single and dual-root platforms, and haven't
> seen this problem with SLURM 20.02.6 or earlier versions.
>
> On Thu, Aug 19, 2021 at 8:38 PM Andrey Malyutin <malyutinag at gmail.com>
> wrote:
>
> > Hello,
> >
> > We are in the process of finishing up the setup of a cluster with 3
> nodes,
> > 4 GPUs each. One node has RTX3090s and the other 2 have RTX6000s. Any job
> > asking for 1 GPU in the submission script will wait to run on the 3090
> > node, no matter resource availability. Same job requesting 2 or more GPUs
> > will run on any node. I don't even know where to begin troubleshooting
> this
> > issue; entries for the 3 nodes are effectively identical in slurm.conf.
> Any
> > help would be appreciated. (If helpful - this cluster is used for
> > structural biology, with cryosparc and relion packages).
> >
> > Thank you,
> > Andrey
> >
>
> ------------------------------
>
> Message: 5
> Date: Fri, 20 Aug 2021 10:31:40 +0200
> From: Durai Arasan <arasan.durai at gmail.com>
> To: Slurm User Community List <slurm-users at lists.schedmd.com>
> Subject: [slurm-users] jobs stuck in "CG" state
> Message-ID:
>         <CA+WZHCZsT4OiL9p3i9BfYArERYzqhyM9eNrYH=
> cR7cWLEPwcEw at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hello!
>
> We have a huge number of jobs stuck in CG state from a user who probably
> wrote code with bad I/O. "scancel" does not make them go away. Is there a
> way for admins to get rid of these jobs without draining and rebooting the
> nodes? I read somewhere that killing the respective slurmstepd process will
> do the job. Is this possible? Any other solutions? Also are there any
> parameters in slurm.conf one can set to manage such situations better?
>
> Best,
> Durai
> MPI Tübingen
>
> End of slurm-users Digest, Vol 46, Issue 20
> *******************************************
>