Simple question:
Does FairShare still work if every user is under one account? E.g.:
$ sacctmgr show assoc format=Account,User
   Account       User
---------- ----------
      root
      root       root
       mic
       mic     asmith
       mic     bsmith
       mic     csmith
       mic     djones
       mic     ejones
       mic    frubble
Will it divide time up fairly between the users? I have:
PriorityType=priority/multifactor
PriorityFavorSmall=YES
PriorityWeightAge=50000
PriorityWeightFairshare=100000
PriorityWeightJobSize=0
PriorityWeightQOS=0
In 21.08.8.
-- Daniel M. Drucker, Ph.D. Director of IT, MGB Imaging at Belmont McLean Hospital, a Harvard Medical School Affiliate
The information in this e-mail is intended only for the person to whom it is addressed. If you believe this e-mail was sent to you in error and the e-mail contains patient information, please contact the Mass General Brigham Compliance HelpLine at https://www.massgeneralbrigham.org/complianceline . Please note that this e-mail is not secure (encrypted). If you do not wish to continue communication over unencrypted e-mail, please notify the sender of this message immediately. Continuing to send or respond to e-mail after receiving this message means you understand and accept this risk and wish to continue to communicate over unencrypted e-mail.
I don’t have any 21.08 systems to verify with, but that’s how I remember it. Use “sshare -a -A mic” to verify. You should see both a RawShares and a NormShares column for each user. By default they’ll all have the same value, but they can be adjusted if needed.
From: Drucker, Daniel via slurm-users <slurm-users@lists.schedmd.com>
Date: Friday, August 9, 2024 at 1:39 PM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] FairShare if there's only one account?
External Email Warning
This email originated from outside the university. Please use caution when opening attachments, clicking links, or responding to requests.
Looks like this:
$ sshare -a -A mic
             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
mic                                    120    0.991736    55524598      1.000000
 mic                     asmith     parent    0.991736     2532311      0.045607   0.983871
 mic                     bsmith     parent    0.991736           0      0.000000   0.983871
 mic                     csmith     parent    0.991736     3265529      0.058805   0.983871
 mic                     djones     parent    0.991736           0      0.000000   0.983871
 mic                     ejones     parent    0.991736     2210952      0.039820   0.983871
...etc etc etc...
Does that look right?
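One quick way to answer "does that look right?" is to check the sshare numbers against each other. The minimal Python sketch below does this. It makes one assumption that the thread does not confirm: mic's only sibling under root is the root user, holding the default single share. That would make NormShares = 120/121. The sketch also compares each user's EffectvUsage with that user's RawUsage divided by mic's RawUsage. Only the rows shown above are used, and the elided rows are not filled in.

```python
# Values copied from the `sshare -a -A mic` output above.
ACCOUNT_RAW_SHARES = 120
ACCOUNT_RAW_USAGE = 55524598
users = {
    # user: (RawUsage, EffectvUsage as reported by sshare)
    "asmith": (2532311, 0.045607),
    "bsmith": (0, 0.000000),
    "csmith": (3265529, 0.058805),
    "djones": (0, 0.000000),
    "ejones": (2210952, 0.039820),
}

# Assumption: mic's only sibling under root is the root user with 1 share.
norm_shares = ACCOUNT_RAW_SHARES / (ACCOUNT_RAW_SHARES + 1)
print(f"NormShares ~ {norm_shares:.6f}")

# Compare each user's share of mic's usage with the reported EffectvUsage.
ratios = {}
for user, (raw_usage, reported) in users.items():
    ratios[user] = raw_usage / ACCOUNT_RAW_USAGE
    print(f"{user}: RawUsage/mic = {ratios[user]:.6f}, EffectvUsage = {reported:.6f}")
```

The ratios agree with the reported EffectvUsage to within about 1e-5. So the output looks internally consistent. Note also that every user shows RawShares `parent` and the same FairShare (0.983871).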
On Aug 9, 2024, at 4:05 PM, Renfro, Michael via slurm-users slurm-users@lists.schedmd.com wrote:
External Email - Use Caution
-- slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-leave@lists.schedmd.com
The format has changed a bit: none of the values in our RawShares column are ‘parent’.
But you can test this to be certain.
If your cluster already has jobs pending, have bsmith (who has zero usage) and csmith (who has relatively a lot of usage) each submit several jobs into the pending queue. Alternatively, have bsmith and csmith submit jobs with resource requests large enough that they automatically go into a pending state due to lack of resources; those might even be jobs that request the whole cluster.
bsmith’s jobs should get a higher priority, as shown by sprio, and should start earlier than csmith’s.
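For the sprio side of that test, here is a hypothetical Python helper. The function name is illustrative and is not a Slurm tool. It runs `sprio -h -u bsmith,csmith -o "%i %u %Y"`, which is meant to print job ID, user, and weighted priority; check those format codes against `man sprio` on your version. It then averages the priority per user. The sprio call is skipped when the command is not on PATH.

```python
import shutil
import subprocess
from collections import defaultdict


def mean_priority_by_user(sprio_text):
    """Average weighted priority per user from lines of `sprio -h -o "%i %u %Y"`."""
    per_user = defaultdict(list)
    for line in sprio_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        _job_id, user, priority = fields
        per_user[user].append(float(priority))
    return {user: sum(vals) / len(vals) for user, vals in per_user.items()}


# Only query Slurm when sprio is installed on this host.
if shutil.which("sprio"):
    result = subprocess.run(
        ["sprio", "-h", "-u", "bsmith,csmith", "-o", "%i %u %Y"],
        capture_output=True, text=True, check=True,
    )
    for user, mean in sorted(mean_priority_by_user(result.stdout).items()):
        print(f"{user}: mean priority {mean:.0f}")
```

Averaging smooths over differences caused only by how long each job has been pending. If fairshare is working per user, bsmith's mean should come out higher.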
From: Drucker, Daniel <DDRUCKER@MCLEAN.HARVARD.EDU>
Date: Friday, August 9, 2024 at 3:11 PM
To: Renfro, Michael <Renfro@tntech.edu>
Cc: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] FairShare if there's only one account?
I got the opposite result. When I submitted a job as bsmith, they got a lower priority (the number was smaller) than the job submitted as csmith.
bsmith (who has never submitted a job before) got a priority of 98387 (which is 100000 times the 0.983871 FairShare), whereas csmith (who is already running a huge number of jobs and has been for days now) got a priority of 103749.
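Both numbers fit the multifactor plugin's weighted sum of factors, as the minimal Python sketch below shows. It assumes three things:

- Only the age and fairshare weights from the config above are non-zero.
- Both jobs carry mic's FairShare of 0.983871.
- bsmith's freshly submitted job has an age factor of about 0.

The fairshare term alone then gives 100000 × 0.983871 ≈ 98387 (the sketch truncates to an integer). csmith's extra priority would come entirely from the age term. Converting that age factor to hours pending also assumes PriorityMaxAge is at its default of 7 days, which the thread does not state.

```python
# Non-zero weights from the slurm.conf settings quoted in the thread.
WEIGHT_AGE = 50000
WEIGHT_FAIRSHARE = 100000

FAIRSHARE = 0.983871      # mic's FairShare, shown for every "parent" user
ASSUMED_MAX_AGE_DAYS = 7  # assumption: PriorityMaxAge left at its default


def job_priority(age_factor, fairshare_factor):
    """Weighted sum of normalized factors; all other weights assumed to be 0."""
    return int(WEIGHT_AGE * age_factor + WEIGHT_FAIRSHARE * fairshare_factor)


# bsmith's job had only just been submitted, so its age factor is taken as 0.
bsmith = job_priority(0.0, FAIRSHARE)
print(bsmith)  # 98387

# Solve for the age factor that would produce csmith's reported 103749.
age_factor = (103749 - WEIGHT_FAIRSHARE * FAIRSHARE) / WEIGHT_AGE
hours_pending = age_factor * ASSUMED_MAX_AGE_DAYS * 24
print(f"age factor ~ {age_factor:.4f}, i.e. ~{hours_pending:.1f} h pending")
```

Under these assumptions, the two users' fairshare terms are identical. The gap between them (about 5362) is just queue age.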
On Aug 9, 2024, at 5:11 PM, Renfro, Michael Renfro@tntech.edu wrote:
This depends on how you have assigned fairshare in sacctmgr when creating the accounts and users. At our site we want fairshare only on accounts and not users, just like you are seeing, so we create accounts with
sacctmgr -i add account $acct Description="$descr" \
    fairshare=200 GrpJobsAccrue=8
and users with
sacctmgr -i add user "$u" account=$acct fairshare=parent
If you want users to have their own independent fairshare, you do not use fairshare=parent but assign a real number.
-- Paul Raines (http://help.nmr.mgh.harvard.edu)
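To see why giving users a real number of shares changes the outcome, here is a deliberately simplified Python model of the Fair Tree algorithm. Fair Tree has been the default since Slurm 19.05, and it is assumed, not confirmed, to be in effect here. The model is restricted to the users of one account, each given 1 share instead of `parent`, and works as follows:

- Level fairshare is the user's share fraction divided by their usage fraction.
- Users are ranked by level fairshare, and tied users share a rank.
- The factor is rank divided by user count.

Usage values are copied from the sshare output earlier in the thread. Real Fair Tree ranks all users cluster-wide and applies usage decay, which this sketch ignores.

```python
import math


def fair_tree_factors(shares, usage):
    """Simplified Fair Tree ranking for sibling users in a single account.

    Level fairshare = (share fraction) / (usage fraction). Users are sorted by
    it in descending order, tied users share a rank, and factor = rank / n.
    """
    total_shares = sum(shares.values())
    total_usage = sum(usage.values())
    level_fs = {}
    for user, share in shares.items():
        s = share / total_shares
        u = usage[user] / total_usage if total_usage else 0.0
        # Zero usage gives an infinite level fairshare, so such users rank first.
        level_fs[user] = math.inf if u == 0 else s / u

    ordered = sorted(level_fs, key=level_fs.get, reverse=True)
    n = len(ordered)
    factors = {}
    rank = n
    for i, user in enumerate(ordered):
        if i > 0 and level_fs[user] != level_fs[ordered[i - 1]]:
            rank = n - i  # a new distinct value drops below everyone ahead of it
        factors[user] = rank / n
    return factors


# RawUsage from the sshare output; hypothetically, each user gets 1 share, not "parent".
usage = {"asmith": 2532311, "bsmith": 0, "csmith": 3265529, "djones": 0, "ejones": 2210952}
shares = {user: 1 for user in usage}

factors = fair_tree_factors(shares, usage)
for user, factor in sorted(factors.items()):
    print(f"{user}: {factor:.2f}")
```

In this toy model, the zero-usage users (bsmith, djones) get the top factor and csmith gets the lowest. With `parent`, every user simply inherits mic's single value instead.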
On Fri, 9 Aug 2024 5:20pm, Drucker, Daniel via slurm-users wrote:
Hi Paul, from over at mclean.harvard.edu!
I have never added any users using sacctmgr; I guess everyone has always just automatically joined the default account, mic. Are you saying that is what is causing my problem?
I guess I'm confused because I would have expected that within an account - even if there is only one - users would get their 'fair share' of resources, rather than everything defaulting to FIFO or something. But that doesn't seem to be the case.
I do not want any particular user to start out with more priority than any other particular user - I just want to make sure that if user A submits a million jobs at noon, and user B submits one job at 12:01, user B doesn't have to wait until those million jobs finish.
Daniel
On Aug 9, 2024, at 5:47 PM, Paul Raines raines@nmr.mgh.harvard.edu wrote:
I don't think fairshare use is updated until jobs finish...
On Fri, Aug 9, 2024 at 5:59 PM Drucker, Daniel via slurm-users < slurm-users@lists.schedmd.com> wrote:
Well, let's say user A has completed a million jobs in the last few days as well, and user A has never submitted any before.
On Aug 9, 2024, at 6:03 PM, Fulcomer, Samuel samuel_fulcomer@brown.edu wrote:
Er, I meant: user B has never submitted any before.
On Aug 9, 2024, at 6:08 PM, Daniel M. Drucker ddrucker@mclean.harvard.edu wrote:
Well, let's say user A has completed a million jobs in the last few days as well, and user A has never submitted any before.
On Aug 9, 2024, at 6:03 PM, Fulcomer, Samuel samuel_fulcomer@brown.edu wrote:
External Email - Use Caution
I don't think fairshare use is updated until jobs finish...
On Fri, Aug 9, 2024 at 5:59 PM Drucker, Daniel via slurm-users <slurm-users@lists.schedmd.commailto:slurm-users@lists.schedmd.com> wrote: Hi Paul from over at mclean.harvard.eduhttp://mclean.harvard.edu/!
I have never added any users using sacctmgr - I've always just had everyone I guess automatically join the default account, mic. Are you saying that is what is causing my problem?
I'm confused I guess because I would have expected that within an account - even if there is only one - users would get their 'fair share' of resources, rather than just defaulting to FIFO or something. But that doesn't seem to be the case.
I do not want any particular user to start out with more priority than any other particular user - I just want to make sure that if user A submits a million jobs at noon, and user B submits one job at 12:01, user B doesn't have to wait until those million jobs finish.
Daniel
On Aug 9, 2024, at 5:47 PM, Paul Raines <raines@nmr.mgh.harvard.edumailto:raines@nmr.mgh.harvard.edu> wrote:
This depends on how you have assigned fairshare in sacctmgr when creating the accounts and users. At our site we want fairshare only on accounts and not users, just like you are seeing, so we create accounts with
sacctmgr -i add account $acct Description="$descr" \ fairshare=200 GrpJobsAccrue=8
and users with
sacctmgr -i add user "$u" account=$acct fairshare=parent
If you want users to have their own independent fairshare, you do not use fairshare=parent but assign a real number.
-- Paul Raines (http://help.nmr.mgh.harvard.eduhttp://help.nmr.mgh.harvard.edu/)
On Fri, 9 Aug 2024 5:20pm, Drucker, Daniel via slurm-users wrote:
I got the opposite result. When I submitted a job as bsmith, they got a lower priority (the number was smaller) than the job submitted as csmith.
bsmith (who has never submitted a job before) got a priority of 98387 (which is 100000 times the 0.983871 FairShare), whereas csmith (who is already running a huge number of jobs and has been for days now) got a priority of 103749.
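Those two numbers can be checked against the multifactor formula, in which each factor (between 0 and 1) is multiplied by its PriorityWeight* value and the products are summed. This Python sketch assumes that only the age and fairshare terms contribute (PriorityWeightJobSize and PriorityWeightQOS are 0 in the slurm.conf quoted in this thread; the partition, association, TRES and nice terms are assumed to be 0 as well) and truncates to an integer, which is consistent with the reported 98387. The name job_priority is hypothetical.

```python
# Weights from the slurm.conf lines quoted in this thread. Every other
# multifactor term (partition, association, TRES, nice) is assumed to be 0.
WEIGHTS = {"age": 50000, "fairshare": 100000, "jobsize": 0, "qos": 0}


def job_priority(factors, weights=WEIGHTS):
    """Sum of weight * factor over the terms, truncated to an integer."""
    return int(sum(w * factors.get(name, 0.0) for name, w in weights.items()))


fairshare = 0.983871  # the same FairShare value for bsmith and csmith
base = job_priority({"fairshare": fairshare})
print(base)  # 98387 (100000 * 0.983871 = 98387.1, truncated)

# Under these assumptions, the rest of csmith's 103749 must be the age term.
age_factor = (103749 - base) / WEIGHTS["age"]
print(round(age_factor, 5))  # 0.10724
print(job_priority({"fairshare": fairshare, "age": age_factor}))  # 103749

# With PriorityWeightAge=0 (the change made later in the thread), both jobs
# get the same fairshare-only priority.
no_age = dict(WEIGHTS, age=0)
print(job_priority({"fairshare": fairshare, "age": age_factor}, no_age))  # 98387
```

Under those assumptions both users carry the same fairshare term, and the extra 5362 points on csmith's job would come from the age term alone.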
On Aug 9, 2024, at 5:11 PM, Renfro, Michael <Renfro@tntech.edu> wrote:
The format has changed a bit, since none of the values in our RawShares column are ‘parent’.
But you can test this to be certain.
If your cluster already has jobs pending, have bsmith (who has zero usage) and csmith (who has a lot of usage, relatively) each submit several jobs into the pending queue. Alternatively, have bsmith and csmith submit jobs with larger resource requests: jobs that are large enough to automatically go into a pending state due to lack of resources. Those might be jobs that request the whole cluster, even.
bsmith’s jobs should get a higher priority as seen from sprio, and bsmith’s jobs should start earlier than csmith’s.
Yes, well, in that case, it should work as you desire, modulo your slurm.conf settings. What are the relevant lines in yours?
On Fri, Aug 9, 2024 at 6:09 PM Drucker, Daniel DDRUCKER@mclean.harvard.edu wrote:
Er, user B has never.
PriorityType=priority/multifactor
PriorityFavorSmall=YES
PriorityWeightAge=50000
PriorityWeightFairshare=100000
PriorityWeightJobSize=0
PriorityWeightQOS=0
In 21.08.8.
On Aug 9, 2024, at 8:36 PM, Fulcomer, Samuel samuel_fulcomer@brown.edu wrote:
Yes, well, in that case, it should work as you desire, modulo your slurm.conf settings. What are the relevant lines in yours?
...and what are the top 10-15 lines in your share output?...
On Fri, Aug 9, 2024 at 9:07 PM Drucker, Daniel DDRUCKER@mclean.harvard.edu wrote:
PriorityType=priority/multifactor
PriorityFavorSmall=YES
PriorityWeightAge=50000
PriorityWeightFairshare=100000
PriorityWeightJobSize=0
PriorityWeightQOS=0
In 21.08.8.
"sshare"...., not share....
And note that the high PriorityWeightAge may be complicating things. We set it to 0. With it set so high, it allows users to gain priority by flooding the queue if you allow high numbers of job submissions and they age up in priority while they're waiting to run.
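To put numbers on Samuel's point, here is a Python sketch using the thread's weights (PriorityWeightAge=50000, PriorityWeightFairshare=100000). It assumes the age factor rises linearly with pending time and is capped at 1.0 after PriorityMaxAge, taken here as 7 days (I believe that is Slurm's default, but check your own slurm.conf); the fairshare values 0.75 and 1.0 are hypothetical.

```python
def priority(pending_hours, fairshare, w_age=50000, w_fs=100000,
             max_age_hours=7 * 24):
    """Age + fairshare terms only; the age factor is capped at 1.0."""
    age_factor = min(pending_hours / max_age_hours, 1.0)
    return int(w_age * age_factor + w_fs * fairshare)


# A heavy user's job (hypothetical fairshare 0.75) that has waited a week,
# versus a brand-new job from a user with no usage (fairshare 1.0).
old_heavy = priority(pending_hours=7 * 24, fairshare=0.75)
new_light = priority(pending_hours=0, fairshare=1.0)
print(old_heavy, new_light)  # 125000 100000: the old queued job wins

# Break-even: the heavy user's jobs catch up once they have waited half of
# PriorityMaxAge, because 50000 * 0.5 covers the 25000-point fairshare gap.
print(priority(84, 0.75))  # 100000

# With PriorityWeightAge=0, only fairshare counts and the newcomer goes first.
print(priority(7 * 24, 0.75, w_age=0), priority(0, 1.0, w_age=0))  # 75000 100000
```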
On Fri, Aug 9, 2024 at 9:15 PM Fulcomer, Samuel samuel_fulcomer@brown.edu wrote:
...and what are the top 10-15 lines in your share output?...
On Aug 9, 2024, at 9:21 PM, Fulcomer, Samuel samuel_fulcomer@brown.edu wrote:
And note that the high PriorityWeightAge may be complicating things. We set it to 0. With it set so high, it allows users to gain priority by flooding the queue if you allow high numbers of job submissions and they age up in priority while they're waiting to run.
That's a great point. Changed to 0.
For users with a parent account of "mic", I'd expect the RawShares to be listed as "1", not "parent".
What's the "sprio" output for two jobs of users A and B, and which of them hasn't run any jobs?
Also, the first 15 lines of output for "sshare" (no arguments) would be useful for me.
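The NormShares values in the "sshare -l -a" output Daniel posts later in the thread are consistent with each association's RawShares divided by the total RawShares of its siblings: root's own user has 1 share and mic has 120, giving 1/121 and 120/121. A Python sketch of that arithmetic follows; the second part, giving each user under mic RawShares=1 instead of parent, is hypothetical, and reading NormShares as a within-level ratio is my interpretation of the sshare documentation rather than something verified on 21.08.

```python
from fractions import Fraction


def norm_shares(raw_shares):
    """Each sibling's RawShares divided by the total RawShares at its level."""
    total = sum(raw_shares.values())
    return {name: Fraction(raw, total) for name, raw in raw_shares.items()}


# Children of root in the "sshare -l -a" output: root's own user and "mic".
top = norm_shares({"root": 1, "mic": 120})
print({name: round(float(share), 6) for name, share in top.items()})
# {'root': 0.008264, 'mic': 0.991736}

# Hypothetical: users under mic given RawShares=1 each instead of "parent".
mic_users = norm_shares({"asmith": 1, "bsmith": 1, "csmith": 1, "djones": 1})
print(mic_users["bsmith"])  # 1/4
```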
On Fri, Aug 9, 2024 at 9:52 PM Drucker, Daniel DDRUCKER@mclean.harvard.edu wrote:
That's a great point. Changed to 0.
So I'm still getting identical priorities for every job. For example in:
squeue --format="%.18i %.9P %.50j %.8u %.8T %.10M %.9l %.6D %R %.10Q"
the PRIORITY field is 98387 (which is 100000 times the FairShare value shown in "sshare -a -A mic") for every single job, even though some of the jobs in the queue were submitted by users who have NEVER submitted a job before, and some are from users who have been submitting thousands of jobs a day, every day, for weeks.
This seems ... unfair?
Here is what is confusing me I guess. Look at the below. You can see that some people have no usage and some people have a lot of usage. But their FairShare value is all identical.
https://lists.schedmd.com/mailman3/hyperkitty/list/slurm-users@lists.schedmd... seems to say that fairshare=parent should work just fine, but what I am seeing is that it is NOT altering people's FairShare?
$ sshare -l -a
Account User RawShares NormShares RawUsage NormUsage EffectvUsage FairShare LevelFS GrpTRESMins TRESRunMins
-------------------- ---------- ---------- ----------- ----------- ----------- ------------- ---------- ---------- ------------------------------ ------------------------------
root 0.000000 62159156 0.000000 cpu=170602,mem=1397577591,ene+
root root 1 0.008264 0 0.000000 0.000000 1.000000 inf cpu=0,mem=0,energy=0,node=0,b+
mic 120 0.991736 62159156 1.000000 1.000000 0.991736 cpu=170602,mem=1397577591,ene+
mic aamedina parent 0.991736 2376285 0.038230 0.038230 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic aaruldass parent 0.991736 0 0.000000 0.000000 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic acataldo parent 0.991736 13066208 0.210193 0.210193 0.983871 cpu=169648,mem=1389757781,ene+
mic achowdhury parent 0.991736 0 0.000000 0.000000 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic ajajoo parent 0.991736 2074727 0.033378 0.033378 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic ajanes parent 0.991736 0 0.000000 0.000000 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic amandacao parent 0.991736 202 0.000003 0.000003 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic aromer parent 0.991736 0 0.000000 0.000000 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic aweerasek+ parent 0.991736 1059 0.000017 0.000017 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic batwood parent 0.991736 0 0.000000 0.000000 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic bleng parent 0.991736 3 0.000000 0.000000 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic cdemirlek parent 0.991736 6174 0.000099 0.000099 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic chun parent 0.991736 0 0.000000 0.000000 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic ckorponay parent 0.991736 1 0.000000 0.000000 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic ddickstein parent 0.991736 116395 0.001873 0.001873 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic ddillon parent 0.991736 0 0.000000 0.000000 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic ddrucker parent 0.991736 2033 0.000033 0.000033 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic dlombardo+ parent 0.991736 0 0.000000 0.000000 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic ebelleau parent 0.991736 1287758 0.020717 0.020717 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic ejoncas parent 0.991736 26064 0.000419 0.000419 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic eozan parent 0.991736 0 0.000000 0.000000 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic epalermo parent 0.991736 202905 0.003264 0.003264 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic epayne parent 0.991736 0 0.000000 0.000000 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic epcrabtree parent 0.991736 1 0.000000 0.000000 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic fdu parent 0.991736 1137902 0.018307 0.018307 0.983871 cpu=954,mem=7819810,energy=0,+
mic frederic parent 0.991736 1750024 0.028154 0.028154 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic hleblanc parent 0.991736 0 0.000000 0.000000 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic itreves parent 0.991736 11 0.000000 0.000000 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic itsatsani parent 0.991736 0 0.000000 0.000000 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic jcohen parent 0.991736 2 0.000000 0.000000 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic jpurcell parent 0.991736 2575695 0.041438 0.041438 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic jsneider parent 0.991736 1 0.000000 0.000000 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic kclancy parent 0.991736 595813 0.009585 0.009585 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic kjavaras parent 0.991736 17442 0.000281 0.000281 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic kohashi parent 0.991736 14883 0.000239 0.000239 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic kwebb parent 0.991736 2583 0.000042 0.000042 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic lfleming parent 0.991736 0 0.000000 0.000000 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic lhutson parent 0.991736 4248 0.000068 0.000068 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic lnickerson parent 0.991736 0 0.000000 0.000000 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic mhalko parent 0.991736 22566 0.000363 0.000363 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic mkuhn parent 0.991736 144709 0.002328 0.002328 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic mmaya parent 0.991736 122603 0.001972 0.001972 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic mrohan parent 0.991736 20309 0.000327 0.000327 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic mthai parent 0.991736 116 0.000002 0.000002 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic nharnett parent 0.991736 59 0.000001 0.000001 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic pkumar parent 0.991736 3330070 0.053574 0.053574 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic pzhukovsky parent 0.991736 310446 0.004994 0.004994 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic qdevignes parent 0.991736 1 0.000000 0.000000 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic rbgeary parent 0.991736 216108 0.003477 0.003477 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic rdanyogev parent 0.991736 437267 0.007035 0.007035 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic rvangool parent 0.991736 0 0.000000 0.000000 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic saslan parent 0.991736 31244545 0.502662 0.502662 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic sassili parent 0.991736 58935 0.000948 0.000948 0.983871 cpu=0,mem=0,energy=0,node=0,b+
mic sgranger parent 0.991736 195080 0.003138 0.003138 0.983871 cpu=0,mem=0,energy=0,node=0,b+
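One way to spot this pattern across many users is to read sshare's pipe-delimited output (its -P / --parsable2 option, e.g. "sshare -a -l -P"). From that output you can list the user associations whose RawShares is "parent". This is a minimal sketch: parse_sshare and parent_users are hypothetical helpers. The sample is a trimmed, hand-built excerpt using values from the table above, not captured -P output.

```python
def parse_sshare(text: str) -> list[dict[str, str]]:
    """Parse pipe-delimited sshare output; the first non-empty line is the header."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    header = [name.strip() for name in lines[0].split("|")]
    # Strip padding from each field (sshare may indent account names).
    return [dict(zip(header, (field.strip() for field in line.split("|"))))
            for line in lines[1:]]

def parent_users(rows: list[dict[str, str]]) -> list[str]:
    """Users whose RawShares is 'parent', i.e. who compete as their account."""
    return [row["User"] for row in rows
            if row.get("User") and row.get("RawShares") == "parent"]

sample = """Account|User|RawShares|NormShares|RawUsage|EffectvUsage|FairShare
mic||120|0.991736|62159156|1.000000|
mic|acataldo|parent|0.991736|13066208|0.210193|0.983871
mic|achowdhury|parent|0.991736|0|0.000000|0.983871
"""

rows = parse_sshare(sample)
print(parent_users(rows))
# Distinct FairShare values among user rows: a single value despite very different usage.
print({row["FairShare"] for row in rows if row["User"]})
```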
On Aug 10, 2024, at 7:24 AM, Drucker, Daniel DDRUCKER@MCLEAN.HARVARD.EDU wrote:
According to https://docs.rc.fas.harvard.edu/kb/fairshare/ and https://slurm.schedmd.com/SUG14/fair_tree.pdf :
"The Fairshare score is calculated using the following formula: f = 2^(-EffectvUsage/NormShares)"
This is clearly not happening on my system:
Account User RawShares NormShares RawUsage NormUsage EffectvUsage FairShare LevelFS GrpTRESMins TRESRunMins
-------------------- ---------- ---------- ----------- ----------- ----------- ------------- ---------- ---------- ------------------------------ ------------------------------
...
mic acataldo parent 0.991736 13066208 0.210193 0.210193 0.983871 cpu=169648,mem=1389757781,ene+
mic achowdhury parent 0.991736 0 0.000000 0.000000 0.983871 cpu=0,mem=0,energy=0,node=0,b+
...
Every user has NormShares = 0.991736. acataldo has EffectvUsage = 0.210193; achowdhury has EffectvUsage = 0.
But both users have the same FairShare. According to the formula above, the values should be 0.863 and 1.0, respectively.
So what's going on?
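The expected values quoted in the message can be checked directly. This minimal sketch implements only the classic formula quoted above. Fair Tree, Slurm's default algorithm, derives FairShare from a ranking of associations instead, so the sketch is not expected to reproduce sshare output while Fair Tree is active. The function name classic_fairshare is hypothetical.

```python
def classic_fairshare(effectv_usage: float, norm_shares: float) -> float:
    """Classic (non-Fair-Tree) fairshare factor: 2^(-EffectvUsage / NormShares)."""
    return 2 ** (-effectv_usage / norm_shares)

norm_shares = 0.991736  # the NormShares every user shows in the table above

print(round(classic_fairshare(0.210193, norm_shares), 3))  # acataldo
print(round(classic_fairshare(0.0, norm_shares), 3))       # achowdhury
```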
On Aug 10, 2024, at 7:36 AM, Daniel M. Drucker ddrucker@mclean.harvard.edu wrote:
I just set PriorityFlags=NO_FAIR_TREE and this seems to have solved the problem!
On Aug 10, 2024, at 7:45 AM, Drucker, Daniel DDRUCKER@MCLEAN.HARVARD.EDU wrote:
Hmm, no. That solved the problem of everyone having the same FairShare, but even after restarting slurmd and doing reconfigure, if I submit a job as someone with a huge usage and someone with zero usage, they both end up with the same Priority.
On Aug 10, 2024, at 8:05 AM, Daniel M. Drucker ddrucker@mclean.harvard.edu wrote:
And now, a few hours later - with no changes made - everyone has the same fairshare?
$ sshare -l -a
Account User RawShares NormShares RawUsage NormUsage EffectvUsage FairShare GrpTRESMins TRESRunMins
-------------------- ---------- ---------- ----------- ----------- ----------- ------------- ---------- ------------------------------ ------------------------------
root 0.000000 63235972 0.000000 1.000000 cpu=188835,mem=1546941371,ene+
root root 1 0.008264 0 0.000000 0.000000 1.000000 cpu=0,mem=0,energy=0,node=0,b+
mic 120 0.991736 63235972 1.000000 1.000000 0.497120 cpu=188835,mem=1546941371,ene+
mic aamedina parent 0.991736 2351906 0.037193 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
mic aaruldass parent 0.991736 0 0.000000 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
mic acataldo parent 0.991736 14637614 0.231476 1.000000 0.497120 cpu=188031,mem=1540350361,ene+
mic achowdhury parent 0.991736 0 0.000000 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
mic ajajoo parent 0.991736 2053441 0.032473 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
mic ajanes parent 0.991736 0 0.000000 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
mic amandacao parent 0.991736 200 0.000003 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
mic aromer parent 0.991736 0 0.000000 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
mic aweerasek+ parent 0.991736 1048 0.000017 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
mic batwood parent 0.991736 0 0.000000 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
mic bleng parent 0.991736 3 0.000000 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
mic cdemirlek parent 0.991736 6110 0.000097 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
mic chun parent 0.991736 0 0.000000 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
I am so confused.
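The new identical values are consistent with the classic formula being applied with account-level inputs. Every user row shows NormShares 0.991736, which matches 120 of the 121 top-level shares (mic has 120, the root user 1). Every user row also shows EffectvUsage 1.000000, the account's own value. Both are consistent with each user inheriting the account's figures through RawShares=parent. A minimal check of that arithmetic, under those assumptions:

```python
root_user_shares, mic_shares = 1, 120

# NormShares for mic: its raw shares divided by the sum of its siblings' raw shares.
norm_shares = mic_shares / (root_user_shares + mic_shares)
print(round(norm_shares, 6))

# Assumed: with RawShares=parent, each user carries the account's EffectvUsage (1.0).
effectv_usage = 1.0
fairshare = 2 ** (-effectv_usage / norm_shares)
print(round(fairshare, 6))
```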
On Aug 10, 2024, at 8:11 AM, Drucker, Daniel DDRUCKER@MCLEAN.HARVARD.EDU wrote:
fairshare=parent makes the user association effectively compete at the account level, so this is behaving as intended: it ignores each user's own usage when comparing them with others inside the same account. That is not what you want. Give them all the same numeric value, not parent.
Fair Tree (the default) handles a single account just fine, but you do not want fairshare=parent there either.
Ryan
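Ryan's explanation can be illustrated with a deliberately simplified, single-account version of the Fair Tree comparison. Siblings are compared by LevelFS, their normalized shares divided by their normalized usage, and a higher LevelFS ranks higher. This sketch assumes one flat account, equal numeric shares, and that tied users share the best rank among them. It ignores usage decay, the rest of the tree, and Slurm's exact tie handling. level_fs and rank_factors are hypothetical names.

```python
import math

def level_fs(shares: dict[str, float], usage: dict[str, float]) -> dict[str, float]:
    """LevelFS = (shares / siblings' shares) / (usage / siblings' usage); inf if unused."""
    total_shares = sum(shares.values())
    total_usage = sum(usage.values())
    result = {}
    for user, user_shares in shares.items():
        norm_s = user_shares / total_shares
        norm_u = usage[user] / total_usage if total_usage else 0.0
        result[user] = norm_s / norm_u if norm_u else math.inf
    return result

def rank_factors(lfs: dict[str, float]) -> dict[str, float]:
    """Simplified ranking: (N - number of users with strictly higher LevelFS) / N."""
    n = len(lfs)
    return {user: (n - sum(1 for other in lfs.values() if other > value)) / n
            for user, value in lfs.items()}

equal_shares = {"heavy": 1, "idle": 1, "light": 1}
usage = {"heavy": 100.0, "idle": 0.0, "light": 50.0}

# Numeric shares: each user is judged on their own usage.
own = rank_factors(level_fs(equal_shares, usage))
print(own)  # idle -> 1.0, light -> 2/3, heavy -> 1/3

# Rough stand-in for RawShares=parent: every user is judged on the account's usage.
account_total = sum(usage.values())
shared = rank_factors(level_fs(equal_shares, {u: account_total for u in usage}))
print(shared)  # every user -> 1.0 (a three-way tie)
```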
On 8/10/24 08:05, Drucker, Daniel via slurm-users wrote:
...ok... sure.... I had no idea where the "parent" label came from. This makes perfect sense. It will default to "1", I think.
On Sat, Aug 10, 2024 at 12:24 PM Ryan Cox ryan_cox@byu.edu wrote:
We use the following relevant settings...
PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0
PriorityCalcPeriod=00:02:00
PriorityMaxAge=3-0
PriorityWeightAge=0
PriorityWeightFairshare=2000000
PriorityWeightJobSize=1
PriorityWeightPartition=200
PriorityWeightQOS=1000000
PriorityWeightTRES=CPU=1000,Mem=2000,GRES/gpu=3000
...however, that doesn't provide any information about the account organization and the RawShares assignments to the accounts (and fairshare in the gpu partitions is yet another rathole...).
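For reference, PriorityDecayHalfLife=7-0 means recorded usage loses half its weight every seven days. Below is a minimal sketch of that exponential decay. decayed_usage is a hypothetical name, the units are whatever sshare reports as RawUsage, and the sketch ignores the discrete steps in which Slurm actually applies the decay.

```python
def decayed_usage(raw_usage: float, age_days: float, half_life_days: float = 7.0) -> float:
    """Weight of past usage after exponential decay with the given half-life."""
    return raw_usage * 0.5 ** (age_days / half_life_days)

# Usage recorded 14 days ago (two half-lives) counts as a quarter of its original amount.
print(decayed_usage(1000.0, 14))
# Usage recorded today is not decayed at all.
print(decayed_usage(1000.0, 0))
```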
We do use the Fair Tree feature, which is required in our environment. It's what enables (I think) the proper division of shares among accounts and subaccounts. It's been years since I've looked at this, so YMMV...
We have "condo" accounts, and a non-condo account called "default". When an investigator or group buys equipment, we create a SLURM account and QoS for it. We actually set the TRES limits on the QoS to 1.25X the number of cores and GB of memory in the purchase, but assign a RawShares value based on the actual number of cores purchased divided by the total cores in the cluster, multiplied by 1000 (to give a meaningful integer for RawShares; maybe we should bump that to 10000). The condo QoS Priorities are set to 10000.
The "default" account is assigned a RawShares value based on the number of cores purchased by the university, and provides access to exploratory (no charge) and premium (more cores, higher Priority, but still a lot less than 10000). The default account is oversubscribed when comparing QoS TRES limits to RawShares, but that's OK, or... that's just the way it is. We want the condo accounts to have the most benefit from the FairShare mechanism.
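The condo sizing rule above works out as follows. This minimal sketch uses hypothetical numbers (a 64-core, 512 GB purchase in a 1280-core cluster). The 1.25 multiplier and the scale of 1000 come from the message; condo_allocation and the rounding to whole shares are assumptions.

```python
def condo_allocation(purchased_cores: int, cluster_cores: int,
                     purchased_mem_gb: int, scale: int = 1000) -> dict[str, int]:
    """RawShares proportional to purchased cores; QoS TRES limits at 1.25x the purchase."""
    # Fraction of the cluster purchased, scaled up to a usable integer share count.
    raw_shares = round(purchased_cores / cluster_cores * scale)
    return {
        "RawShares": raw_shares,
        "cpu_limit": int(purchased_cores * 1.25),
        "mem_limit_gb": int(purchased_mem_gb * 1.25),
    }

print(condo_allocation(64, 1280, 512))
```

At this scale, an 8-core purchase in the same cluster would come to 6.25 shares before rounding. That is why a larger multiplier such as 10000 would keep small condos distinguishable.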
So....
The root account has children. The root account does not have a RawShares assignment.
The default account is one child with the root account as parent.
The primary condo accounts are children of the root account. They have the RawShares set based on purchased cores.
Some of the primary condo accounts, where the equipment was purchased by multiple investigator groups, have child condo accounts and QoSes, but without their own RawShares assignments.
With the FairTree mechanism, this gives us...
FairShare between condos (and the default account)...
FairShare within sub-account condos, as part of the parent condo...
FairShare within the leaf condo among users.
One of us obviously needs to diagram this...
regards, s
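In place of the missing diagram, here is a small sketch of the hierarchy described above: root, then the default account and the condo accounts, then sub-condos, then users. For each association it computes its fraction of shares among its siblings (the per-level comparison Fair Tree makes) and the product of those fractions down the path from root, which roughly corresponds to what classic fairshare reports as NormShares. All names and share values are hypothetical, and sub-accounts without an explicit RawShares are given 1, which is assumed to be Slurm's default. This only covers the share side; Fair Tree also weighs usage at every level.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Assoc:
    """One node of a hypothetical account/user tree."""
    name: str
    raw_shares: int = 1  # assumed default for associations without RawShares
    children: list[Assoc] = field(default_factory=list)


def walk(node: Assoc, path_share: float = 1.0, depth: int = 0,
         out: list | None = None) -> list[tuple[int, str, float, float]]:
    """List (depth, name, share among siblings, product of shares from root)."""
    if out is None:
        out = []
    total = sum(child.raw_shares for child in node.children)
    for child in node.children:
        level_share = child.raw_shares / total
        out.append((depth, child.name, level_share, path_share * level_share))
        walk(child, path_share * level_share, depth + 1, out)
    return out


def example_tree() -> Assoc:
    """Hypothetical layout: default + two condos, one split into sub-condos."""
    return Assoc("root", children=[
        Assoc("default", 300, [Assoc("user1"), Assoc("user2")]),
        Assoc("condo_a", 48, [Assoc("alice"), Assoc("bob")]),
        Assoc("condo_b", 16, [
            Assoc("lab_x", children=[Assoc("carol"), Assoc("dave")]),
            Assoc("lab_y", children=[Assoc("erin")]),
        ]),
    ])


if __name__ == "__main__":
    for depth, name, level, path in walk(example_tree()):
        print(f"{'  ' * depth}{name:<10} level={level:.3f} overall={path:.4f}")
```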
On Sat, Aug 10, 2024 at 10:05 AM Drucker, Daniel <DDRUCKER@mclean.harvard.edu> wrote:
And now, a few hours later - with no changes made - everyone has the same fairshare?
$ sshare -l -a
Account User RawShares NormShares RawUsage NormUsage EffectvUsage FairShare GrpTRESMins TRESRunMins
root 0.000000 63235972 0.000000 1.000000 cpu=188835,mem=1546941371,ene+
root root 1 0.008264 0 0.000000 0.000000 1.000000 cpu=0,mem=0,energy=0,node=0,b+
mic 120 0.991736 63235972 1.000000 1.000000 0.497120 cpu=188835,mem=1546941371,ene+
mic aamedina parent 0.991736 2351906 0.037193 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
mic aaruldass parent 0.991736 0 0.000000 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
mic acataldo parent 0.991736 14637614 0.231476 1.000000 0.497120 cpu=188031,mem=1540350361,ene+
mic achowdhury parent 0.991736 0 0.000000 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
mic ajajoo parent 0.991736 2053441 0.032473 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
mic ajanes parent 0.991736 0 0.000000 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
mic amandacao parent 0.991736 200 0.000003 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
mic aromer parent 0.991736 0 0.000000 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
mic aweerasek+ parent 0.991736 1048 0.000017 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
mic batwood parent 0.991736 0 0.000000 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
mic bleng parent 0.991736 3 0.000000 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
mic cdemirlek parent 0.991736 6110 0.000097 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
mic chun parent 0.991736 0 0.000000 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
I am so confused.
On Aug 10, 2024, at 8:11 AM, Drucker, Daniel DDRUCKER@MCLEAN.HARVARD.EDU wrote:
Hmm, no. That solved the problem of everyone having the same FairShare, but even after restarting slurmd and doing reconfigure, if I submit a job as someone with a huge usage and someone with zero usage, they both end up with the same Priority.
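One way to see why the two jobs can still tie: under priority/multifactor, a job's priority is a weighted sum of factors that each lie between 0 and 1. The sketch below keeps only the terms whose weights are non-zero in the configuration from the original question (PriorityWeightAge=50000, PriorityWeightFairshare=100000, JobSize and QOS at 0) and leaves out the site, partition, TRES and nice terms. If every user in mic has FairShare=parent, both jobs receive the account's fairshare factor (0.497120 in the sshare output), so only age can separate them. The factor values and helper name are hypothetical, and Slurm stores the final priority as an integer while this sketch keeps the float.

```python
# Simplified multifactor priority using only the weights from the question.
# Factor values passed in are hypothetical; missing factors count as 0.
WEIGHTS = {"age": 50_000, "fairshare": 100_000, "jobsize": 0, "qos": 0}


def job_priority(factors: dict[str, float],
                 weights: dict[str, int] = WEIGHTS) -> float:
    """Weighted sum of normalized (0..1) priority factors."""
    return sum(weight * factors.get(name, 0.0) for name, weight in weights.items())


if __name__ == "__main__":
    mic_factor = 0.497120  # inherited by every FairShare=parent user in mic
    heavy_user_job = job_priority({"age": 0.0, "fairshare": mic_factor})
    idle_user_job = job_priority({"age": 0.0, "fairshare": mic_factor})
    print(heavy_user_job == idle_user_job)  # jobs submitted at the same moment
```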
On Aug 10, 2024, at 8:05 AM, Daniel M. Drucker <ddrucker@mclean.harvard.edu> wrote:
I just set PriorityFlags=NO_FAIR_TREE and this seems to have solved the problem!
On Aug 10, 2024, at 7:45 AM, Drucker, Daniel DDRUCKER@MCLEAN.HARVARD.EDU wrote:
According to https://docs.rc.fas.harvard.edu/kb/fairshare/ and https://slurm.schedmd.com/SUG14/fair_tree.pdf :
"The Fairshare score is calculated using the following formula: f = 2^(-EffectvUsage/NormShares)"
This is clearly not happening on my system:
Account User RawShares NormShares RawUsage NormUsage EffectvUsage FairShare LevelFS GrpTRESMins TRESRunMins
...
mic acataldo parent 0.991736 13066208 0.210193 0.210193 0.983871 cpu=169648,mem=1389757781,ene+
mic achowdhury parent 0.991736 0 0.000000 0.000000 0.983871 cpu=0,mem=0,energy=0,node=0,b+
...
Every user has NormShares = 0.991736. acataldo has EffectvUsage = 0.210193; achowdhury has EffectvUsage = 0.
But both users have the same FairShare. The correct values according to the above formula would be 0.863 and 1.0 respectively.
So what's going on?
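The arithmetic in this message can be reproduced directly from the quoted classic formula f = 2^(-EffectvUsage/NormShares). This only checks the numbers: it does not model Fair Tree, which, as I understand the Fair Tree slides cited above, derives the factor from a ranking of associations rather than from this formula, and it ignores what FairShare=parent does to each user's values.

```python
def classic_fairshare(effective_usage: float, norm_shares: float) -> float:
    """Classic fairshare factor: 2 ** (-EffectvUsage / NormShares)."""
    return 2.0 ** (-effective_usage / norm_shares)


if __name__ == "__main__":
    norm_shares = 0.991736  # every user's NormShares in the output above
    for user, eff_usage in (("acataldo", 0.210193), ("achowdhury", 0.0)):
        print(user, round(classic_fairshare(eff_usage, norm_shares), 3))
```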
On Aug 10, 2024, at 7:36 AM, Daniel M. Drucker <ddrucker@mclean.harvard.edu> wrote:
Here is what is confusing me, I guess. Look at the below. You can see that some people have no usage and some people have a lot of usage, but their FairShare values are all identical.
https://lists.schedmd.com/mailman3/hyperkitty/list/slurm-users@lists.schedmd... seems to say that fairshare=parent should work just fine, but what I am seeing is that it is NOT altering people's FairShare?
...and there's not actually one account in your setup, is there? There should at least be a "root" and a "mic" account, I think.
I don't recall whether you'd sent the output of "sshare | head -15"...
Yes, there is 'root' and 'mic', and everyone is under 'mic'.
No, I don't know any Steve.
So what you're saying is that I *must* explicitly assign a fairshare value at account-creation time? Would it be sufficient to just say, in my account creation script,
sacctmgr modify user $NEWUSERNAME set fairshare=1
?
I'm still struggling to understand why that is different from fairshare=parent, if everyone has the same value.
Daniel
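One way to see the difference: with a numeric fairshare, each user association carries its own shares and its own usage and is compared against its siblings; with parent, the user's own values are replaced by the account's, so every user under mic looks the same. The sketch below follows the Level Fairshare idea from the Fair Tree slides cited earlier (LevelFS = S / U, where S is a user's fraction of the siblings' shares and U its fraction of the siblings' usage) and ranks users by it. It is a simplification: it treats the four users below as the only siblings, assumes each has fairshare=1, copies their RawUsage from the sshare output above, and stops at the ranking rather than reproducing Slurm's final factor.

```python
import math


def level_fs(shares: dict[str, int], usage: dict[str, float]) -> dict[str, float]:
    """LevelFS = S / U per sibling; infinite when a user has no usage."""
    total_shares = sum(shares.values())
    total_usage = sum(usage.values())
    result = {}
    for user, share in shares.items():
        s = share / total_shares
        u = usage[user] / total_usage if total_usage else 0.0
        result[user] = s / u if u > 0 else math.inf
    return result


if __name__ == "__main__":
    # RawUsage copied from the sshare output; fairshare=1 for everyone is assumed.
    usage = {"aamedina": 2351906, "acataldo": 14637614,
             "achowdhury": 0, "ajajoo": 2053441}
    shares = {user: 1 for user in usage}
    ranked = sorted(level_fs(shares, usage).items(),
                    key=lambda item: item[1], reverse=True)
    for rank, (user, lfs) in enumerate(ranked, start=1):
        print(rank, user, lfs)
```

With equal numeric shares the light users rank ahead of the heavy ones; under parent all four would carry mic's single value, which is the tie seen above.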
***AHA*** I FOUND IT!
FairShare=parent (https://slurm.schedmd.com/classic_fair_share.html#parent)
It is possible to disable the fairshare at certain levels of the fair share hierarchy by using the FairShare=parent option of sacctmgr. For users and accounts with FairShare=parent the normalized shares and effective usage values from the parent in the hierarchy will be used when calculating fairshare priorities.
If all users in an account are configured with FairShare=parent the result is that all the jobs drawing from that account will get the same fairshare priority, based on the account's total usage. No additional fairness is added based on a user's individual usage.
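The documented behaviour can be sketched as a lookup that walks up the hierarchy: an association with FairShare=parent borrows the normalized shares and effective usage of the nearest ancestor that has its own values, and the classic formula is then applied to those borrowed numbers. Plugging in mic's values from the sshare output (NormShares 0.991736, EffectvUsage 1.0), every parent user lands on the same 0.497120 that was observed, whatever their own usage. The `Node` structure and function names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical association; norm_shares/eff_usage stay None for 'parent'."""
    name: str
    parent: Optional[Node]
    norm_shares: Optional[float] = None
    eff_usage: Optional[float] = None


def resolve(node: Node) -> tuple[float, float]:
    """Walk up the hierarchy until an association with its own values is found."""
    while node.norm_shares is None:
        if node.parent is None:
            raise ValueError(f"{node.name}: no ancestor with its own values")
        node = node.parent
    return node.norm_shares, node.eff_usage


def fairshare_factor(node: Node) -> float:
    """Classic formula 2 ** (-EffectvUsage / NormShares) on the resolved values."""
    norm_shares, eff_usage = resolve(node)
    return 2.0 ** (-eff_usage / norm_shares)


# mic's values come from the sshare -l -a output; its real parent is root,
# left as None here because mic has its own values and the walk stops there.
mic = Node("mic", parent=None, norm_shares=0.991736, eff_usage=1.0)
heavy = Node("acataldo", parent=mic)    # FairShare=parent, large RawUsage
idle = Node("achowdhury", parent=mic)   # FairShare=parent, zero RawUsage

if __name__ == "__main__":
    for user in (heavy, idle):
        print(user.name, round(fairshare_factor(user), 6))
```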
Doing this for all my slurm users appears to have, finally, fixed the problem!!
Is there any way to make everyone get a default numeric (say, 100) fairshare value instead of "parent", so I wouldn't have to explicitly add slurm users? I've always just let slurm automatically add users.
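Short of a site-wide default, the one-off change that fixed things can at least be scripted. The sketch below assumes the association list was dumped with `sacctmgr -nP show assoc format=Account,User,Fairshare` (no header, pipe-separated) and that users inheriting from their account show the literal string `parent` in that column; both assumptions, and the exact field names, are worth checking against `man sacctmgr` on 21.08. It only prints `sacctmgr -i modify user ...` commands for review instead of running them. The sample text, helper names and the value 100 are illustrative.

```python
import shlex

DEFAULT_SHARES = 100  # illustrative value from the question above


def parent_users(assoc_dump: str) -> list[tuple[str, str]]:
    """Return (account, user) pairs whose Fairshare column reads 'parent'."""
    pairs = []
    for line in assoc_dump.strip().splitlines():
        fields = line.split("|")
        if len(fields) < 3:
            continue  # skip malformed or truncated lines
        account, user, share = (f.strip() for f in fields[:3])
        if user and share.lower() == "parent":
            pairs.append((account, user))
    return pairs


def modify_commands(pairs: list[tuple[str, str]],
                    shares: int = DEFAULT_SHARES) -> list[str]:
    """Build one sacctmgr command per association, quoted for a POSIX shell."""
    return [
        "sacctmgr -i modify user where "
        f"name={shlex.quote(user)} account={shlex.quote(account)} "
        f"set fairshare={shares}"
        for account, user in pairs
    ]


if __name__ == "__main__":
    sample = """\
root||1
root|root|1
mic||120
mic|aamedina|parent
mic|acataldo|parent
"""
    for cmd in modify_commands(parent_users(sample)):
        print(cmd)
```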
On Aug 10, 2024, at 6:21 PM, Drucker, Daniel via slurm-users slurm-users@lists.schedmd.com wrote:
External Email - Use Caution
Yes, there is 'root' and 'mic', and everyone is under 'mic.
No, I don't know any Steve.
So what you're saying is I *must* at account-creation time explicitly assign a fairshare value? Would it be sufficient to just say, in my account creation script,
sacctmgr modify user $NEWUSERNAME set fairshare=1
?
I'm still struggling to understand why that is different from fairshare=parent, if everyone has the same value.
Daniel
On Aug 10, 2024, at 2:34 PM, Fulcomer, Samuel samuel_fulcomer@brown.edu wrote:
External Email - Use Caution
...and there's not actually one account in your setup, is there? There should at least be a "root" and a "mic" account, I think.
I don't recall whether you'd sent the output of "sshare | head -15"...
On Sat, Aug 10, 2024 at 2:30 PM Fulcomer, Samuel <samuel_fulcomer@brown.edumailto:samuel_fulcomer@brown.edu> wrote: We use the following relevant settings...
PriorityType=priority/multifactor PriorityDecayHalfLife=7-0 PriorityCalcPeriod=00:02:00 PriorityMaxAge=3-0 PriorityWeightAge=0 PriorityWeightFairshare=2000000 PriorityWeightJobSize=1 PriorityWeightPartition=200 PriorityWeightQOS=1000000 PriorityWeightTRES=CPU=1000,Mem=2000,GRES/gpu=3000
...however, that doesn't provide any information about the account organization. and RawShares assignments to the accounts (and fairshare in the gpu partitions is yet another rathole....).
We do use the Tree feature, which is required in our environment. It's what enables (I think) the proper division of share among accounts in subaccounts. It's been years since I've looked at this, so YMMV...
We have "condo" accounts, and a non-condo account called "default". When investigator or group buys equipment, we create a SLURM account and QoS for it. We actually set the Tres limits on the QoS to 1.25X the number of cores and GB memory of the purchase, but assign a RawShares value based on the actually number of cores purchase divided by the total cores in the cluster, ,multiplied by 1000 (to give a meaningful integer for RawShares - maybe we should bump that to 10000). The condo QoS Priorities are set to 10000.
The "default" account is assigned a RawShares value base on the number of cores purchased by the university, and provides access to exploratory (no charge) and premium (more cores, higher Priority, but still a lot less than 10000). The default account is oversubscribed when comparing QoS Tres limits to RawShares, but that's OK, or... that's just the way it is. We want the condo accounts to have the most benefit from the FairShare mechanism.
So....
The root account has children. The root account does not have a RawShares assignment.
The default account is one child with the root account as parent.
The primary condo accounts are children of the root account. They have the RawShares set based on purchased cores.
Some of the primary condo accounts whose equipment was purchased by multiple investigator groups have child condo accounts and QoSes, but without their own RawShares assignments.
With the FairTree mechanism, this gives us...
FairShare between condos (and the default account)...
FairShare within sub-account condos, as part of the parent condo...
FairShare within the leaf condo among users.
One of us obviously needs to diagram this...
regards, s
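The three levels listed above (between condos, within parent condos, among users) can be modelled as a depth-first walk of the account tree. The sketch below is a simplified model based on SchedMD's published description of Fair Tree, not Slurm's code. It assumes the following: at each level, LevelFS = S / U, where S is an association's RawShares and U is its usage, each normalized over its siblings; siblings are visited in descending LevelFS order; and users receive FairShare = rank / number_of_users. Tie handling and fairshare=parent are omitted. The class, function, and account names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Assoc:
    """An account or user association (hypothetical structure for this sketch)."""
    name: str
    raw_shares: float
    usage: float = 0.0  # used for users only; account usage is summed from children
    children: list[Assoc] = field(default_factory=list)


def total_usage(a: Assoc) -> float:
    """A user's own usage, or the summed usage of everything under an account."""
    if not a.children:
        return a.usage
    return sum(total_usage(c) for c in a.children)


def level_fs(a: Assoc, siblings: list[Assoc]) -> float:
    """LevelFS = S / U, both normalized over the siblings at this level."""
    shares = sum(s.raw_shares for s in siblings)
    usage = sum(total_usage(s) for s in siblings)
    s = a.raw_shares / shares if shares else 0.0
    u = total_usage(a) / usage if usage else 0.0
    return float("inf") if u == 0 else s / u  # unused associations sort first


def users_in_rank_order(a: Assoc) -> list[str]:
    """Depth-first walk that visits siblings in descending LevelFS order."""
    if not a.children:
        return [a.name]
    order = sorted(a.children, key=lambda c: level_fs(c, a.children), reverse=True)
    names: list[str] = []
    for child in order:
        names.extend(users_in_rank_order(child))
    return names


def fair_tree_factors(root: Assoc) -> dict[str, float]:
    """Highest-ranked user gets 1.0; the rest step down by 1/N (ties ignored)."""
    names = users_in_rank_order(root)
    n = len(names)
    return {name: (n - i) / n for i, name in enumerate(names)}


if __name__ == "__main__":
    # Hypothetical cluster: two condos plus the "default" account.
    root = Assoc("root", 0, children=[
        Assoc("condoA", 30, children=[Assoc("a1", 1, 50), Assoc("a2", 1, 0)]),
        Assoc("condoB", 70, children=[Assoc("b1", 1, 40)]),
        Assoc("default", 20, children=[Assoc("d1", 1, 10)]),
    ])
    print(fair_tree_factors(root))
```

Because the walk recurses, sub-account condos nested under a parent condo are ranked within that parent before any of their users are placed, matching the three levels described in the message.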
On Sat, Aug 10, 2024 at 10:05 AM Drucker, Daniel <DDRUCKER@mclean.harvard.edu> wrote: And now, a few hours later - with no changes made - everyone has the same fairshare?
$ sshare -l -a
Account User RawShares NormShares RawUsage NormUsage EffectvUsage FairShare GrpTRESMins TRESRunMins
-------------------- ---------- ---------- ----------- ----------- ----------- ------------- ---------- ------------------------------ ------------------------------
root 0.000000 63235972 0.000000 1.000000 cpu=188835,mem=1546941371,ene+
root root 1 0.008264 0 0.000000 0.000000 1.000000 cpu=0,mem=0,energy=0,node=0,b+
mic 120 0.991736 63235972 1.000000 1.000000 0.497120 cpu=188835,mem=1546941371,ene+
mic aamedina parent 0.991736 2351906 0.037193 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
mic aaruldass parent 0.991736 0 0.000000 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
mic acataldo parent 0.991736 14637614 0.231476 1.000000 0.497120 cpu=188031,mem=1540350361,ene+
mic achowdhury parent 0.991736 0 0.000000 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
mic ajajoo parent 0.991736 2053441 0.032473 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
mic ajanes parent 0.991736 0 0.000000 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
mic amandacao parent 0.991736 200 0.000003 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
mic aromer parent 0.991736 0 0.000000 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
mic aweerasek+ parent 0.991736 1048 0.000017 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
mic batwood parent 0.991736 0 0.000000 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
mic bleng parent 0.991736 3 0.000000 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
mic cdemirlek parent 0.991736 6110 0.000097 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
mic chun parent 0.991736 0 0.000000 1.000000 0.497120 cpu=0,mem=0,energy=0,node=0,b+
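Two columns in this output can be reproduced by hand. First, the RawShares of root (1) and mic (120) normalize among siblings to 1/121 and 120/121. Second, the classic formula quoted in the 7:45 AM message, f = 2^(-EffectvUsage/NormShares), applied to an EffectvUsage of 1.0 and a NormShares of 0.991736, gives 0.497120. That is the value every user shows, because every "parent" user reports the account's EffectvUsage of 1.000000. The sketch below only checks this arithmetic; the function names are hypothetical.

```python
from __future__ import annotations


def norm_shares(raw: dict[str, float]) -> dict[str, float]:
    """NormShares among siblings: each RawShares divided by the siblings' total."""
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


def classic_fairshare(effectv_usage: float, norm_share: float) -> float:
    """Classic (non-Fair-Tree) factor: 2 ** (-EffectvUsage / NormShares)."""
    return 2 ** (-effectv_usage / norm_share)


if __name__ == "__main__":
    shares = norm_shares({"root": 1, "mic": 120})  # siblings under the root account
    print({k: f"{v:.6f}" for k, v in shares.items()})
    # Every "parent" user above shows EffectvUsage 1.000000, the account's value.
    print(f"{classic_fairshare(1.0, shares['mic']):.6f}")
```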
I am so confused.
On Aug 10, 2024, at 8:11 AM, Drucker, Daniel <DDRUCKER@MCLEAN.HARVARD.EDU> wrote:
Hmm, no. That solved the problem of everyone having the same FairShare, but even after restarting slurmd and doing reconfigure, if I submit a job as someone with a huge usage and someone with zero usage, they both end up with the same Priority.
On Aug 10, 2024, at 8:05 AM, Daniel M. Drucker <ddrucker@mclean.harvard.edu> wrote:
I just set PriorityFlags=NO_FAIR_TREE and this seems to have solved the problem!
On Aug 10, 2024, at 7:45 AM, Drucker, Daniel <DDRUCKER@MCLEAN.HARVARD.EDU> wrote:
According to https://docs.rc.fas.harvard.edu/kb/fairshare/ and https://slurm.schedmd.com/SUG14/fair_tree.pdf :
"The Fairshare score is calculated using the following formula.
f = 2^(-EffectvUsage/NormShares)"
This is clearly not happening on my system:
Account User RawShares NormShares RawUsage NormUsage EffectvUsage FairShare LevelFS GrpTRESMins TRESRunMins
-------------------- ---------- ---------- ----------- ----------- ----------- ------------- ---------- ---------- ------------------------------ ------------------------------
...
mic acataldo parent 0.991736 13066208 0.210193 0.210193 0.983871 cpu=169648,mem=1389757781,ene+
mic achowdhury parent 0.991736 0 0.000000 0.000000 0.983871 cpu=0,mem=0,energy=0,node=0,b+
...
Every user has 0.991736 NormShares. acataldo has EffectvUsage = 0.210193; achowdhury has EffectvUsage = 0.
But both users have the same FairShare. According to the formula above, the correct values would be 0.863 and 1.0 respectively.
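Plugging the two users' numbers into the quoted formula confirms the expected values. This checks only the arithmetic: the output above was produced with Fair Tree enabled, and it evidently does not apply this formula to the user rows. The function name is hypothetical.

```python
def fairshare_factor(effectv_usage: float, norm_shares: float) -> float:
    """f = 2^(-EffectvUsage / NormShares), the formula quoted above."""
    return 2 ** (-effectv_usage / norm_shares)


if __name__ == "__main__":
    for user, usage in [("acataldo", 0.210193), ("achowdhury", 0.0)]:
        print(user, round(fairshare_factor(usage, 0.991736), 3))
```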
So what's going on?
On Aug 10, 2024, at 7:36 AM, Daniel M. Drucker <ddrucker@mclean.harvard.edumailto:ddrucker@mclean.harvard.edu> wrote:
Here is what is confusing me I guess. Look at the below. You can see that some people have no usage and some people have a lot of usage. But their FairShare value is all identical.
https://lists.schedmd.com/mailman3/hyperkitty/list/slurm-users@lists.schedmd... seems to say that fairshare=parent should work just fine, but what I am seeing is that it is NOT altering people's FairShare?
The information in this e-mail is intended only for the person to whom it is addressed. If you believe this e-mail was sent to you in error and the e-mail contains patient information, please contact the Mass General Brigham Compliance HelpLine at https://www.massgeneralbrigham.org/complianceline .
Please note that this e-mail is not secure (encrypted). If you do not wish to continue communication over unencrypted e-mail, please notify the sender of this message immediately. Continuing to send or respond to e-mail after receiving this message means you understand and accept this risk and wish to continue to communicate over unencrypted e-mail.
-- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-leave@lists.schedmd.com
On Aug 9, 2024, at 9:15 PM, Fulcomer, Samuel samuel_fulcomer@brown.edu wrote: ...and what are the top 10-15 lines in your sshare output?...
See the 4:10PM message in this thread.
I have never used Slurm without adding users explicitly first, so I am not sure what happens in that case. But from your sshare output it certainly seems to default to fairshare=parent.
Try modifying the users with
sacctmgr modify user $username fairshare=200
and then run sshare -a -A mic to see what has changed.
-- Paul Raines (http://help.nmr.mgh.harvard.edu)
On Fri, 9 Aug 2024 5:57pm, Drucker, Daniel wrote:
Hi Paul from over at mclean.harvard.edu!
I have never added any users using sacctmgr - I've always just had everyone I guess automatically join the default account, mic. Are you saying that is what is causing my problem?
I'm confused I guess because I would have expected that within an account - even if there is only one - users would get their 'fair share' of resources, rather than just defaulting to FIFO or something. But that doesn't seem to be the case.
I do not want any particular user to start out with more priority than any other particular user - I just want to make sure that if user A submits a million jobs at noon, and user B submits one job at 12:01, user B doesn't have to wait until those million jobs finish.
Daniel
On Aug 9, 2024, at 5:47 PM, Paul Raines raines@nmr.mgh.harvard.edu wrote:
This depends on how you have assigned fairshare in sacctmgr when creating the accounts and users. At our site we want fairshare only on accounts and not users, just like you are seeing, so we create accounts with
sacctmgr -i add account $acct Description="$descr" \
    fairshare=200 GrpJobsAccrue=8
and users with
sacctmgr -i add user "$u" account=$acct fairshare=parent
If you want users to have their own independent fairshare, you do not use fairshare=parent but assign a real number.
-- Paul Raines (http://help.nmr.mgh.harvard.edu)
On Fri, 9 Aug 2024 5:20pm, Drucker, Daniel via slurm-users wrote:
External Email - Use Caution
I got the opposite result. When I submitted a job as bsmith, they got a lower priority (the number was smaller) than the job submitted as csmith.
bsmith (who has never submitted a job before) got a priority of 98387 (which is 100000, the PriorityWeightFairshare value, times the 0.983871 FairShare), whereas csmith (who is already running a huge number of jobs and has been for days now) got a priority of 103749.
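Under the multifactor plugin, a job's priority is roughly the sum of each weight times its factor, with each factor between 0 and 1. With PriorityWeightFairshare=100000 from the original question, a FairShare of 0.983871 alone contributes 98387. The sketch below makes several assumptions: only the weights from the original question are non-zero, the partition, TRES, and nice terms are left out, and the result is truncated to an integer. The dictionary keys and function name are hypothetical.

```python
from __future__ import annotations

# Weights from the original question's settings (PriorityWeightAge=50000,
# PriorityWeightFairshare=100000, PriorityWeightJobSize=0, PriorityWeightQOS=0).
WEIGHTS = {"age": 50_000, "fairshare": 100_000, "jobsize": 0, "qos": 0}


def job_priority(factors: dict[str, float], weights: dict[str, int] = WEIGHTS) -> int:
    """Simplified multifactor priority: sum of weight * factor, truncated to an int."""
    total = sum(weights[name] * factors.get(name, 0.0) for name in weights)
    return int(total)


if __name__ == "__main__":
    print(job_priority({"fairshare": 0.983871}))              # fairshare term only
    print(job_priority({"fairshare": 0.983871, "age": 0.5}))  # plus a hypothetical age factor
```

If both users really have the same FairShare, then with these weights the only other non-zero term is age. Any gap between two users' jobs, such as csmith's 103749, would then come from how long each job has been pending, not from usage.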
On Aug 9, 2024, at 5:11 PM, Renfro, Michael Renfro@tntech.edu wrote:
The format looks a bit different from ours, since none of the entries in our RawShares column is ‘parent’.
But you can test this to be certain.
If your cluster already has jobs pending, have bsmith (who has zero usage) and csmith (who has a lot of usage, relatively) each submit several jobs into the pending queue. Alternatively, have bsmith and csmith submit jobs with larger resource requests: jobs that are large enough to automatically go into a pending state due to lack of resources. Those might be jobs that request the whole cluster, even.
bsmith’s jobs should get a higher priority as seen from sprio, and bsmith’s jobs should start earlier than csmith’s.
NormShares changes to '1' for any user I modify like that. Everyone else has 0.991736. The "FairShare" column does not change.
On Aug 9, 2024, at 6:35 PM, Paul Raines raines@nmr.mgh.harvard.edu wrote: