[slurm-users] [EXTERNAL] Re: Query about sbatch: Request X nodes in a range?

Em Dragowsky dragowsky at case.edu
Thu Jul 15 21:32:00 UTC 2021


Hi, Matt --

There is the 'exclude' flag, '-x' or '--exclude', which takes a list of
nodes to exclude from consideration. So it still selects from a range,
but by naming the nodes to leave out of that range. The nodes may be
specified on the command line, or listed in a file that is referenced.
For example,

#SBATCH --exclude ./exclude-avx.list
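
The flag also accepts an inline hostlist expression; the node names
below are made up for illustration:

#SBATCH --exclude=node[101-120],node205

The referenced file simply holds expressions of the same form, one per
line.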

We recommend this approach to our researchers for running Gaussian
compiled with support for AVX2 instructions, because we do not define
node features that identify nodes by instruction set.

Cheers,

~ Em

On Wed, Jul 14, 2021 at 12:13 PM Thompson, Matt (GSFC-610.1)[SCIENCE
SYSTEMS AND APPLICATIONS INC] <matthew.thompson at nasa.gov> wrote:

> Max,
>
> Well, that was just an example. I was also doing the same thing with,
> say, 125-node runs. Obviously, that's beyond a rack, and your chances
> of hitting a down node increase! I guess I figured that if I can
> under-specify a nodelist, maybe I could over-specify it!
>
> It's possible that using the topology bits could do it, but I'm also just
> a basic end-user with not too much knowledge of the system setup. This was
> the first time I even tried --nodelist (with help from the admins) so I was
> approaching it naively as you saw.
>
> Or, I suppose, is there a flag that one can pass to sbatch that gives the
> user a warning? That is:
>
>    Dear user, the allocation requested contains a node in a downed state.
>    This allocation will be PENDING for a while. You might want to rethink
> this.
>
> I guess SLURM knows all the downed nodes, so maybe? (But then again,
> maybe sbatch would get orders of magnitude slower if it had to query a
> database of all nodes and run checks to see this...)
>
> Matt
>
> PS: Or I guess I could stare at Ole's cool SLURM tools and figure out a
> way to have my own "job checker". Find the downed nodes, parse 'scontrol
> show job 1234', and display possible/impossible jobs :D
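>
> Something like this rough sketch might work (untested; "1234" stands in
> for a real job id, and the file names are arbitrary):
>
>     # hostnames of nodes currently down or drained, one per line
>     sinfo --states=down,drain --noheader -o "%N" \
>         | xargs -r -n1 scontrol show hostnames | sort -u > down-nodes.txt
>
>     # hostnames the job explicitly requested, skipping an empty ReqNodeList
>     scontrol show job 1234 | grep -o 'ReqNodeList=[^ ]*' | cut -d= -f2 \
>         | grep -v '(null)' | xargs -r -n1 scontrol show hostnames \
>         | sort -u > req-nodes.txt
>
>     # any line printed here is a requested node that is currently down
>     comm -12 down-nodes.txt req-nodes.txt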
>
> --
> Matt Thompson, SSAI, Ld Scientific Programmer/Analyst
> NASA GSFC,    Global Modeling and Assimilation Office
> Code 610.1,  8800 Greenbelt Rd,  Greenbelt,  MD 20771
> Phone: 301-614-6712                 Fax: 301-614-6246
> http://science.gsfc.nasa.gov/sed/bio/matthew.thompson
>
> On 7/14/21, 1:42 PM, "slurm-users on behalf of Max Voit" <
> slurm-users-bounces at lists.schedmd.com on behalf of
> max.voit_mlsl at with-eyes.net> wrote:
>
>     On Wed, 14 Jul 2021 17:04:45 +0000
>     "Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC]"
>     <matthew.thompson at nasa.gov> wrote:
>
>     > Namely, I needed say, 20 nodes on a cluster on the same rack
>     > ...
>     > So, my question is, is there a way to say, "Please give me X nodes
>     > inside this specific range of nodes?"
>
>     Is the requirement actually the nodes being in the same rack, or rather
>     being connected to the same switch? For the latter: if you specify a
>     topology.conf file you can use --switches=... , cf.
>
> https://slurm.schedmd.com/topology.html
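>
>     For instance, a minimal sketch (switch and node names invented):
>
>         # topology.conf
>         SwitchName=leaf1 Nodes=node[001-018]
>         SwitchName=leaf2 Nodes=node[019-036]
>         SwitchName=spine Switches=leaf[1-2]
>
>     and then e.g. "sbatch --switches=1 --nodes=16 job.sh" asks for 16
>     nodes that all hang off a single leaf switch.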
>
>     Best,
>     Max
>
>
>

-- 
E.M. (Em) Dragowsky, Ph.D.
Research Computing -- UTech
Case Western Reserve University
(216) 368-0082