[slurm-users] [EXTERNAL] Re: Query about sbatch: Request X nodes in a range?
Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC]
matthew.thompson at nasa.gov
Wed Jul 14 19:10:53 UTC 2021
Well, that was just an example. I was also doing the same thing with, say, 125-node runs. Obviously, that's beyond a rack, and the chances of hitting a down node increase! I guess I figured if I can under-specify a nodelist, maybe I could over-specify it!
It's possible that using the topology bits could do it, but I'm also just a basic end-user without much knowledge of the system setup. This was the first time I even tried --nodelist (with help from the admins), so I was approaching it naively, as you saw.
Or, I suppose, is there a flag that one can pass to sbatch that gives the user a warning? That is:
Dear user, the allocation requested contains a node in a downed state.
This allocation will be PENDING for a while. You might want to rethink this.
I guess SLURM knows all the downed nodes, so maybe? (But then again, maybe sbatch would get orders of magnitude slower if it had to query the state of all nodes and run checks to catch this...)
PS: Or I guess I could stare at Ole's cool SLURM tools and figure out a way to have my own "job checker". Find the downed nodes, parse 'scontrol show job 1234', and display possible/impossible jobs :D
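For what it's worth, a "job checker" along those lines could be sketched roughly like this. Everything here is illustrative: the node names are made up, the hostlist expansion handles only simple bracketed ranges (real Slurm hostlists can be more elaborate), and in practice the two sample strings would come from `sinfo -t down -h -o %N` and the ReqNodeList field of `scontrol show job 1234`:

```python
import re

def expand_hostlist(hostlist):
    """Expand e.g. 'borg[001-003,010]' into ['borg001', 'borg002', ...].

    Simplified: handles one bracketed group of comma-separated
    numbers/ranges, not the full Slurm hostlist grammar.
    """
    m = re.fullmatch(r"([^\[]+)\[([^\]]+)\]", hostlist)
    if not m:
        return [hostlist]
    prefix, body = m.groups()
    nodes = []
    for part in body.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            width = len(lo)  # preserve zero-padding, e.g. 001
            nodes.extend(f"{prefix}{i:0{width}d}"
                         for i in range(int(lo), int(hi) + 1))
        else:
            nodes.append(prefix + part)
    return nodes

def check_job(requested, down):
    """Return the requested nodes that are currently down."""
    return sorted(set(expand_hostlist(requested)) & set(expand_hostlist(down)))

# Sample data standing in for real sinfo/scontrol output:
down_nodes = "borg[003,017]"       # e.g. from: sinfo -t down -h -o %N
req_nodes  = "borg[001-020]"       # e.g. ReqNodeList from: scontrol show job 1234

bad = check_job(req_nodes, down_nodes)
if bad:
    print(f"Warning: requested nodes currently down: {bad}")
```

A real version would probably shell out to `scontrol show hostnames` to do the expansion properly instead of reimplementing it.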
Matt Thompson, SSAI, Ld Scientific Programmer/Analyst
NASA GSFC, Global Modeling and Assimilation Office
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
Phone: 301-614-6712 Fax: 301-614-6246
On 7/14/21, 1:42 PM, "slurm-users on behalf of Max Voit" <slurm-users-bounces at lists.schedmd.com on behalf of max.voit_mlsl at with-eyes.net> wrote:
On Wed, 14 Jul 2021 17:04:45 +0000
"Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC]"
<matthew.thompson at nasa.gov> wrote:
> Namely, I needed say, 20 nodes on a cluster on the same rack
> So, my question is, is there a way to say, "Please give me X nodes
> inside this specific range of nodes?"
Is the requirement actually the nodes being in the same rack, or rather
being connected to the same switch? For the latter: if you specify a
topology.conf file you can use --switches=..., cf.
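A rough sketch of that topology approach, with hypothetical switch and node names (topology.conf is admin-side configuration):

```
# topology.conf (hypothetical names, one leaf switch per rack)
SwitchName=rack1 Nodes=borg[001-032]
SwitchName=rack2 Nodes=borg[033-064]
SwitchName=core  Switches=rack[1-2]
```

With that in place, a user could ask for nodes confined to a single leaf switch, optionally with a maximum time to wait for such a placement:

```
sbatch --nodes=20 --switches=1@30:00 job.sh
```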