[slurm-users] Can sinfo/scontrol be called from job_submit.lua?
rug262 at psu.edu
Wed Oct 12 13:40:36 UTC 2022
Well, there are numerous ways to do it, but I was trying to do it as much as possible from within the slurm infrastructure.
Basically, I want to react when someone submits a job requesting specific features that aren't actively available yet, and some of the actions I need to take will involve slurm commands. This seems a bit like the cloud scheduling interface, but it's not a cloud service I'm talking about...it's our own hardware.
Otherwise, I would think that gathering information to make a decision while in the job_submit.lua would be a normal expectation. Is there really no way to know how many nodes are up or what features are on the system while I'm processing in the job submit? sacctmgr seems to work fine in there.
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Thomas M. Payerle <payerle at umd.edu>
Sent: Tuesday, October 11, 2022 5:31 PM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Can sinfo/scontrol be called from job_submit.lua?
Running scontrol/sinfo from within a job_submit.lua script seems to be opening a big can of worms --- it might be doable, but it would scare me. Since it sounds like you are only doing such for a fairly limited amount of information which presumably does not change frequently, perhaps it would be better to have a cron job periodically output the desired information to a file, and have the job_submit.lua read the information from the file?
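A minimal sketch of the cron-side half of that suggestion, assuming the feature list is gathered with `sinfo -h -o "%f"` and cached one feature per line. The cache path and the function names are hypothetical, not something from this thread. The job_submit.lua side would then only need to read the file, with no call back into slurmctld.

```python
#!/usr/bin/env python3
"""Cron-side helper: cache the cluster's available node features in a file."""
import os
import subprocess
import tempfile

# Hypothetical cache location; pick any path the slurmctld user can read.
CACHE_PATH = "/var/spool/slurm/available_features.txt"


def parse_features(sinfo_output):
    """Return sorted, de-duplicated features from `sinfo -h -o "%f"` output."""
    features = set()
    for line in sinfo_output.splitlines():
        for name in line.strip().split(","):
            # sinfo prints "(null)" for nodes that define no features.
            if name and name != "(null)":
                features.add(name)
    return sorted(features)


def write_atomically(path, lines):
    """Write to a temp file, then rename, so readers never see a partial file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write("".join(f"{line}\n" for line in lines))
    # mkstemp creates the file as 0600; loosen it so slurmctld can read it.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)


def refresh_cache(path=CACHE_PATH):
    """Query sinfo and rewrite the cache; meant to be called from cron."""
    # Runs outside slurmctld, so a slow reply cannot stall job submission.
    result = subprocess.run(
        ["sinfo", "-h", "-o", "%f"],
        check=True, capture_output=True, text=True, timeout=30,
    )
    write_atomically(path, parse_features(result.stdout))
```

A cron entry would call `refresh_cache()` every few minutes. The rename step matters because job_submit.lua may read the file at any moment, and the cached list can be stale by up to one cron interval.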
On Tue, Oct 11, 2022 at 5:17 PM Groner, Rob <rug262 at psu.edu> wrote:
I am testing a method where, when a job is submitted requesting specific features and those features don't exist yet, I take some action.
The job_submit.lua plugin successfully detects when a job is submitted asking for the specific features. I'm now at the point of checking whether those features already exist (the features are part of a nodeset and part of a partition, so jobs submitted asking for them will just go to pending if no nodes offer them). I thought to use "sinfo" to get a list of existing features on the system, but it fails to run. The same happens when I try to use scontrol.
When I submit a job that requests the features, and so the sinfo command runs, it all hangs for about 10 seconds and then says:
[me at testsch (RC) slurm] sbatch ./gctest_account_test.sh
sbatch: error: Batch job submission failed: Socket timed out on send/recv operation
In the slurmctld.log, I see:
[2022-10-10T17:12:13.933] error: slurm_msg_sendto: address:port=10.6.88.99:40100 msg_type=4004: Unexpected missing socket error
I'll note that "sinfo -V" works...but I suspect it's because it's not trying to communicate outside of itself with the slurmctld.
Any suggestions on what to try? Or is there a better slurm-ic way to do what I'm trying to do?
DIT-ACIGS/Mid-Atlantic Crossroads payerle at umd.edu
5825 University Research Park (301) 405-6135
University of Maryland
College Park, MD 20740-3831