I like it. However, it was working before without a slurm.conf in /etc/slurm.

Plus, the environment variable SLURM_CONF points to the correct slurm.conf file (the one in /cm/...). Wouldn't Slurm pick that one up?

Thanks!

Jeff


On Fri, Apr 19, 2024 at 1:11 PM Brian Andrus via slurm-users <slurm-users@lists.schedmd.com> wrote:

This is because you have no slurm.conf in /etc/slurm, so it is trying 'configless' mode, which queries DNS to find out where to get the config. It is failing because you do not have DNS configured to tell nodes where to ask about the config.

Simple solution: put a copy of slurm.conf in /etc/slurm/ on the node(s).
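
As far as I understand the configless documentation, slurmd picks its configuration source in a fixed order: the --conf-server option, then -f, then SLURM_CONF, then the default slurm.conf, and only then DNS SRV records. Below is a rough shell sketch of the last four steps (--conf-server is left out for brevity). The function name and its arguments are made up for illustration, and the default path is an assumption since it depends on how Slurm was built. Note that SLURM_CONF only counts if it is set in slurmd's own environment; a variable exported in a login shell does not reach a systemd service unless the unit sets it.

```shell
#!/bin/sh
# Sketch of the lookup order for slurmd's configuration source.
# Illustrative only: this mimics the order, it is not slurmd's own code.
#   $1 = default config path (assumed /etc/slurm/slurm.conf; build-dependent)
#   $2 = value passed to slurmd's -f option, if any
slurm_conf_source() {
    default_conf="${1:-/etc/slurm/slurm.conf}"
    dash_f="${2:-}"
    if [ -n "$dash_f" ]; then
        echo "file:$dash_f"          # -f takes priority over the environment
    elif [ -n "${SLURM_CONF:-}" ]; then
        echo "file:$SLURM_CONF"      # only if set in slurmd's own environment
    elif [ -r "$default_conf" ]; then
        echo "file:$default_conf"    # default location, if readable
    else
        echo "dns-srv"               # configless: look up _slurmctld._tcp SRV
    fi
}

slurm_conf_source
```

On a node, `systemctl show slurmd.service -p Environment -p EnvironmentFiles` shows what the unit actually passes to slurmd, and `dig +short -t SRV _slurmctld._tcp.<your domain>` shows whether a configless SRV record exists.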

Brian Andrus

On 4/19/2024 9:56 AM, Jeffrey Layton via slurm-users wrote:
Good afternoon,

I'm working on a cluster of NVIDIA DGX A100s that is using BCM 10 (Base Command Manager, which is based on Bright Cluster Manager). I ran into an error and only just learned that Slurm and Weka don't get along (presumably because Weka pins its client threads to cores). I read through their documentation here: https://docs.weka.io/best-practice-guides/weka-and-slurm-integration#heading-h.4d34og8

I thought I set everything correctly, but when I try to restart the Slurm server I get the following:

Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: fetch_config: DNS SRV lookup failed
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: _establish_configuration: failed to load configs
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: slurmd initialization failed
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: fetch_config: DNS SRV lookup failed
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: _establish_configuration: failed to load configs
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: slurmd initialization failed
Apr 19 05:29:39 bcm10-headnode systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
Apr 19 05:29:39 bcm10-headnode systemd[1]: slurmd.service: Failed with result 'exit-code'.

Has anyone encountered this?

I read this is usually associated with configless Slurm, but I don't know how Slurm is built in BCM. slurm.conf is located in /cm/shared/apps/slurm/var/etc/slurm and this is what I edited. The environment variables for Slurm are set correctly so it points to this slurm.conf file.

One thing that I did not do was tell Slurm which cores Weka was using. I can't seem to figure out the syntax for this. Can someone share the changes they made to slurm.conf?
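
For reference, a hedged sketch of what such a change might look like. slurm.conf has a per-node CpuSpecList parameter (CPU IDs reserved for system use) and a TaskPluginParam=SlurmdOffSpec setting (keeps slurmd and slurmstepd off the reserved CPUs). The helper below is hypothetical, and the node name and CPU IDs are placeholders, not values from this cluster. Two caveats: CpuSpecList expects Slurm abstract CPU IDs, which may not match the OS numbering Weka reports, and it only takes effect with the task/cgroup plugin and ConstrainCores=yes in cgroup.conf.

```shell
#!/bin/sh
# Hypothetical helper: print slurm.conf settings that reserve CPUs for Weka.
#   $1 = node name, $2 = comma-separated CPU IDs (both placeholders here)
weka_cpu_spec() {
    node="$1"
    cpus="$2"
    # Global settings: cgroup confinement, and keep slurmd off the reserved CPUs.
    printf '%s\n' 'TaskPlugin=task/affinity,task/cgroup'
    printf '%s\n' 'TaskPluginParam=SlurmdOffSpec'
    # Per-node setting: merge CpuSpecList into the node's existing NodeName line
    # rather than adding a second definition for the same node.
    printf 'NodeName=%s CpuSpecList=%s\n' "$node" "$cpus"
}

weka_cpu_spec dgx01 0,1,2,3
```

The printed lines are meant to be merged by hand into the existing slurm.conf, followed by a restart of slurmctld and slurmd.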

Thanks!

Jeff



    
