I would double-check where you are setting SLURM_CONF then. It is acting as if it is not set (typo maybe?)
It should be in /etc/default/slurmd (but could be /etc/sysconfig/slurmd).
Also check what the final, actual command being run to start it is. If anyone has changed the .service file or added an override file, that will affect things.
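A minimal sketch of that check in Python, assuming slurmd's unit reads its environment from /etc/default/slurmd (Debian-style) or /etc/sysconfig/slurmd (RHEL-style) and that SLURM_CONF is set there with a plain KEY=value line. The function names are hypothetical, and this only inspects the files; it does not tell you what the running service actually received.

```python
from pathlib import Path

# Candidate environment files (assumption: which one is used depends on the distro
# and on the EnvironmentFile= lines in the slurmd unit).
ENV_FILES = ("/etc/default/slurmd", "/etc/sysconfig/slurmd")


def find_slurm_conf(env_file):
    """Return the SLURM_CONF value set in env_file, or None if it is not set there."""
    path = Path(env_file)
    if not path.is_file():
        return None
    value = None
    for raw in path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, val = line.partition("=")
        if sep and key.strip() == "SLURM_CONF":
            value = val.strip().strip("'\"")  # last assignment wins
    return value


def report(env_files=ENV_FILES):
    """Describe, per file, whether SLURM_CONF is set and points at an existing file."""
    lines = []
    for env_file in env_files:
        conf = find_slurm_conf(env_file)
        if conf is None:
            lines.append(f"{env_file}: SLURM_CONF not set")
        elif Path(conf).is_file():
            lines.append(f"{env_file}: SLURM_CONF={conf} (file exists)")
        else:
            lines.append(f"{env_file}: SLURM_CONF={conf} (file MISSING)")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

A misspelled variable name (for example SLRUM_CONF) shows up here as "not set", which is the typo case mentioned above. To see the unit file together with any override files, `systemctl cat slurmd` prints all of them in one place.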
Brian Andrus
On 4/19/2024 10:15 AM, Jeffrey Layton wrote:
I like it; however, it was working before without a slurm.conf in /etc/slurm.
Plus the environment variable SLURM_CONF is pointing to the correct slurm.conf file (the one in /cm/...). Wouldn't Slurm pick up that one?
Thanks!
Jeff
On Fri, Apr 19, 2024 at 1:11 PM Brian Andrus via slurm-users <slurm-users@lists.schedmd.com> wrote:
This is because you have no slurm.conf in /etc/slurm, so it is trying 'configless', which queries DNS to find out where to get the config. It is failing because you do not have DNS configured to tell nodes where to ask about the config.
Simple solution: put a copy of slurm.conf in /etc/slurm/ on the node(s).
Brian Andrus
On 4/19/2024 9:56 AM, Jeffrey Layton via slurm-users wrote:
Good afternoon,
I'm working on a cluster of NVIDIA DGX A100s that is using BCM 10 (Base Command Manager, which is based on Bright Cluster Manager). I ran into an error and only just learned that Slurm and Weka don't get along (presumably because Weka pins its client threads to cores). I read through their documentation here:
https://docs.weka.io/best-practice-guides/weka-and-slurm-integration#heading-h.4d34og8
I thought I set everything correctly, but when I try to restart the slurm server I get the following:
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: fetch_config: DNS SRV lookup failed
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: _establish_configuration: failed to load configs
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: slurmd initialization failed
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: fetch_config: DNS SRV lookup failed
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: _establish_configuration: failed to load configs
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: slurmd initialization failed
Apr 19 05:29:39 bcm10-headnode systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
Apr 19 05:29:39 bcm10-headnode systemd[1]: slurmd.service: Failed with result 'exit-code'.
Has anyone encountered this? I read this is usually associated with configless Slurm, but I don't know how Slurm is built in BCM. slurm.conf is located in /cm/shared/apps/slurm/var/etc/slurm and this is what I edited. The environment variables for Slurm are set correctly so it points to this slurm.conf file.
One thing that I did not do was tell Slurm which cores Weka was using. I can't seem to figure out the syntax for this. Can someone share the changes they made to slurm.conf?
Thanks!
Jeff