Good afternoon,
I thought I had set everything up correctly, but when I try to restart the Slurm server I get the following:
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: fetch_config: DNS SRV lookup failed
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: _establish_configuration: failed to load configs
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: slurmd initialization failed
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: fetch_config: DNS SRV lookup failed
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: _establish_configuration: failed to load configs
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: slurmd initialization failed
Apr 19 05:29:39 bcm10-headnode systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
Apr 19 05:29:39 bcm10-headnode systemd[1]: slurmd.service: Failed with result 'exit-code'.
Has anyone encountered this?
I read that this is usually associated with configless Slurm, but I don't know how Slurm is built in BCM. slurm.conf is located in /cm/shared/apps/slurm/var/etc/slurm, and that is the file I edited. The environment variables for Slurm are set correctly, so they point to this slurm.conf file.
One thing I did not do was tell Slurm which cores Weka is using. I can't seem to figure out the syntax for this. Can someone share the changes they made to slurm.conf?
Thanks!
Jeff