[slurm-users] Re: Integrating Slurm with WekaIO

19 Apr 2024


Simple solution: put a copy of slurm.conf in /etc/slurm/ on the node(s).
With Bright, slurm.conf lives in /cm/shared/apps/slurm/var/etc/slurm, and
that path is available on all nodes as well. Make sure that $SLURM_CONF on
the compute nodes resolves to the correct path.
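
A quick way to check this on a node is a small script like the one below.
It is only a sketch: the function name find_slurm_conf is made up for
illustration, the two paths are the ones mentioned in this thread, and
/etc/slurm/slurm.conf is assumed to be the default location (the real
default is set at build time and may differ on your system). It does not
reproduce slurmd's own lookup logic; it only reports whether $SLURM_CONF
or one of the candidate paths points at a readable file.

```python
import os
from pathlib import Path

# Paths taken from this thread; both are assumptions about your site.
DEFAULT_CONF = Path("/etc/slurm/slurm.conf")
BRIGHT_CONF = Path("/cm/shared/apps/slurm/var/etc/slurm/slurm.conf")


def find_slurm_conf(env=None, candidates=(DEFAULT_CONF, BRIGHT_CONF)):
    """Return (source, path) for the first readable slurm.conf, else None.

    $SLURM_CONF is checked first, then each candidate path in order.
    """
    env = os.environ if env is None else env
    checks = []
    if env.get("SLURM_CONF"):
        checks.append(("SLURM_CONF", Path(env["SLURM_CONF"])))
    checks.extend(("candidate", Path(p)) for p in candidates)

    for source, path in checks:
        # The file must exist and be readable by the current user.
        if path.is_file() and os.access(path, os.R_OK):
            return source, path
    return None


if __name__ == "__main__":
    found = find_slurm_conf()
    if found is None:
        print("No readable slurm.conf found; check $SLURM_CONF and /etc/slurm/")
    else:
        source, path = found
        print(f"slurm.conf found via {source}: {path}")
```

If this reports nothing on the node where slurmd fails, that would be
consistent with the DNS SRV errors quoted below, since slurmd attempts a
configless lookup when it has no local configuration file.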
On 4/19/2024 9:56 AM, Jeffrey Layton via slurm-users wrote:
Good afternoon,
I'm working on a cluster of NVIDIA DGX A100s that is using BCM 10 (Base
Command Manager, which is based on Bright Cluster Manager). I ran into an
error and only just learned that Slurm and Weka don't get along (presumably
because Weka pins their client threads to cores). I read through their
documentation here:
https://docs.weka.io/best-practice-guides/weka-and-slurm-integration#heading...
I thought I had set everything correctly, but when I try to restart the
Slurm server I get the following:
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: fetch_config: DNS SRV lookup failed
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: _establish_configuration: failed to load configs
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: slurmd initialization failed
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: fetch_config: DNS SRV lookup failed
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: _establish_configuration: failed to load configs
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: slurmd initialization failed
Apr 19 05:29:39 bcm10-headnode systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
Apr 19 05:29:39 bcm10-headnode systemd[1]: slurmd.service: Failed with result 'exit-code'.
Has anyone encountered this?
I read this is usually associated with configless Slurm, but I don't know
how Slurm is built in BCM. slurm.conf is located in
/cm/shared/apps/slurm/var/etc/slurm and this is what I edited. The
environment variables for Slurm are set correctly, so they point to this
slurm.conf file.
One thing that I did not do was tell Slurm which cores Weka was using. I
can't seem to figure out the syntax for this. Can someone share the changes
they made to slurm.conf?
Thanks!
Jeff
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-leave@lists.schedmd.com
