[slurm-users] Migration of slurm communication network / Steps / how to
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Mon Apr 24 08:00:53 UTC 2023
On 4/24/23 08:56, Purvesh Parmar wrote:
> Thank you.. will try this and get back. Any other step being missed here
> for migration?
I don't know if any steps are missing, because I never tried moving a
cluster like you want to do.
/Ole
> On Mon, 24 Apr 2023 at 12:08, Ole Holm Nielsen
> <Ole.H.Nielsen at fysik.dtu.dk> wrote:
>
> On 4/24/23 08:09, Purvesh Parmar wrote:
> > Thank you. However, because this is a change of data center, and the
> > server hostnames and FQDNs contain the data center name, I have to
> > change both the hostnames and the IP addresses to the names assigned
> > for the new DC.
>
> Could your data center be persuaded to introduce DNS CNAME aliases for
> the old names to point to the new DC names?
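Whether the old names still resolve in the new network can be checked before any daemon is started. A minimal sketch, assuming `getent` is available on the hosts; `check_names` and the node names in the usage comment are hypothetical:

```shell
#!/bin/sh
# check_names is a hypothetical helper, not part of Slurm.
# It prints every name that the system resolver cannot resolve.
check_names() {
    for name in "$@"; do
        # getent consults /etc/hosts and DNS as configured in nsswitch.conf
        getent hosts "$name" >/dev/null 2>&1 || printf '%s\n' "$name"
    done
}

# Example (placeholder names): an empty result means all names resolve.
#   check_names old-ctld old-node1 old-node2
```

Running it on the slurmctld host and on one compute node should be enough to catch a missing alias.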
>
> If you're forced to use new DNS names only, then it's simple to change
> the DNS names of compute nodes and partitions in slurm.conf:
>
> NodeName=...
> PartitionName=xxx Nodes=...
>
> as well as the slurmdb server name:
>
> AccountingStorageHost=...
>
> What I have never tried before is to change the DNS name of the slurmctld
> host:
>
> ControlMachine=...
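The single-valued settings above can be rewritten with a small script on a copy of slurm.conf. This is only a sketch: `set_conf_value`, the file paths and the host names are hypothetical, and it assumes each key sits alone on its own `Key=value` line. NodeName and PartitionName lines carry several parameters and are probably better edited by hand. If your slurm.conf uses SlurmctldHost rather than ControlMachine, pass that key instead.

```shell
#!/bin/sh
# set_conf_value is a hypothetical helper, not part of Slurm.
# It rewrites a single-valued "Key=value" line in a slurm.conf-style file.
set_conf_value() {
    file=$1
    key=$2
    value=$3
    tmp=$(mktemp) || return 1
    # Replace the whole line that starts with "Key="; other lines pass through.
    sed "s|^$key=.*|$key=$value|" "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Write back through cat so the file keeps its owner and permissions.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example on a copy of the config (path and host names are placeholders):
#   cp /etc/slurm/slurm.conf /tmp/slurm.conf.new
#   set_conf_value /tmp/slurm.conf.new ControlMachine new-ctld
#   set_conf_value /tmp/slurm.conf.new AccountingStorageHost new-dbd
```

Working on a copy lets you diff the result against the running configuration before distributing it to the nodes.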
>
> The critical aspect here is that you need to stop all batch jobs, plus
> slurmdbd and slurmctld. Then you can back up (tar-ball) and transfer the
> Slurm state directories:
>
> StateSaveLocation=/var/spool/slurmctld
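The tar-ball step could look roughly like this. It is a sketch under assumptions: the daemons are managed by systemd under the unit names `slurmctld` and `slurmdbd`, `backup_state_dir` is a hypothetical helper, and the archive path is a placeholder.

```shell
#!/bin/sh
# backup_state_dir is a hypothetical helper, not a Slurm command.
# It archives a state directory so it can be copied to the new host.
backup_state_dir() {
    state_dir=$1      # e.g. the StateSaveLocation, /var/spool/slurmctld
    archive=$2        # destination tar-ball
    # -C changes into the parent so the archive holds a relative path
    tar -czf "$archive" -C "$(dirname "$state_dir")" "$(basename "$state_dir")"
}

# On the old slurmctld host, once no jobs are running:
#   systemctl stop slurmctld slurmdbd
#   backup_state_dir /var/spool/slurmctld /root/slurmctld-state.tar.gz
# On the new host, before starting slurmctld:
#   tar -xzf /root/slurmctld-state.tar.gz -C /var/spool
```

Because the archive stores a relative path, it can be unpacked under a scratch directory first for a test migration.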
>
> However, I don't know if the name of the ControlMachine is hard-coded in
> the StateSaveLocation files?
>
> I strongly suggest that you try to make a test migration of the
> cluster to the new DC to find out if it works or not. Then you can
> always make multiple attempts without breaking anything.
>
> Best regards,
> Ole
>
>
> > On Mon, 24 Apr 2023 at 11:25, Ole Holm Nielsen
> > <Ole.H.Nielsen at fysik.dtu.dk> wrote:
> >
> > On 4/24/23 06:58, Purvesh Parmar wrote:
> > > Thank you, but it's a change of hostnames as well as IP addresses,
> > > for the Slurm server, the database server and the slurmd compute
> > > nodes.
> >
> > I suggest that you talk to your networking people and request that
> > the old DNS names be created in the new network's DNS for your Slurm
> > cluster. Then Ryan's solution will work. Changing DNS names is a very
> > simple matter!
> >
> > My 2 cents,
> > Ole
> >
> >
> > > > On Mon, 24 Apr 2023 at 10:04, Ryan Novosielski
> > > > <novosirj at rutgers.edu> wrote:
> > >
> > > I think it’s easier than all of this. Are you actually changing
> > > names of all of these things, or just IP addresses? If they all
> > > resolve to an IP now and you can bring everything down and change
> > > the hosts files or DNS, it seems to me that if the names aren’t
> > > changing, that’s that. I know that “scontrol show cluster” will
> > > show the wrong IP address but I think that updates itself.
> > >
> > > The names of the servers are in slurm.conf, but again, if the names
> > > don’t change, that won’t matter. If you have IPs there, you will
> > > need to change them.
> > >
> > > Sent from my iPhone
> > >
> > > > > On Apr 23, 2023, at 14:01, Purvesh Parmar
> > > > > <purveshp0507 at gmail.com> wrote:
> > > >
> > > > Hello,
> > > >
> > > > > We have Slurm 21.08 on Ubuntu 20. We have a cluster of 8 nodes.
> > > > > All Slurm communication happens over the 192.168.5.x network
> > > > > (LAN). However, as per requirement, we are now migrating the
> > > > > cluster to other premises, where we have 172.16.1.x (LAN). I have
> > > > > to migrate the entire setup including SLURMDBD (mariadb),
> > > > > SLURMCTLD, SLURMD. The cluster network is changing from
> > > > > 192.168.5.x to 172.16.1.x and each node will be assigned an IP
> > > > > address from the 172.16.1.x network.
> > > > > The cluster has been running for the last 3 months and it is
> > > > > required to maintain the old usage stats as well.
> > > >
> > > >
> > > > Is the procedure correct as below :
> > > >
> > > > > 1) Stop slurm
> > > > > 2) suspend all the queued jobs
> > > > > 3) backup slurm database
> > > > > 4) change the slurm & munge configuration i.e. munge conf,
> > > > > mariadb conf, slurmdbd.conf, slurmctld.conf, slurmd.conf (on
> > > > > compute nodes), gres.conf, service file
> > > > > 5) Later, do the update in the slurm database by executing below
> > > > > command
> > > > > sacctmgr modify node where node=old_name set name=new_name
> > > > > for all the nodes.
> > > > > Also, I think, slurm server name and slurmdbd server names are
> > > > > also required to be updated. How to do it, still checking
> > > > > 6) Finally, start slurmdbd, slurmctld on server and slurmd on
> > > > > compute nodes
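For step 3, the database backup might be sketched as follows. It assumes the `mysqldump` client is installed on the database host, that credentials come from the usual MariaDB client configuration, and that the accounting database uses slurmdbd's default name `slurm_acct_db`; `dump_acct_db` and the file paths are hypothetical.

```shell
#!/bin/sh
# dump_acct_db is a hypothetical helper, not part of Slurm.
# It dumps the accounting database to a file while slurmdbd is stopped.
dump_acct_db() {
    out=$1
    db=${2:-slurm_acct_db}   # slurmdbd's default database name (StorageLoc)
    # --single-transaction takes a consistent snapshot of InnoDB tables
    mysqldump --single-transaction "$db" > "$out"
}

# Example, run on the database host with suitable MariaDB credentials:
#   dump_acct_db /root/slurm_acct_db.sql
# Restore on the new host (the database must already exist):
#   mysql slurm_acct_db < /root/slurm_acct_db.sql
```

Keeping the dump next to the state-directory tar-ball gives one set of files to carry to the new premises.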
> > > >
> > > > Please help and guide for above.
> > > >
> > > > Regards,
> > > >
> > > > Purvesh Parmar
> > > > INHAIT
> >