[slurm-users] Migration of slurm communication network / Steps / how to

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Mon Apr 24 08:00:53 UTC 2023


On 4/24/23 08:56, Purvesh Parmar wrote:
> Thank you. Will try this and get back. Is any other step missing here
> for the migration?

I don't know if any steps are missing, because I never tried moving a 
cluster like you want to do.
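
For the slurm.conf side of the rename discussed in the quoted messages
below, a small script can at least make the edit repeatable during a test
migration.  What follows is only a minimal Python sketch, not something
from this thread: the function name rename_hosts and the host names in the
usage note are hypothetical, and it assumes plain host names in slurm.conf
rather than bracketed ranges such as node[01-08].  It replaces whole host
names only and leaves comment lines alone.

```python
import re


def rename_hosts(conf_text, mapping):
    """Return conf_text with each old host name replaced by its new name.

    mapping is a dict of {old_name: new_name}.  Only whole host names are
    replaced, so renaming "dc1-n1" does not touch "dc1-n10".
    Assumption: host names are written out in full, not as ranges.
    """
    if not mapping:
        return conf_text
    # Try longer names first so an FQDN wins over its short form.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w.-])(" + "|".join(re.escape(n) for n in names) + r")(?![\w.-])"
    )
    out = []
    for line in conf_text.splitlines(keepends=True):
        if line.lstrip().startswith("#"):
            out.append(line)  # leave comment lines untouched
        else:
            out.append(pattern.sub(lambda m: mapping[m.group(1)], line))
    return "".join(out)
```

For example, rename_hosts(open("slurm.conf").read(), {"dc1-ctl": "dc2-ctl"})
would return the text with ControlMachine=dc1-ctl changed to
ControlMachine=dc2-ctl.  Compare the result against the original with diff
before copying it into place.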

/Ole

> On Mon, 24 Apr 2023 at 12:08, Ole Holm Nielsen
> <Ole.H.Nielsen at fysik.dtu.dk> wrote:
> 
>     On 4/24/23 08:09, Purvesh Parmar wrote:
>      > Thank you. However, because this is a change of data center, and the
>      > server hostnames and FQDNs contain the data center name, I have to
>      > change both the hostnames and the IP addresses to match the new DC
>      > naming.
> 
>     Could your data center be persuaded to introduce DNS CNAME aliases for
>     the old names to point to the new DC names?
> 
>     If you're forced to use new DNS names only, then it's simple to change
>     DNS names of compute nodes and partitions in slurm.conf:
> 
>     NodeName=...
>     PartitionName=xxx Nodes=...
> 
>     as well as the slurmdb server name:
> 
>     AccountingStorageHost=...
> 
>     What I have never tried before is to change the DNS name of the slurmctld
>     host:
> 
>     ControlMachine=...
> 
>     The critical aspect here is that you need to stop all batch jobs, plus
>     slurmdbd and slurmctld.  Then you can backup (tar-ball) and transfer the
>     Slurm state directories:
> 
>     StateSaveLocation=/var/spool/slurmctld
> 
>     However, I don't know if the name of the ControlMachine is hard-coded in
>     the StateSaveLocation files?
> 
>     I strongly suggest that you try to make a test migration of the
>     cluster to the new DC to find out if it works or not.  Then you can
>     always make multiple attempts without breaking anything.
> 
>     Best regards,
>     Ole
> 
> 
>      > On Mon, 24 Apr 2023 at 11:25, Ole Holm Nielsen
>      > <Ole.H.Nielsen at fysik.dtu.dk> wrote:
>      >
>      >     On 4/24/23 06:58, Purvesh Parmar wrote:
>      >      > Thank you, but the hostnames are changing as well as the IP
>      >      > addresses: those of the Slurm server, the database server, and
>      >      > the slurmd compute nodes.
>      >
>      >     I suggest that you talk to your networking people and request that
>      >     the old DNS names be created in the new network's DNS for your
>      >     Slurm cluster.  Then Ryan's solution will work.  Changing DNS names
>      >     is a very simple matter!
>      >
>      >     My 2 cents,
>      >     Ole
>      >
>      >
>      >      > On Mon, 24 Apr 2023 at 10:04, Ryan Novosielski
>      >      > <novosirj at rutgers.edu> wrote:
>      >      >
>      >      >     I think it’s easier than all of this. Are you actually changing
>      >      >     names of all of these things, or just IP addresses? If they all
>      >      >     resolve to an IP now and you can bring everything down and
>      >      >     change the hosts files or DNS, it seems to me that if the names
>      >      >     aren’t changing, that’s that. I know that “scontrol show
>      >      >     cluster” will show the wrong IP address but I think that
>      >      >     updates itself.
>      >      >
>      >      >     The names of the servers are in slurm.conf, but again, if the
>      >      >     names don’t change, that won’t matter. If you have IPs there,
>      >      >     you will need to change them.
>      >      >
>      >      >     Sent from my iPhone
>      >      >
>      >      >      > On Apr 23, 2023, at 14:01, Purvesh Parmar
>      >      >      > <purveshp0507 at gmail.com> wrote:
>      >      >      > 
>      >      >      > Hello,
>      >      >      >
>      >      >      > We have Slurm 21.08 on Ubuntu 20, on a cluster of 8 nodes.
>      >      >      > All Slurm communication happens over the 192.168.5.x
>      >      >      > network (LAN). We are now required to migrate the cluster
>      >      >      > to other premises, where the LAN is 172.16.1.x. I have to
>      >      >      > migrate the entire setup, including SLURMDBD (mariadb),
>      >      >      > SLURMCTLD and SLURMD, and each node will be assigned an IP
>      >      >      > address from the 172.16.1.x network.
>      >      >      > The cluster has been running for the last 3 months and it
>      >      >      > is required to maintain the old usage stats as well.
>      >      >      >
>      >      >      >
>      >      >      > Is the procedure below correct?
>      >      >      >
>      >      >      > 1) Stop slurm
>      >      >      > 2) suspend all the queued jobs
>      >      >      > 3) backup slurm database
>      >      >      > 4) change the slurm & munge configuration i.e. munge conf,
>      >      >      >    mariadb conf, slurmdbd.conf, slurmctld.conf, slurmd.conf
>      >      >      >    (on compute nodes), gres.conf, service file
>      >      >      > 5) Later, do the update in the slurm database by executing
>      >      >      >    the command below for all the nodes:
>      >      >      >    sacctmgr modify node where node=old_name set name=new_name
>      >      >      >    Also, I think the slurm server name and slurmdbd server
>      >      >      >    names need to be updated. How to do that, I am still
>      >      >      >    checking.
>      >      >      > 6) Finally, start slurmdbd, slurmctld on server and slurmd
>      >      >      >    on compute nodes
>      >      >      >
>      >      >      > Please help and guide for above.
>      >      >      >
>      >      >      > Regards,
>      >      >      >
>      >      >      > Purvesh Parmar
>      >      >      > INHAIT
>      >


