[slurm-users] I just had a "conversation" with ChatGPT about working DMTCP, OpenMPI and SLURM. Here are the results
Diego Zuccato
diego.zuccato at unibo.it
Mon Feb 13 07:31:30 UTC 2023
Hi.
I'm no expert, but it seems ChatGPT is confusing "queued" and "running"
jobs. I'm assuming you want to temporarily shut down the slurmctld node
for maintenance.
If the jobs are still queued ( == not yet running), what do you need to
save? The queue order is dynamically adjusted by slurmctld based on the
selected factors, so there's nothing special to save.
For the running jobs, OTOH, you have multiple solutions:
1) drain the cluster: safest but often impractical
2) checkpoint: seems fragile, especially if jobs span multiple nodes
3) have a second slurmctld node (a small VM is sufficient) that takes over
the cluster management when the master node is down (be *sure* the state
dir is shared and quite fast!)
4) just hope you'll be able to recover the slurmctld node before a job
completes *and* the timeouts expire
While 4 is relatively risky (you could end up with runaway jobs that
you'll have to fix afterwards), it does not directly impact users: their
jobs will run and complete/fail regardless of slurmctld state. At most
the users won't receive a completion mail and they will be billed less
than expected.
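For option 3, a minimal sketch of the relevant slurm.conf lines could
look like the fragment below. The host names (ctl-primary, ctl-backup)
and the state path are made up for illustration, and the timeout shown
is just the documented default, so adapt everything to your site.

```
# slurm.conf fragment (hypothetical host names and path)
SlurmctldHost=ctl-primary               # first entry: primary controller
SlurmctldHost=ctl-backup                # second entry: backup controller
StateSaveLocation=/shared/slurm/state   # must be reachable from both controllers
SlurmctldTimeout=120                    # seconds the backup waits before taking over
```

With something like this in place, "scontrol ping" should report the
status of both controllers, and "scontrol takeover" can be used to ask
the backup to take control before the primary is shut down.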
Diego
On 10/02/2023 20:06, Analabha Roy wrote:
> Hi,
>
> I'm having some complex issues coordinating OpenMPI, SLURM, and DMTCP in
> my cluster. On a whim, I logged into ChatGPT and asked the AI about it.
> It told me things that I couldn't find in the current version of the
> SLURM docs (I looked). Since ChatGPT is not always reliable, I reproduce
> the contents of my chat session in my GitHub repository for peer review
> and commentary by you fine folks.
>
> https://github.com/hariseldon99/buparamshavak/blob/main/chatgpt.md
>
> I apologize for the poor formatting. I did this in a hurry, and my
> knowledge of markdown is rudimentary.
>
> Please do comment on the veracity and reliability of the AI's response.
>
> AR
>
> --
> Analabha Roy
> Assistant Professor
> Department of Physics
> <http://www.buruniv.ac.in/academics/department/physics>
> The University of Burdwan <http://www.buruniv.ac.in/>
> Golapbag Campus, Barddhaman 713104
> West Bengal, India
> Emails: daneel at utexas.edu <mailto:daneel at utexas.edu>,
> aroy at phys.buruniv.ac.in <mailto:aroy at phys.buruniv.ac.in>,
> hariseldon99 at gmail.com <mailto:hariseldon99 at gmail.com>
> Webpage: http://www.ph.utexas.edu/~daneel/
> <http://www.ph.utexas.edu/~daneel/>
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786