[slurm-users] I just had a "conversation" with ChatGPT about working DMTCP, OpenMPI and SLURM. Here are the results

Diego Zuccato diego.zuccato at unibo.it
Mon Feb 13 07:31:30 UTC 2023


Hi.

I'm no expert, but it seems ChatGPT is confusing "queued" and "running"
jobs. I'm assuming you are interested in temporarily shutting down the
slurmctld node for maintenance.

If the jobs are still queued ( == not yet running), what do you need to
save? The queue order is dynamically adjusted by slurmctld based on the
selected priority factors, so there's nothing special to save.
For the running jobs, OTOH, you have multiple solutions:
1) drain the cluster: safest but often impractical
2) checkpoint: seems fragile, especially if jobs span multiple nodes
3) have a second slurmctld node (a small VM is sufficient) that takes over
the cluster management when the master node is down (be *sure* the state
dir is shared and quite fast!)
4) just hope you'll be able to recover the slurmctld node before a job 
completes *and* the timeouts expire
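
For option 3, a minimal slurm.conf sketch could look like the fragment
below. The host names and the state path are placeholders I made up for
illustration, not values from any real cluster; adapt them to your site
and check the slurm.conf man page for your Slurm version.

```
# slurm.conf fragment (the file must be the same on every node)
# The first SlurmctldHost line is the primary controller,
# the second one is the backup that takes over when the primary is down.
# "ctl-primary" and "ctl-backup" are hypothetical host names.
SlurmctldHost=ctl-primary
SlurmctldHost=ctl-backup

# State directory: both controllers must be able to read and write it,
# so it has to live on a shared (and reasonably fast) filesystem.
# "/shared/slurm/state" is a hypothetical path.
StateSaveLocation=/shared/slurm/state

# Seconds the backup waits for the primary to respond before taking over
# (120 is the documented default).
SlurmctldTimeout=120
```

With a setup like this, "scontrol ping" reports which controllers are
responding, and "scontrol takeover" asks the backup to assume control
before you shut the primary down.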

While 4 is relatively risky (you could end up with runaway jobs that 
you'll have to fix afterwards), it does not directly impact users: their 
jobs will run and complete/fail regardless of slurmctld state. At most 
the users won't receive a completion mail and they will be billed less 
than expected.

Diego

On 10/02/2023 20:06, Analabha Roy wrote:
> Hi,
> 
> I'm having some complex issues coordinating OpenMPI, SLURM, and DMTCP in 
> my cluster. On a whim, I logged into ChatGPT and asked the AI about it.
> It told me things that I couldn't find in the current version of the 
> SLURM docs (I looked). Since ChatGPT is not always reliable, I reproduce
> the contents of my chat session in my GitHub repository for peer review
> and commentary by you fine folks.
> 
> https://github.com/hariseldon99/buparamshavak/blob/main/chatgpt.md
> 
> I apologize for the poor formatting. I did this in a hurry, and my 
> knowledge of markdown is rudimentary.
> 
> Please do comment on the veracity and reliability of the AI's response.
> 
> AR
> 
> -- 
> Analabha Roy
> Assistant Professor
> Department of Physics 
> <http://www.buruniv.ac.in/academics/department/physics>
> The University of Burdwan <http://www.buruniv.ac.in/>
> Golapbag Campus, Barddhaman 713104
> West Bengal, India
> Emails: daneel at utexas.edu <mailto:daneel at utexas.edu>, 
> aroy at phys.buruniv.ac.in <mailto:aroy at phys.buruniv.ac.in>, 
> hariseldon99 at gmail.com <mailto:hariseldon99 at gmail.com>
> Webpage: http://www.ph.utexas.edu/~daneel/ 
> <http://www.ph.utexas.edu/~daneel/>

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786


