[slurm-users] I just had a "conversation" with ChatGPT about working DMTCP, OpenMPI and SLURM. Here are the results

Analabha Roy hariseldon99 at gmail.com
Sat Feb 18 18:40:18 UTC 2023


Hi,

On Mon, 13 Feb 2023, 13:04 Diego Zuccato, <diego.zuccato at unibo.it> wrote:

> Hi.
>
> I'm no expert, but it seems ChatGPT is confusing "queued" and "running"
> jobs.


That's what I also suspected.



> Assuming you are interested in temporarily shutting down slurmctld
> node for maintenance.
>


Temporarily and daily.





>
> If the jobs are still queued ( == not yet running) what do you need to
> save? The queue order is dynamically adjusted by slurmctld based on the
> selected factors, there's nothing special to save.
> For the running jobs, OTOH, you have multiple solutions:
> 1) drain the cluster: safest but often impractical
> 2) checkpoint: seems fragile, especially if jobs span multiple nodes
>

I just have one node, but the bigger problem with checkpointing is that
GPUs don't seem to be supported.


> 3) have a second slurmctld node (a small VM is sufficient) that takes over
> the cluster management when the master node is down (be *sure* the state
> dir is shared and quite fast!)
>

I've just got that one "node" for compute and login and storage and
everything.

It's a Tyrone server with 64 cores and a couple of RAIDed HDDs. Just wanna
run some DFT/QM/MM simulations for myself and departmental colleagues, and
do some exact diagonalization problems.




> 4) just hope you'll be able to recover the slurmctld node before a job
> completes *and* the timeouts expire
>


I booted into GParted Live and beefed up the swap space to 200 GB (the
RAM is 93 GB). I've set up a mandatory (through QOS settings) Slurm
reservation that kills all running jobs in the normal QOS after 8:30 pm
every day, and a cron job that starts at 8:35 pm, drains the partitions,
suspends all jobs running on elevated QOS privileges, then hibernates the
whole sumbich to swap. Another script runs whenever the fella comes outta
hibernation, resets the Slurm partitions and resumes the suspended jobs.
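
A minimal sketch of the 8:35 pm routine and the wake-up script described
above, written as Python helpers that only build the command lines. The
partition name "normal" and QOS name "elevated" are placeholders, not the
real names on this machine, and nothing here is the actual script in use:

```python
import subprocess


def running_job_ids(qos, run=subprocess.run):
    """Return the IDs of running jobs in one QOS, as reported by squeue."""
    out = run(
        ["squeue", "-h", "-t", "RUNNING", "-q", qos, "-o", "%i"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.split()


def hibernate_commands(partitions, job_ids):
    """Commands for the evening cron job: drain, suspend, hibernate."""
    cmds = [
        ["scontrol", "update", f"PartitionName={p}", "State=DRAIN"]
        for p in partitions
    ]
    if job_ids:
        # scontrol suspend takes a comma-separated job list.
        cmds.append(["scontrol", "suspend", ",".join(job_ids)])
    cmds.append(["systemctl", "hibernate"])
    return cmds


def wake_commands(partitions, job_ids):
    """Commands for the wake-up script: reopen partitions, resume jobs."""
    cmds = [
        ["scontrol", "update", f"PartitionName={p}", "State=UP"]
        for p in partitions
    ]
    if job_ids:
        cmds.append(["scontrol", "resume", ",".join(job_ids)])
    return cmds


# Hypothetical usage from the cron job (names are assumptions):
#   ids = running_job_ids("elevated")
#   for cmd in hibernate_commands(["normal"], ids):
#       subprocess.run(cmd, check=True)
```

The job IDs that were suspended would have to be saved somewhere (a file,
say) so the wake-up script can resume the same ones.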

It's an ugly jugaad, I know.

I guess it's tough noogies for the normal qos people if their jobs ran past
the reservation or were not properly checkpointed before a blackout, but I
don't see any other alternative.

My department refuses to let me run my thingie 24/7, and power outages
occur frequently round here.

I'm concerned about implementing a failsafe in case this Rube Goldberg-like
setup takes a hard left.

Was thinking about a systemd service that kills all running jobs, then
simply runs "scontrol shutdown" to preserve the state of queued jobs and
then resumes a regular system shutdown. In that case, automatic
checkpointing of the jobs with DMTCP/MANA would be cool, and I was
encouraged when ChatGPT claimed that Slurm supported this. But the recent
docs don't corroborate this claim, so I guess it got deprecated or
something...
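
A rough sketch of what that failsafe could run, assuming it gets called
from a systemd unit at shutdown (the unit itself is not shown, and the
function names here are made up for illustration). It lists the running
jobs with squeue, cancels them with scancel, then runs "scontrol shutdown":

```python
import subprocess


def list_running_jobs(run=subprocess.run):
    """Return the IDs of all currently running jobs, via squeue."""
    out = run(
        ["squeue", "-h", "-t", "RUNNING", "-o", "%i"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.split()


def failsafe_commands(job_ids):
    """Build the commands: cancel running jobs, then stop the Slurm daemons."""
    cmds = []
    if job_ids:
        cmds.append(["scancel", *job_ids])
    cmds.append(["scontrol", "shutdown"])
    return cmds


def main(run=subprocess.run):
    """Run the failsafe; queued (pending) jobs are left untouched."""
    for cmd in failsafe_commands(list_running_jobs(run)):
        # check=False: keep going so the system shutdown is never blocked.
        run(cmd, check=False)
```

Only running jobs are cancelled, so whatever is still pending stays in the
queue for the next boot.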

> While 4 is relatively risky (you could end up with runaway jobs that
> you'll have to fix afterwards), it does not directly impact users: their
> jobs will run and complete/fail regardless of slurmctld state. At most
> the users won't receive a completion mail and they will be billed less
> than expected.
>
> Diego
>
> Il 10/02/2023 20:06, Analabha Roy ha scritto:
> > Hi,
> >
> > I'm having some complex issues coordinating OpenMPI, SLURM, and DMTCP in
> > my cluster. On a whim, I logged into ChatGPT and asked the AI about it.
> > It told me things that I couldn't find in the current version of the
> > SLURM docs (I looked). Since ChatGPT is not always reliable, I reproduce
> > the contents of my chat session in my GitHub repository for peer review
> > and commentary by you fine folks.
> >
> > https://github.com/hariseldon99/buparamshavak/blob/main/chatgpt.md
> >
> > I apologize for the poor formatting. I did this in a hurry, and my
> > knowledge of markdown is rudimentary.
> >
> > Please do comment on the veracity and reliability of the AI's response.
> >
> > AR
> >
> > --
> > Analabha Roy
> > Assistant Professor
> > Department of Physics
> > <http://www.buruniv.ac.in/academics/department/physics>
> > The University of Burdwan <http://www.buruniv.ac.in/>
> > Golapbag Campus, Barddhaman 713104
> > West Bengal, India
> > Emails: daneel at utexas.edu, aroy at phys.buruniv.ac.in,
> > hariseldon99 at gmail.com
> > Webpage: http://www.ph.utexas.edu/~daneel/
>
> --
> Diego Zuccato
> DIFA - Dip. di Fisica e Astronomia
> Servizi Informatici
> Alma Mater Studiorum - Università di Bologna
> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> tel.: +39 051 20 95786
>
>