Hi,
when I start an interactive job like this:
srun --pty --mem=3G -c2 bash
Then, when I schedule and run other jobs (interactive or non-interactive) and one of those jobs running on the same node terminates, the interactive job gets killed with this message:
srun: error: node01.abc.at: task 0: Killed
I attached our slurm config. Does anybody have an idea what is going on here or where I could look to debug? I'm quite new to slurm, so I don't know all the places to look...
Thanks a lot in advance!
Thomas
Hi Thomas,
It could be a bug in slurm.epilog.clean. You could comment it out in slurm.conf and try again.
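For reference, that script is usually hooked in through the Epilog parameter, so a quick test (the paths below are an assumption, check where your site actually keeps slurm.conf and the script) would be:

# in slurm.conf, temporarily disable the epilog by commenting out the line:
#Epilog=/etc/slurm/slurm.epilog.clean

# then tell the daemons to re-read the config:
scontrol reconfigure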
Regards, Götz Waschk
Hello Thomas,
I know I'm a few days late to this, so I'm wondering whether you've made any progress. We experience this, too, but in a different way.
First, though, you may already be aware of this, but you should use salloc rather than srun --pty for an interactive session. That has been the preferred method for a while, and one reason is that you can't run an srun from within an srun. So I wonder whether that has something to do with it.
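Something like this should give you the equivalent allocation (same flags as your srun example; whether salloc drops you straight into a shell on the compute node depends on your Slurm version and whether LaunchParameters includes use_interactive_step, so treat this as a sketch):

salloc --mem=3G -c2
# on setups where salloc leaves you on the login node, step onto the
# allocated node explicitly; this srun runs inside the allocation:
srun --pty bash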
We run an old version of Open OnDemand, and what we see is that when a user starts a virtual desktop session on a node and then submits a job with sbatch, once that job terminates or dies, the virtual desktop session terminates too. I think this happens only when the job ends up on the same node the virtual desktop session is running on. I haven't delved too deeply into it, but I suspect the virtual desktop session might be launched with an srun in some way, and somehow this is affected by something that happens when a job is submitted with sbatch. I know that's super vague, and I haven't really gone far with it, but the errors are similar (and in fact might be identical; it's been a few weeks!).
Warmest regards, Jason
Hi, sorry, I had written an email but it apparently didn't go through....
Götz was right. slurm.epilog.clean was the problem. There was a bug in there... I fixed it and now it works.
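In case anyone else runs into this: the stock slurm.epilog.clean example decides whether to clean up by asking squeue if the user still has other jobs on the node, and if that check comes back empty it kills every remaining process the user owns there, which is exactly what takes out an interactive shell. Roughly sketched (paraphrased from the distributed example script, not our exact version, so names and paths may differ on your system):

#!/bin/bash
# SLURM_UID and SLURM_JOB_ID are set by slurmd for the epilog
if [ "x$SLURM_UID" = "x" ] || [ "x$SLURM_JOB_ID" = "x" ]; then
    exit 0
fi
# never touch root or system accounts
if [ "$SLURM_UID" -lt 100 ]; then
    exit 0
fi
# list this user's jobs still on this node (SLURMD_NODENAME is also
# provided to the epilog by slurmd)
job_list=$(squeue --noheader --format=%i --user="$SLURM_UID" --nodelist="$SLURMD_NODENAME")
for job_id in $job_list; do
    if [ "$job_id" != "$SLURM_JOB_ID" ]; then
        exit 0   # the user still has another job here, leave its processes alone
    fi
done
# no other jobs found: purge everything the user still has running, so a
# broken squeue query above means an interactive bash gets killed as well
pkill -KILL -U "$SLURM_UID"
exit 0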
Best, Thomas