[slurm-users] After Each slurm Run, I Need to Reinstall slurm
Will Dennis
wdennis at nec-labs.com
Sat May 5 19:26:33 MDT 2018
A few thoughts…
1) I am not sure Slurm can run “all-in-one” with the controller, worker, and accounting DB all on one host. If anyone else knows whether this is doable, please chime in. (I actually have a request to do this for a single machine at work, where the researchers want many people to share a single GPU compute server by running a job scheduler and submitting their jobs to it.)
2) Maybe put your bash script on a web site (pastebin.com, GitHub gist, etc.) so we can take a look at what you are doing.
3) I am not sure how you are starting the slurmctld / slurmd services the first time, but do you know whether you are running them via systemd (the Ubuntu service manager in 16.04/18.04)? If so, what does 'systemctl status [slurmctld|slurmd]' output?
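As a rough sketch of the check in point 3, the following shell function reports the systemd status of each unit passed to it. The function name report_units is hypothetical, and the unit names slurmctld and slurmd are taken from the command above; a tarball install may not have created these unit files at all, in which case systemctl reports the unit as not found.

```shell
#!/bin/sh
# Sketch: print the systemd status for each unit name given as an argument.
# Assumption: the units are named as in the message above (slurmctld, slurmd).
report_units() {
    # Bail out early on hosts where systemd is not the service manager.
    if ! command -v systemctl >/dev/null 2>&1; then
        echo "systemctl not found; services are probably not managed by systemd"
        return 1
    fi
    for unit in "$@"; do
        echo "== $unit =="
        # --no-pager keeps the output on stdout so it can be pasted into a reply.
        if systemctl status "$unit" --no-pager; then
            echo "$unit: active"
        else
            echo "$unit: not active or not found (exit code $?)"
        fi
    done
}

# Example invocation:
#   report_units slurmctld slurmd
```

Running the example invocation and pasting the full output into a reply gives the list the same information as running the two systemctl commands by hand.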
Let’s start with that.
HTH,
Will
On May 5, 2018, at 2:44 PM, Kenneth Russell <linux-ken at comcast.net> wrote:
I am a new Slurm user and am trying to set up a single-node test system. I have spent endless hours trying to get the Slurm services to start. I am running Ubuntu Server 16.04 and Slurm 17.11.5. My motherboard has an 8-core AMD processor. When I try to start the slurmdbd or slurmctld services, I get messages saying that shared libraries cannot be accessed or that pid files are missing. At times, I have noticed that the pid files in /var/run have been deleted. I have made copies of the pid files and copy them back to /var/run when they are missing.
I have found that if I reinstall slurm from the tarball, the services will start. To speed things up, I have created a bash script to reinstall slurm, starting with the tarball extraction step. This is a very inefficient work-around.
Can anyone help me solve the problem of why slurm runs only once and then fails on subsequent starts?
I can send copies of conf and log files if requested.
Thanks in advance.
Ken Russell
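
One thing that may be worth ruling out for the disappearing pid files: on Ubuntu 16.04, /var/run is a symlink to /run, which is a tmpfs and is emptied at every boot. This is only a possible contributor, not a confirmed diagnosis. The sketch below lists every *PidFile setting in a Slurm config file and reports whether the directory meant to hold the pid file exists. The function name check_pid_dirs is hypothetical, and the config path must be supplied by the caller because the install location is not given in the message.

```shell
#!/bin/sh
# Sketch: for each "...PidFile=/some/path" line in a Slurm config file,
# report whether the directory that should contain the pid file exists.
# Assumption: settings are written one per line as Key=Value.
check_pid_dirs() {
    conf=$1
    # Matches SlurmctldPidFile, SlurmdPidFile, and slurmdbd.conf's PidFile.
    grep -i '^[[:space:]]*[A-Za-z]*PidFile[[:space:]]*=' "$conf" |
    while IFS='=' read -r key path; do
        # Drop any trailing comment, then strip surrounding whitespace.
        path=${path%%#*}
        key=$(printf '%s' "$key" | tr -d '[:space:]')
        path=$(printf '%s' "$path" | tr -d '[:space:]')
        dir=$(dirname "$path")
        if [ -d "$dir" ]; then
            echo "ok: $key -> $dir exists"
        else
            echo "missing: $key -> $dir does not exist"
        fi
    done
}

# Example invocation (substitute the real config path):
#   check_pid_dirs /path/to/slurm.conf
```

If a directory is reported as missing after a reboot, that would explain pid files vanishing without any reinstall being involved; it would not explain the shared-library messages, which need to be looked at separately.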