[slurm-users] After Each slurm Run, I Need to Reinstall slurm

Will Dennis wdennis at nec-labs.com
Sat May 5 19:26:33 MDT 2018


A few thoughts…

1) I am not sure Slurm can run “all-in-one” with controller/worker/acctg-db all on one host… If anyone else knows whether this is doable, please chime in. (I actually have a request to do this for a single machine at work, where the researchers want many folks to share a single GPU compute server by running a job scheduler and submitting their jobs to it.)

2) Maybe throw your bash script on a web site (pastebin.com, GitHub gist, etc.) so we can take a look at what you are doing.

3) I am not sure how you are starting slurmctld / slurmd services the first time, but do you know if you are running them via systemd? (the Ubuntu service manager in 16.04/18.04) If so, what does 'systemctl status [slurmctld|slurmd]' output?
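
A minimal sketch of the check in point 3, assuming the systemd unit names match the daemon names (slurmctld, slurmd, slurmdbd). A tarball install may not have registered unit files at all, in which case systemctl will simply report the unit as not found. The function name check_slurm_units is only illustrative.

```shell
#!/bin/sh
# Print the systemd status of each Slurm daemon mentioned in this thread.
# Assumption: unit names equal the daemon names; adjust if yours differ.
check_slurm_units() {
    if ! command -v systemctl >/dev/null 2>&1; then
        echo "systemctl not found; services are not managed by systemd here"
        return 0
    fi
    for unit in slurmdbd slurmctld slurmd; do
        echo "== $unit =="
        # --no-pager keeps the output non-interactive; a non-zero exit
        # (inactive or unknown unit) is expected and must not stop the loop.
        systemctl status "$unit" --no-pager 2>&1 || true
    done
}

check_slurm_units
```

Pasting the full output for each unit into a reply shows whether the unit is loaded, whether it is active, and the last few log lines from the failed start.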

Let’s start with that.

HTH,
Will

On May 5, 2018, at 2:44 PM, Kenneth Russell <linux-ken at comcast.net> wrote:

I am a new slurm user and am trying to set up a single node test system. I have spent endless hours trying to get slurm services to start. I am running Ubuntu Server V16.04 and slurm 17.11.5. My MB has an AMD 8-core processor. When I try to start the slurmdbd or slurmctld services, I get messages saying that shared libraries can't be accessed or that pid files are missing. At times, I have noticed that the pid files in /var/run have been deleted. I have made copies of the pid files and copy them back to /var/run when they are missing.

I have found that if I reinstall slurm from the tarball, the services will start. To speed things up, I have created a bash script to reinstall slurm, starting with the tarball extraction step. This is a very inefficient work-around.

Can anyone help me solve the problem of why slurm runs only once and then fails on subsequent starts?

I can send copies of conf and log files if requested.

Thanks, in advance.

Ken Russell



