[slurm-users] Problem with permisions. CentOS 7.8

Tue Jun 2 16:40:48 UTC 2020

Yes, you have both daemons, installed with the slurm rpm. The slurmd (all
nodes) communicates with slurmctld (which runs on the main master node and,
optionally, on a backup node).

You do not need to run slurmd as the slurm user. Use `systemctl enable
slurmctld` (and slurmd) followed by `systemctl start slurmctld`. After a
configuration change, use restart instead of start only if `sudo
scontrol reconfigure` asks for it.
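
A rough sketch of that sequence, assuming the unit names shipped by the rpm
(slurmctld on the master, slurmd on every node). The helper names, the
"controller"/"compute" role labels and the DRY_RUN switch are made up for
illustration; with DRY_RUN=1 (the default here) the commands are only printed.

```shell
#!/bin/sh
# Hypothetical helper: map a node role to the Slurm unit it should run.
units_for_role() {
    case "$1" in
        controller) echo "slurmctld" ;;  # master (and optional backup) node
        compute)    echo "slurmd" ;;     # every compute node
        *)          echo "unknown role: $1" >&2; return 1 ;;
    esac
}

# Enable the unit at boot and start it now.
# With DRY_RUN=1 the systemctl commands are printed instead of executed.
enable_and_start() {
    units=$(units_for_role "$1") || return 1
    for unit in $units; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "systemctl enable $unit"
            echo "systemctl start $unit"
        else
            systemctl enable "$unit" && systemctl start "$unit"
        fi
    done
}

# On a compute node only slurmd is involved.
enable_and_start compute
```

Run it as root with DRY_RUN=0 once the printed commands look right.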

If you run `slurmctld -Dvvvv` and `slurmd -Dvvvv` as root, you'll see debug
output that helps to reveal further configuration problems. The slurmd needs
slurmctld running; otherwise it will output "error: Unable to register:
Unable to contact slurm controller (connect failure)"
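
Before starting slurmd it can help to check which controller the node will
try to contact. The sketch below is only an illustration: it assumes the
config lives in /etc/slurm/slurm.conf and names the controller with
SlurmctldHost= (or the older ControlMachine=), spelled exactly like that.
The function name is made up.

```shell
#!/bin/sh
# Print the controller host named in a slurm.conf-style file.
# Assumption: the key is written as "SlurmctldHost=" or "ControlMachine=".
controller_host() {
    conf="${1:-/etc/slurm/slurm.conf}"
    # Keep only the host name: stop at whitespace, "(" (address suffix) or "#".
    sed -n -E \
        's/^[[:space:]]*(SlurmctldHost|ControlMachine)=([^[:space:](#]+).*/\2/p' \
        "$conf" | head -n 1
}

# Usage sketch (needs a real cluster, so it is left commented out):
#   host=$(controller_host) && ping -c 1 "$host" && scontrol ping
```

If `scontrol ping` reports the controller as DOWN, slurmd will print the
"Unable to register" error shown above.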

You should find the services here:
-rw-r--r-- 1 root root 339 may 30 20:18 /usr/lib/systemd/system/slurmctld.service
-rw-r--r-- 1 root root 342 may 30 20:18 /usr/lib/systemd/system/slurmdbd.service
-rw-r--r-- 1 root root 398 may 30 20:18 /usr/lib/systemd/system/slurmd.service
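
Your status output also shows a unit in /etc/systemd/system and an old
/etc/rc.d/init.d/slurm script, so it may be worth listing everything that
defines a Slurm service on the node. A small sketch, using only the paths
that appear in this thread; the function name and the optional root
argument (handy for trying it on a scratch directory) are invented here.

```shell
#!/bin/sh
# List the Slurm service definitions present under a root directory.
# With no argument the real filesystem is checked.
find_slurm_services() {
    root="${1:-}"
    for f in \
        /usr/lib/systemd/system/slurmctld.service \
        /usr/lib/systemd/system/slurmdbd.service \
        /usr/lib/systemd/system/slurmd.service \
        /etc/systemd/system/slurmd.service \
        /etc/rc.d/init.d/slurm
    do
        # Print the path only if the file exists.
        [ -e "$root$f" ] && echo "$f"
    done
    return 0
}

# Usage: find_slurm_services
```

A unit file in /etc/systemd/system takes precedence over the one in
/usr/lib/systemd/system, and a leftover init.d script is what systemd turns
into the "LSB: slurm daemon management" slurm.service, which is possibly
where the slurmctld.pid message comes from.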

Feel free to ask for more information,
Best regards

On Tue, 2 Jun 2020 at 11:12, Ferran Planas Padros (<ferran.padros at su.se>)
wrote:

>
> Hi Ole,
>
>
> Thanks for your answer and your time. I'd appreciate it if you, or someone
> else, could take a final look at my case.
>
> After your suggestions and comments, I have redone the whole installation
> of Munge and Slurm. I uninstalled and removed all previous rpms and
> restarted from scratch. Munge works with no problem; however, the same is
> not true for Slurm (for which I have used the instructions given in
> the link you attached).
>
>
> - If I run /usr/bin/slurmd -D vvvvv as root user, I get verbose output up to
> the line 'slurmd: debug2: No acct_gather.conf file
> (/etc/slurm/acct_gather.conf)', where the output stops. After I press
> Ctrl+C, I get
>
>
> slurmd: all threads complete
>
> slurmd: Consumable Resources (CR) Node Selection plugin shutting down ...
>
> slurmd: Munge cryptographic signature plugin unloaded
>
> slurmd: Slurmd shutdown completing
>
> - After that, if I run 'systemctl start slurmd' and 'systemctl status
> slurmd', also as root user, I get:
>
> *●* slurmd.service - Slurm node daemon
>
>    Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor
> preset: disabled)
>
>    Active: *active (running)* since Tue 2020-06-02 16:53:51 CEST; 33s ago
>
>   Process: 2750 ExecStart=/usr/sbin/slurmd -d /usr/sbin/slurmstepd
> $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
>
>  Main PID: 2752 (slurmd)
>
>    CGroup: /system.slice/slurmd.service
>
>            └─2752 /usr/sbin/slurmd -d /usr/sbin/slurmstepd
>
>
> Jun 02 16:53:51 roos21.organ.su.se systemd[1]: Starting Slurm node
> daemon...
>
> Jun 02 16:53:51 roos21.organ.su.se systemd[1]: Can't open PID file
> /var/run/slurm/slurmd.pid (yet?) after start: No such file or directory
>
> Jun 02 16:53:51 roos21.organ.su.se systemd[1]: Started Slurm node daemon.
>
> - Next, I kill the slurmd process and run, as the slurm user, 'systemctl
> start slurm'. This does not work, and journalctl -xe shows the
> following:
>
>
> Jun 02 16:56:01 roos21.organ.su.se systemd[1]: Starting LSB: slurm daemon
> management...
>
> -- Subject: Unit slurm.service has begun start-up
>
> -- Defined-By: systemd
>
> -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
>
> --
>
> -- Unit slurm.service has begun starting up.
>
> Jun 02 16:56:01 roos21.organ.su.se slurm[2805]: starting slurmd: [  OK  ]
>
> Jun 02 16:56:01 roos21.organ.su.se systemd[1]: Can't open PID file
> /var/run/slurmctld.pid (yet?) after start: No such file or directory
>
> Jun 02 16:56:37 roos21.organ.su.se polkitd[1316]: *Unregistered
> Authentication Agent for unix-process:2792:334647 (system bus name :1.46,
> object path /org/freedesktop*
>
> Jun 02 16:56:38 roos21.organ.su.se sudo[2790]: pam_unix(sudo:session):
> session closed for user slurm
>
> I don't really understand this, because I have not installed
> slurmctld. The slurmctld.service file does not even exist.
>
>
> Any idea?
>
>
> Many thanks,
>
> Ferran
>
>
>
> ------------------------------
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
> Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk>
> *Sent:* Tuesday, June 2, 2020 12:03:27 PM
> *To:* Slurm User Community List
> *Subject:* Re: [slurm-users] Problem with permisions. CentOS 7.8
>
> Hi Ferran,
>
> Please install Slurm software in the standard way, see
> https://wiki.fysik.dtu.dk/niflheim/Slurm_installation
>
> It seems that you have some unusual way to manage your Linux systems.  In
> Stockholm and Sweden there are many Slurm experts at the HPC centers who
> might be able to help you more directly.
>
> Best regards,
> Ole
>
> On 6/2/20 11:58 AM, Ferran Planas Padros wrote:
> > I did a fresh installation with the EPEL repo, and installing munge from
> > it and it worked. To have the slurm user for munge was definitely a
> > problem, but that is the set up we have on the CentOS 6. Now I've learnt
> > my lesson for future installations, thanks to everyone!
> >
> >
> > Now, I have a follow up question, if you don't mind. I am now trying to
> > run slurm, and it crashes:
> >
> >
> > [root at roos21 ~]# systemctl status slurm.service
> >
> > *●*slurm.service - LSB: slurm daemon management
> >
> > Loaded: loaded (/etc/rc.d/init.d/slurm; bad; vendor preset: disabled)
> >
> > Active: *failed*(Result: protocol) since Tue 2020-06-02 11:45:33 CEST;
> > 3min 33s ago
> >
> > Docs: man:systemd-sysv-generator(8)
> >
> >
> > Jun 02 11:45:33 roos21.organ.su.se systemd[1]: Starting LSB: slurm daemon
> > management...
> >
> > Jun 02 11:45:33 roos21.organ.su.se slurm[18223]: starting slurmd: [OK]
> >
> > Jun 02 11:45:33 roos21.organ.su.se systemd[1]: Can't open PID file
> > /var/run/slurmctld.pid (yet?) after start: No such file or directory
> >
> > Jun 02 11:45:33 roos21.organ.su.se systemd[1]: *Failed to start LSB: slurm
> > daemon management.*
> >
> > Jun 02 11:45:33 roos21.organ.su.se systemd[1]: *Unit slurm.service entered
> > failed state.*
> >
> > Jun 02 11:45:33 roos21.organ.su.se systemd[1]: *slurm.service failed.*
> >
> >
> >
> > The thing is that this is a computing node, not the master node, so
> > slurmctld is not installed. Why do I get this error?
> >
> >
> > Many thanks, and my apologies for these rather simple questions. I am a
> > newbie on this.
> >
> >
> > Best,
> >
> > Ferran
> >
> >
> > --------------------------------------------------------------------------
> > *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
> > Renata Maria Dart <renata at slac.stanford.edu>
> > *Sent:* Friday, May 29, 2020 6:33:58 PM
> > *To:* Ole.H.Nielsen at fysik.dtu.dk; Slurm User Community List
> > *Subject:* Re: [slurm-users] Problem with permisions. CentOS 7.8
> > Hi, don't know if this might be your problem but I ran into an issue
> > on centos 7.8 where /var/run/munge was not being created at boottime
> > because I didn't have the munge user in the local password file.  I
> > have the munge user in AD and once the system is up I can start munge
> > successfully, but AD wasn't available early enough during boot for the
> > munge startup to see it.  I added these lines to the munge systemctl
> > file:
> >
> > PermissionsStartOnly=true
> > ExecStartPre=-/usr/bin/mkdir -m 0755 -p /var/run/munge
> > ExecStartPre=-/usr/bin/chown -R munge:munge /var/run/munge
> >
> > and my system now starts munge up fine during a reboot.
> >
> > Renata
> >
> > On Fri, 29 May 2020, Ole Holm Nielsen wrote:
> >
> >> Hi Ferran,
> >>
> >> When you have a CentOS 7 system with the EPEL repo enabled, and you have
> >> installed the munge RPM from EPEL, then things should be working correctly.
> >>
> >> Since systemctl tells you that Munge service didn't start correctly, then it
> >> seems to me that you have a problem in the general configuration of your CentOS
> >> 7 system.  You should check /var/log/messages and "journalctl -xe" for munge
> >> errors.  It is really hard for other people to guess what may be wrong in your
> >> system.
> >>
> >> My 2 cents worth: Maybe you could make a fresh CentOS 7.8 installation on a
> >> test system and install the Munge service (and nothing else) according to
> >> instructions in https://wiki.fysik.dtu.dk/niflheim/Slurm_installation. This
> >> *really* has got to work!
> >>
> >> /Ole
> >>
> >>
> >> On 29-05-2020 10:23, Ferran Planas Padros wrote:
> >>> Hello everyone,
> >>>
> >>>
> >>> Here is everything I've done.
> >>>
> >>>
> >>> - About Ole's answer:
> >>>
> >>> Yes, we have slurm as the user to control munge. Following your comment, I
> >>> have changed the ownership of the munge files and tried to start munge as
> >>> the munge user. However, it also failed.
> >>>
> >>> Also, I first installed munge from a repository. I've seen your suggestion of
> >>> installing from EPEL, so I uninstalled and installed again. Same result.
> >>>
> >>> - About SELinux: It is disabled
> >>>
> >>> - The output of ps -ef | grep munge is:
> >>>
> >>>
> >>> root 5340 5153 0 10:18 pts/0 00:00:00 grep --color=auto *munge*
> >>>
> >>>
> >>> - The output of munge -n is:
> >>>
> >>>
> >>> Failed to access "/var/run/munge/munge.socket.2": No such file or directory
> >>>
> >>>
> >>> - Same for unmunge
> >>>
> >>>
> >>> - Output for sudo systemctl status --full munge
> >>>
> >>>
> >>> *●*munge.service - MUNGE authentication service
> >>>
> >>> Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset:
> >>> disabled)
> >>>
> >>> Active: *failed* (Result: exit-code) since Fri 2020-05-29 10:15:52 CEST; 4min
> >>> 18s ago
> >>>
> >>> Docs: man:munged(8)
> >>>
> >>> Process: 5333 ExecStart=/usr/sbin/munged *(code=exited, status=1/FAILURE)*
> >>>
> >>>
> >>> May 29 10:15:52 roos21.organ.su.se systemd[1]: Starting MUNGE authentication
> >>> service...
> >>>
> >>> May 29 10:15:52 roos21.organ.su.se systemd[1]: *munge.service: control process
> >>> exited, code=exited status=1*
> >>>
> >>> May 29 10:15:52 roos21.organ.su.se systemd[1]: *Failed to start MUNGE
> >>> authentication service.*
> >>>
> >>> May 29 10:15:52 roos21.organ.su.se systemd[1]: *Unit munge.service entered
> >>> failed state.*
> >>>
> >>> May 29 10:15:52 roos21.organ.su.se systemd[1]: *munge.service failed.*
> >>>
> >>>
> >>> - Regarding NTP, I get this message:
> >>>
> >>>
> >>> Unable to talk to NTP daemon. Is it running?
> >>>
> >>>
> >>> It is the same message I get on the nodes that DO work. All nodes are in sync in
> >>> time and date with the central node.
> >>>
> >>>
> >>>
> >>> ------------------------------------------------------------------------
> >>> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Ole
> >>> Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk>
> >>> *Sent:* Friday, May 29, 2020 9:56:10 AM
> >>> *To:* slurm-users at lists.schedmd.com
> >>> *Subject:* Re: [slurm-users] Problem with permisions. CentOS 7.8
> >>> On 29-05-2020 08:46, Sudeep Narayan Banerjee wrote:
> >>>> also check:
> >>>> a) whether NTP has been setup and communicating with master node
> >>>> b) iptables may be flushed (iptables -L)
> >>>> c) SeLinux to disabled, to check :
> >>>> getenforce
> >>>> vim /etc/sysconfig/selinux
> >>>> (change SELINUX=enforcing to SELINUX=disabled and save the file and reboot)
> >>>
> >>> There is no reason to disable SELinux for running the Munge service.
> >>> It's a pretty bad idea to lower the security just for the sake of
> >>> convenience!
> >>>
> >>> /Ole
> >>>
> >>>
> >>>> On Fri, May 29, 2020 at 12:08 PM Sudeep Narayan Banerjee
> >>>> <snbanerjee at iitgn.ac.in> wrote:
> >>>>
> >>>>      I have not checked on the CentOS7.8
> >>>>      a) if /var/run/munge folder does not exist then please double check
> >>>>      whether munge has been installed or not
> >>>>      b) user root or sudo user to do
> >>>>      ps -ef | grep munge
> >>>>      kill -9 <PID> //where PID is the Process ID for munge (if the
> >>>>      process is running at all); else
> >>>>
> >>>>      which munged
> >>>>      /etc/init.d/munge start
> >>>>
> >>>>      please let me know the output of:
> >>>>
> >>>>      $ munge -n
> >>>>
> >>>>      $ munge -n | unmunge
> >>>>
> >>>>      $ sudo systemctl status --full munge
> >>>>
> >>>>
> >>>>      Thanks & Regards,
> >>>>      Sudeep Narayan Banerjee
> >>>>      System Analyst | Scientist B
> >>>>      Indian Institute of Technology Gandhinagar
> >>>>      Gujarat, INDIA
> >>>>
> >>>>
> >>>>      On Fri, May 29, 2020 at 11:55 AM Bjørn-Helge Mevik
> >>>>      <b.h.mevik at usit.uio.no> wrote:
> >>>>
> >>>>          Ferran Planas Padros <ferran.padros at su.se> writes:
> >>>>
> >>>>           > I run the command as slurm user, and the /var/log/munge
> >>>>           > folder does belong to slurm.
> >>>>
> >>>>          For security reasons, I strongly advise that you run munged as a
> >>>>          separate user, which is unprivileged and not used for anything else.
> >>>>
> >>>>          --
> >>>>          Regards,
> >>>>          Bjørn-Helge Mevik, dr. scient,
> >>>>          Department for Research Computing, University of Oslo
>
>