[slurm-users] Problem with permisions. CentOS 7.8

Tue Jun 2 20:04:53 UTC 2020

Hi Ferran,

The Slurm RPMs built in the standard way will not cause any errors with 
Systemd daemons.  You should not have any troubles on a correctly 
installed Slurm node.  That is why I think you need to look at other 
problems in your setup.
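
One thing worth checking, as a sketch only: the slurm.service unit you quote below carries a SourcePath line, so systemd-sysv-generator built it from the SysV init script /etc/rc.d/init.d/slurm. The helper name unit_field is made up for illustration, and I am assuming a plain POSIX shell with sed:

```shell
#!/bin/sh
# unit_field FILE KEY: print the value(s) of the KEY= lines in a unit file.
# Hypothetical helper for illustration; it only reads the file.
unit_field() {
    sed -n "s/^$2=//p" "$1"
}

# Example use on the compute node (paths taken from the quoted message):
#   unit_field /run/systemd/generator.late/slurm.service SourcePath
#   unit_field /run/systemd/generator.late/slurm.service PIDFile
# To see which package, if any, owns the init script:
#   rpm -qf /etc/rc.d/init.d/slurm
```

If the init script turns out to be a leftover from an older installation, that would explain why the generated unit keeps coming back after every reboot.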

Which versions of Slurm do you run?

Which nodes still run the old CentOS 6, and which Slurm versions do they 
run?  You may have to upgrade them to CentOS 7.

Please understand that you must not mix very old Slurm versions with new 
ones, see 
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
The Slurm versions may only be mixed as follows:
slurmdbd >= slurmctld >= slurmd >= commands

So your CentOS 7.8 compute node running slurmd must not have a Slurm 
version newer than that of the slurmctld and slurmdbd nodes.
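
As a rough sketch of that rule, you could compare the version strings reported on each node. The function names below are my own and not part of Slurm, the version numbers in the example are hypothetical, and "sort -V" assumes GNU coreutils:

```shell
#!/bin/sh
# Hypothetical helpers, not part of Slurm: check the ordering
# slurmdbd >= slurmctld >= slurmd >= commands.

# version_ge A B: succeeds if version A is equal to or newer than version B.
# Relies on "sort -V" (version sort, GNU coreutils).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# check_slurm_mix DBD CTLD SLURMD CMDS: print "ok", or the first violated pair.
check_slurm_mix() {
    if ! version_ge "$1" "$2"; then
        echo "slurmdbd $1 is older than slurmctld $2"
        return 1
    fi
    if ! version_ge "$2" "$3"; then
        echo "slurmctld $2 is older than slurmd $3"
        return 1
    fi
    if ! version_ge "$3" "$4"; then
        echo "slurmd $3 is older than the commands $4"
        return 1
    fi
    echo "ok"
}

# Example with made-up versions:
#   check_slurm_mix 20.02.3 20.02.3 19.05.5 19.05.5
```

The daemons and commands report their version with the -V option, for example "slurmd -V" on the compute node and "sinfo -V" for the commands.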

/Ole

On 02-06-2020 18:54, Ferran Planas Padros wrote:
> However, I am setting up a compute node, not the master node, and 
> thus I have not installed slurmctld on it.
> 
> 
> After some digging, I have found that all these files:
> 
> /run/systemd/generator.late/slurm.service
> 
> /run/systemd/generator.late/runlevel5.target.wants/slurm.service
> 
> /run/systemd/generator.late/runlevel4.target.wants/slurm.service
> 
> /run/systemd/generator.late/runlevel3.target.wants/slurm.service
> 
> /run/systemd/generator.late/runlevel2.target.wants/slurm.service
> 
> 
> They are copies of each other, generated by systemd-sysv-generator, 
> and they point to slurmctld.pid, not to slurmd.pid:
> 
> 
> [Unit]
> Documentation=man:systemd-sysv-generator(8)
> SourcePath=/etc/rc.d/init.d/slurm
> Description=LSB: slurm daemon management
> Before=runlevel2.target
> Before=runlevel3.target
> Before=runlevel4.target
> Before=runlevel5.target
> Before=shutdown.target
> After=remote-fs.target
> After=network-online.target
> After=munge.service
> After=nss-lookup.target
> After=network-online.target
> Wants=network-online.target
> Conflicts=shutdown.target
> 
> [Service]
> Type=forking
> Restart=no
> TimeoutSec=5min
> IgnoreSIGPIPE=no
> KillMode=process
> GuessMainPID=no
> RemainAfterExit=no
> *PIDFile=/var/run/slurmctld.pid*
> ExecStart=/etc/rc.d/init.d/slurm start
> ExecStop=/etc/rc.d/init.d/slurm stop
> 
> 
> 
> How can I avoid this, other than by editing the files manually? Manual 
> edits are reverted after a reboot.
> 
> 
> Thanks,
> 
> Ferran
> 
> 
> ------------------------------------------------------------------------
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of 
> Rodrigo Santibáñez <rsantibanez.uchile at gmail.com>
> *Sent:* Tuesday, June 2, 2020 6:40:48 PM
> *To:* Slurm User Community List
> *Subject:* Re: [slurm-users] Problem with permisions. CentOS 7.8
> Yes, you have both daemons, installed with the slurm rpm. The slurmd (on all 
> nodes) communicates with slurmctld (which runs on the main master node and, 
> optionally, on a backup node).
> 
> You do not need to run slurmd as the slurm user. Use `systemctl enable 
> slurmctld` (and slurmd) followed by `systemctl start slurmctld`. After a 
> configuration change, use restart instead of start only if `sudo 
> scontrol reconfigure` asks for it.
> 
> If you run `slurmctld -Dvvvv` and `slurmd -Dvvvv` as root, the debug 
> output will show any further configuration problems. The slurmd 
> needs slurmctld running, or it will output "error: Unable to register: 
> Unable to contact slurm controller (connect failure)"
> 
> You should find the services here:
> -rw-r--r-- 1 root root 339 may 30 20:18 
> /usr/lib/systemd/system/slurmctld.service
> -rw-r--r-- 1 root root 342 may 30 20:18 
> /usr/lib/systemd/system/slurmdbd.service
> -rw-r--r-- 1 root root 398 may 30 20:18 
> /usr/lib/systemd/system/slurmd.service
> 
> Feel free to ask for more information,
> Best regards
> 
> On Tue, Jun 2, 2020 at 11:12, Ferran Planas Padros 
> (<ferran.padros at su.se>) wrote:
> 
> 
>     Hi Ole,
> 
> 
>     Thanks for your answer and your time. I'd appreciate it if you, or
>     someone else, could take a final look at my case.
> 
>     After your suggestions and comments, I have redone the whole
>     installation of Munge and Slurm. I uninstalled and removed all
>     previous rpms and restarted from scratch. Munge works with no
>     problem, but Slurm does not (I installed it following the
>     instructions in the link you sent).
> 
> 
>     - If I run /usr/bin/slurmd -D vvvvv as root user, I get verbose
>     output up to the line 'slurmd: debug2: No acct_gather.conf file
>     (/etc/slurm/acct_gather.conf)', where the output stops. After I press
>     Ctrl+C, I get:
> 
> 
>     slurmd: all threads complete
> 
>     slurmd: Consumable Resources (CR) Node Selection plugin shutting
>     down ...
> 
>     slurmd: Munge cryptographic signature plugin unloaded
> 
>     slurmd: Slurmd shutdown completing
> 
> 
>     - After that, if I run 'systemctl start slurmd' and 'systemctl
>     status slurmd', also as root user, I get:
> 
>     ● slurmd.service - Slurm node daemon
>     Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: disabled)
>     Active: active (running) since Tue 2020-06-02 16:53:51 CEST; 33s ago
>     Process: 2750 ExecStart=/usr/sbin/slurmd -d /usr/sbin/slurmstepd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
>     Main PID: 2752 (slurmd)
>     CGroup: /system.slice/slurmd.service
>     └─2752 /usr/sbin/slurmd -d /usr/sbin/slurmstepd
> 
>     Jun 02 16:53:51 roos21.organ.su.se systemd[1]: Starting Slurm node daemon...
>     Jun 02 16:53:51 roos21.organ.su.se systemd[1]: Can't open PID file /var/run/slurm/slurmd.pid (yet?) after start: No such file or directory
>     Jun 02 16:53:51 roos21.organ.su.se systemd[1]: Started Slurm node daemon.
> 
> 
>     - Next, I kill the slurmd process and run, as the slurm user,
>     'systemctl start slurm'. This does not work, and journalctl -xe
>     shows the following:
> 
> 
>     Jun 02 16:56:01 roos21.organ.su.se systemd[1]: Starting LSB: slurm daemon management...
>     -- Subject: Unit slurm.service has begun start-up
>     -- Defined-By: systemd
>     -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
>     --
>     -- Unit slurm.service has begun starting up.
>     Jun 02 16:56:01 roos21.organ.su.se slurm[2805]: starting slurmd: [OK]
>     Jun 02 16:56:01 roos21.organ.su.se systemd[1]: Can't open PID file /var/run/slurmctld.pid (yet?) after start: No such file or directory
>     Jun 02 16:56:37 roos21.organ.su.se polkitd[1316]: Unregistered Authentication Agent for unix-process:2792:334647 (system bus name :1.46, object path /org/freedesktop
>     Jun 02 16:56:38 roos21.organ.su.se sudo[2790]: pam_unix(sudo:session): session closed for user slurm
> 
> 
>     I don't really understand this, because I have not installed
>     slurmctld. The slurmctld.service file does not even exist.
> 
> 
>     Any idea?
> 
> 
>     Many thanks,
> 
>     Ferran
> 
> 
> 
>     ------------------------------------------------------------------------
>     *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
>     Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk>
>     *Sent:* Tuesday, June 2, 2020 12:03:27 PM
>     *To:* Slurm User Community List
>     *Subject:* Re: [slurm-users] Problem with permisions. CentOS 7.8
>     Hi Ferran,
> 
>     Please install Slurm software in the standard way, see
>     https://wiki.fysik.dtu.dk/niflheim/Slurm_installation
> 
>     It seems that you have some unusual way to manage your Linux
>     systems.  In Stockholm and Sweden there are many Slurm experts at
>     the HPC centers who might be able to help you more directly.
> 
>     Best regards,
>     Ole
> 
>     On 6/2/20 11:58 AM, Ferran Planas Padros wrote:
>     > I did a fresh installation with the EPEL repo, installing munge from 
>     > it, and it worked. Having the slurm user run munge was definitely a 
>     > problem, but that is the setup we have on CentOS 6. Now I've learnt 
>     > my lesson for future installations, thanks to everyone!
>     > 
>     > 
>     > Now, I have a follow up question, if you don't mind. I am now trying to 
>     > run slurm, and it crashes:
>     > 
>     > 
>     > [root at roos21 ~]# systemctl status slurm.service
>     > 
>     > ● slurm.service - LSB: slurm daemon management
>     > Loaded: loaded (/etc/rc.d/init.d/slurm; bad; vendor preset: disabled)
>     > Active: failed (Result: protocol) since Tue 2020-06-02 11:45:33 CEST; 3min 33s ago
>     > Docs: man:systemd-sysv-generator(8)
>     > 
>     > Jun 02 11:45:33 roos21.organ.su.se systemd[1]: Starting LSB: slurm daemon management...
>     > Jun 02 11:45:33 roos21.organ.su.se slurm[18223]: starting slurmd: [OK]
>     > Jun 02 11:45:33 roos21.organ.su.se systemd[1]: Can't open PID file /var/run/slurmctld.pid (yet?) after start: No such file or directory
>     > Jun 02 11:45:33 roos21.organ.su.se systemd[1]: Failed to start LSB: slurm daemon management.
>     > Jun 02 11:45:33 roos21.organ.su.se systemd[1]: Unit slurm.service entered failed state.
>     > Jun 02 11:45:33 roos21.organ.su.se systemd[1]: slurm.service failed.
>     > 
>     > 
>     > 
>     > The thing is that this is a compute node, not the master node, so 
>     > slurmctld is not installed. Why do I get this error?
>     > 
>     > 
>     > Many thanks, and my apologies for these rather simple questions. I am a 
>     > newbie at this.
>     > 
>     > 
>     > Best,
>     > 
>     > Ferran
>     > 
>     > --------------------------------------------------------------------------
>     > *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
>     > Renata Maria Dart <renata at slac.stanford.edu>
>     > *Sent:* Friday, May 29, 2020 6:33:58 PM
>     > *To:* Ole.H.Nielsen at fysik.dtu.dk; Slurm User Community List
>     > *Subject:* Re: [slurm-users] Problem with permisions. CentOS 7.8
>     > Hi, I don't know if this might be your problem, but I ran into an issue
>     > on CentOS 7.8 where /var/run/munge was not being created at boot time
>     > because I didn't have the munge user in the local password file.  I
>     > have the munge user in AD, and once the system is up I can start munge
>     > successfully, but AD wasn't available early enough during boot for the
>     > munge startup to see it.  I added these lines to the munge systemd
>     > unit file:
>     > 
>     > PermissionsStartOnly=true
>     > ExecStartPre=-/usr/bin/mkdir -m 0755 -p /var/run/munge
>     > ExecStartPre=-/usr/bin/chown -R munge:munge /var/run/munge
>     > 
>     > and my system now starts munge up fine during a reboot.
>     > 
>     > Renata
>     > 
>     > On Fri, 29 May 2020, Ole Holm Nielsen wrote:
>     > 
>     >> Hi Ferran,
>     >>
>     >> When you have a CentOS 7 system with the EPEL repo enabled, and you have
>     >> installed the munge RPM from EPEL, then things should be working correctly.
>     >>
>     >> Since systemctl tells you that Munge service didn't start correctly, then it
>     >> seems to me that you have a problem in the general configuration of your CentOS
>     >> 7 system.  You should check /var/log/messages and "journalctl -xe" for munge
>     >> errors.  It is really hard for other people to guess what may be wrong in your
>     >> system.
>     >>
>     >> My 2 cents worth: Maybe you could make a fresh CentOS 7.8 installation on a
>     >> test system and install the Munge service (and nothing else) according to
>     >> instructions in https://wiki.fysik.dtu.dk/niflheim/Slurm_installation.  This
>     >> *really* has got to work!
>     >>
>     >> /Ole
>     >>
>     >>
>     >> On 29-05-2020 10:23, Ferran Planas Padros wrote:
>     >>> Hello everyone,
>     >>>
>     >>>
>     >>> Here is everything I've done.
>     >>>
>     >>>
>     >>> - About Ole's answer:
>     >>>
>     >>> Yes, we have slurm as the user to control munge. Following your comment, I
>     >>> have changed the ownership of the munge files and tried to start munge as
>     >>> munge user. However, it also failed.
>     >>>
>     >>> Also, I first installed munge from a repository. I've seen your suggestion of
>     >>> installing from EPEL, so I uninstalled it and installed it again. Same result.
>     >>>
>     >>> - About SELinux: it is disabled.
>     >>>
>     >>> - The output of ps -ef | grep munge is:
>     >>>
>     >>>
>     >>> root      5340  5153  0 10:18 pts/0    00:00:00 grep --color=auto munge
>     >>>
>     >>>
>     >>> - The output of munge -n is:
>     >>>
>     >>>
>     >>> Failed to access "/var/run/munge/munge.socket.2": No such file or directory
>     >>>
>     >>>
>     >>> - Same for unmunge
>     >>>
>     >>>
>     >>> - Output for sudo systemctl status --full munge
>     >>>
>     >>>
>     >>> ● munge.service - MUNGE authentication service
>     >>> Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
>     >>> Active: failed (Result: exit-code) since Fri 2020-05-29 10:15:52 CEST; 4min 18s ago
>     >>> Docs: man:munged(8)
>     >>> Process: 5333 ExecStart=/usr/sbin/munged (code=exited, status=1/FAILURE)
>     >>>
>     >>> May 29 10:15:52 roos21.organ.su.se systemd[1]: Starting MUNGE authentication service...
>     >>> May 29 10:15:52 roos21.organ.su.se systemd[1]: munge.service: control process exited, code=exited status=1
>     >>> May 29 10:15:52 roos21.organ.su.se systemd[1]: Failed to start MUNGE authentication service.
>     >>> May 29 10:15:52 roos21.organ.su.se systemd[1]: Unit munge.service entered failed state.
>     >>> May 29 10:15:52 roos21.organ.su.se systemd[1]: munge.service failed.
>     >>>
>     >>>
>     >>> - Regarding NTP, I get this message:
>     >>>
>     >>>
>     >>> Unable to talk to NTP daemon. Is it running?
>     >>>
>     >>>
>     >>> It is the same message I get on the nodes that DO work. All nodes are in sync
>     >>> in time and date with the central node.
>     >>>
>     >>>
>     >>> ------------------------------------------------------------------------
>     >>> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
>     >>> Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk>
>     >>> *Sent:* Friday, May 29, 2020 9:56:10 AM
>     >>> *To:* slurm-users at lists.schedmd.com
>     >>> *Subject:* Re: [slurm-users] Problem with permisions. CentOS 7.8
>     >>> On 29-05-2020 08:46, Sudeep Narayan Banerjee wrote:
>     >>>> also check:
>     >>>> a) whether NTP has been set up and is communicating with the master node
>     >>>> b) the iptables rules, which may need to be flushed (iptables -L lists them)
>     >>>> c) whether SELinux is set to disabled; to check:
>     >>>> getenforce
>     >>>> vim /etc/sysconfig/selinux
>     >>>> (change SELINUX=enforcing to SELINUX=disabled, save the file and reboot)
>     >>>
>     >>> There is no reason to disable SELinux for running the Munge service.
>     >>> It's a pretty bad idea to lower the security just for the sake of
>     >>> convenience!
>     >>>
>     >>> /Ole
>     >>>
>     >>>
>     >>>> On Fri, May 29, 2020 at 12:08 PM Sudeep Narayan Banerjee
>     >>>> <snbanerjee at iitgn.ac.in> wrote:
>     >>>>
>     >>>>      I have not checked on CentOS 7.8
>     >>>>      a) if the /var/run/munge folder does not exist, then please double check
>     >>>>      whether munge has been installed or not
>     >>>>      b) as root or a sudo user, run
>     >>>>      ps -ef | grep munge
>     >>>>      kill -9 <PID> //where PID is the Process ID for munge (if the
>     >>>>      process is running at all); else
>     >>>>
>     >>>>      which munged
>     >>>>      /etc/init.d/munge start
>     >>>>
>     >>>>      please let me know the output of:
>     >>>>
>     >>>>      $ munge -n
>     >>>>
>     >>>>      $ munge -n | unmunge
>     >>>>
>     >>>>      $ sudo systemctl status --full munge
>     >>>>
>     >>>>      Thanks & Regards,
>     >>>>      Sudeep Narayan Banerjee
>     >>>>      System Analyst | Scientist B
>     >>>>      Indian Institute of Technology Gandhinagar
>     >>>>      Gujarat, INDIA
>     >>>>
>     >>>>
>     >>>>      On Fri, May 29, 2020 at 11:55 AM Bjørn-Helge Mevik
>     >>>>      <b.h.mevik at usit.uio.no> wrote:
>     >>>>
>     >>>>          Ferran Planas Padros <ferran.padros at su.se> writes:
>     >>>>
>     >>>>           > I run the command as slurm user, and the /var/log/munge
>     >>>>          folder does belong to slurm.
>     >>>>
>     >>>>          For security reasons, I strongly advise that you run munged as a
>     >>>>          separate user, which is unprivileged and not used for anything else.
>     >>>>
>     >>>>          -- 
>     >>>>          Regards,
>     >>>>          Bjørn-Helge Mevik, dr. scient,
>     >>>>          Department for Research Computing, University of Oslo
>