[slurm-users] Problem with permissions. CentOS 7.8
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Tue Jun 2 20:04:53 UTC 2020
Hi Ferran,
The Slurm RPMs built in the standard way do not cause any errors with
systemd services, so you should not have any trouble on a correctly
installed Slurm node. That is why I think you need to look for other
problems in your setup.
Which versions of Slurm do you run?
Which nodes run the old CentOS 6, and which Slurm versions? You may have
to upgrade them to CentOS 7.
Please understand that you must not mix very old Slurm versions with new
ones, see
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
The Slurm versions may only be mixed as follows:
slurmdbd >= slurmctld >= slurmd >= commands
So your CentOS 7.8 compute node running slurmd must not have a Slurm
version newer than that of the slurmctld and slurmdbd nodes.
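The ordering rule above can be checked mechanically with GNU sort's version comparison; a minimal sketch (the version numbers below are only illustrative, not from any real cluster):

```shell
# Succeeds when the versions given (oldest-allowed first, i.e. commands,
# slurmd, slurmctld, slurmdbd) are in non-decreasing order, matching the
# rule slurmdbd >= slurmctld >= slurmd >= commands.
version_order_ok() {
    printf '%s\n' "$@" | sort -V -C
}

# Illustrative versions only; on a real cluster collect them with
# `slurmd -V`, `slurmctld -V`, and `slurmdbd -V` on the respective nodes.
if version_order_ok "20.02.3" "20.02.3" "20.11.9" "20.11.9"; then
    echo "version ordering OK"
else
    echo "version ordering violates slurmdbd >= slurmctld >= slurmd >= commands" >&2
fi
```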
/Ole
On 02-06-2020 18:54, Ferran Planas Padros wrote:
> However, I am setting up a compute node, not the master node, and
> thus I have not installed slurmctld on it.
>
>
> After some digging, I have found that all these files:
>
> /run/systemd/generator.late/slurm.service
>
> /run/systemd/generator.late/runlevel5.target.wants/slurm.service
>
> /run/systemd/generator.late/runlevel4.target.wants/slurm.service
>
> /run/systemd/generator.late/runlevel3.target.wants/slurm.service
>
> /run/systemd/generator.late/runlevel2.target.wants/slurm.service
>
>
> which are copies of each other and are generated by
> systemd-sysv-generator, point to slurmctld.pid, not to slurmd.pid:
>
>
> [Unit]
> Documentation=man:systemd-sysv-generator(8)
> SourcePath=/etc/rc.d/init.d/slurm
> Description=LSB: slurm daemon management
> Before=runlevel2.target
> Before=runlevel3.target
> Before=runlevel4.target
> Before=runlevel5.target
> Before=shutdown.target
> After=remote-fs.target
> After=network-online.target
> After=munge.service
> After=nss-lookup.target
> After=network-online.target
> Wants=network-online.target
> Conflicts=shutdown.target
>
> [Service]
> Type=forking
> Restart=no
> TimeoutSec=5min
> IgnoreSIGPIPE=no
> KillMode=process
> GuessMainPID=no
> RemainAfterExit=no
> PIDFile=/var/run/slurmctld.pid
> ExecStart=/etc/rc.d/init.d/slurm start
> ExecStop=/etc/rc.d/init.d/slurm stop
>
>
>
> How can I avoid this, other than by editing the files manually?
> They revert to the original after every reboot.
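For reference, the generated unit quoted above cannot be edited persistently because systemd-sysv-generator recreates it from the legacy SysV script at every boot. A hedged sketch of the usual cleanup, assuming the packaged slurmd.service from the Slurm RPMs is used instead:

```shell
# The files under /run/systemd/generator.late/ are regenerated from
# /etc/rc.d/init.d/slurm at every boot, so editing them never sticks.
# Removing the legacy init script makes the generated unit disappear.

has_legacy_init() { [ -e "$1" ]; }

if has_legacy_init /etc/rc.d/init.d/slurm; then
    echo "legacy init script present; as root one would run:"
    echo "  chkconfig slurm off"
    echo "  rm /etc/rc.d/init.d/slurm"
    echo "  systemctl daemon-reload   # generated slurm.service goes away"
else
    echo "no legacy init script found; nothing to clean up"
fi
```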
>
>
> Thanks,
>
> Ferran
>
>
> ------------------------------------------------------------------------
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
> Rodrigo Santibáñez <rsantibanez.uchile at gmail.com>
> *Sent:* Tuesday, June 2, 2020 6:40:48 PM
> *To:* Slurm User Community List
> *Subject:* Re: [slurm-users] Problem with permissions. CentOS 7.8
> Yes, you have both daemons, installed from the Slurm RPMs. slurmd (on all
> nodes) communicates with slurmctld (which runs on the main master node
> and, optionally, on a backup node).
>
> You do not need to run slurmd as the slurm user. Use `systemctl enable
> slurmctld` (and slurmd) followed by `systemctl start slurmctld`. If you
> change the configuration, use restart instead of start, and only if `sudo
> scontrol reconfigure` asks for it.
>
> If you run `slurmctld -Dvvvv` and `slurmd -Dvvvv` as root, you'll get
> debug output that helps reveal configuration problems. slurmd needs
> slurmctld to be running, or it will report "error: Unable to register:
> Unable to contact slurm controller (connect failure)".
>
> You should find the services here:
> -rw-r--r-- 1 root root 339 may 30 20:18
> /usr/lib/systemd/system/slurmctld.service
> -rw-r--r-- 1 root root 342 may 30 20:18
> /usr/lib/systemd/system/slurmdbd.service
> -rw-r--r-- 1 root root 398 may 30 20:18
> /usr/lib/systemd/system/slurmd.service
>
> Feel free to ask for more information,
> Best regards
>
> On Tue, Jun 2, 2020 at 11:12, Ferran Planas Padros
> (<ferran.padros at su.se>) wrote:
>
>
> Hi Ole,
>
>
> Thanks for your answer and your time. I'd appreciate it if you, or
> someone else, could take a final look at my case.
>
> Following your suggestions and comments, I have re-done the whole
> installation of Munge and Slurm: I uninstalled and removed all
> previous RPMs and restarted from scratch. Munge now works without
> problems, but the same is not true for Slurm (for which I have used
> the instructions given in the link you attached).
>
>
> - If I run /usr/bin/slurmd -Dvvvvv as root, the verbose output stops
> at the line 'slurmd: debug2: No acct_gather.conf file
> (/etc/slurm/acct_gather.conf)'. After I press Ctrl+C, I get:
>
>
> slurmd: all threads complete
>
> slurmd: Consumable Resources (CR) Node Selection plugin shutting
> down ...
>
> slurmd: Munge cryptographic signature plugin unloaded
>
> slurmd: Slurmd shutdown completing
>
>
> - After that, if I run 'systemctl start slurmd' and 'systemctl
> status slurmd', also as root user, I get:
>
> ● slurmd.service - Slurm node daemon
>
> Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor
> preset: disabled)
>
> Active: active (running) since Tue 2020-06-02 16:53:51 CEST; 33s ago
>
> Process: 2750 ExecStart=/usr/sbin/slurmd -d /usr/sbin/slurmstepd
> $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
>
> Main PID: 2752 (slurmd)
>
> CGroup: /system.slice/slurmd.service
>
> └─2752 /usr/sbin/slurmd -d /usr/sbin/slurmstepd
>
>
> Jun 02 16:53:51 roos21.organ.su.se systemd[1]: Starting Slurm node
> daemon...
>
> Jun 02 16:53:51 roos21.organ.su.se systemd[1]: Can't open PID file
> /var/run/slurm/slurmd.pid (yet?) after start: No such file or directory
>
> Jun 02 16:53:51 roos21.organ.su.se systemd[1]: Started Slurm node daemon.
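A side note on the "Can't open PID file" warning above: it typically means the PIDFile= path in slurmd.service and the SlurmdPidFile setting in slurm.conf disagree. A minimal sketch of comparing the two values; the first path comes from the log above, the second is a hypothetical slurm.conf line, so treat both as illustrative:

```shell
# Extract the value after '=' from a "Key=Value" config line, ignoring spaces.
conf_value() {
    printf '%s\n' "$1" | cut -d= -f2- | tr -d ' '
}

# Illustrative settings: the unit file's PIDFile (from the log above) and a
# hypothetical SlurmdPidFile line from slurm.conf.
unit_pidfile=$(conf_value "PIDFile=/var/run/slurm/slurmd.pid")
conf_pidfile=$(conf_value "SlurmdPidFile=/var/run/slurmd.pid")

if [ "$unit_pidfile" = "$conf_pidfile" ]; then
    echo "PID file paths agree"
else
    echo "mismatch: unit=$unit_pidfile conf=$conf_pidfile"
fi

# On the node itself, compare the live settings, e.g.:
#   grep -i '^SlurmdPidFile' /etc/slurm/slurm.conf
#   systemctl cat slurmd | grep -i '^PIDFile'
# then make them match and run: systemctl daemon-reload && systemctl restart slurmd
```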
>
>
> - Next, I kill the slurmd process and run 'systemctl start slurm' as
> the slurm user. This does not work, and journalctl -xe shows the
> following:
>
>
> Jun 02 16:56:01 roos21.organ.su.se systemd[1]: Starting LSB: slurm
> daemon management...
>
> -- Subject: Unit slurm.service has begun start-up
>
> -- Defined-By: systemd
>
> -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
>
> --
>
> -- Unit slurm.service has begun starting up.
>
> Jun 02 16:56:01 roos21.organ.su.se slurm[2805]: starting slurmd: [OK]
>
> Jun 02 16:56:01 roos21.organ.su.se systemd[1]: Can't open PID file
> /var/run/slurmctld.pid (yet?) after start: No such file or directory
>
> Jun 02 16:56:37 roos21.organ.su.se polkitd[1316]: Unregistered
> Authentication Agent for unix-process:2792:334647 (system bus name
> :1.46, object path /org/freedesktop
>
> Jun 02 16:56:38 roos21.organ.su.se sudo[2790]: pam_unix(sudo:session):
> session closed for user slurm
>
>
> I don't really understand this, because I have not installed
> slurmctld; the slurmctld.service file does not even exist.
>
>
> Any idea?
>
>
> Many thanks,
>
> Ferran
>
>
>
> ------------------------------------------------------------------------
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf
> of Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk>
> *Sent:* Tuesday, June 2, 2020 12:03:27 PM
> *To:* Slurm User Community List
> *Subject:* Re: [slurm-users] Problem with permissions. CentOS 7.8
> Hi Ferran,
>
> Please install Slurm software in the standard way, see
> https://wiki.fysik.dtu.dk/niflheim/Slurm_installation
>
> It seems that you manage your Linux systems in some unusual way. In
> Stockholm and Sweden there are many Slurm experts at the HPC centers
> who might be able to help you more directly.
>
> Best regards,
> Ole
>
> On 6/2/20 11:58 AM, Ferran Planas Padros wrote:
> > I did a fresh installation with the EPEL repo enabled, installed munge
> > from it, and it worked. Having the slurm user own munge was definitely a
> > problem, but that is the setup we have on CentOS 6. Now I've learnt
> > my lesson for future installations, thanks to everyone!
> >
> >
> > Now, I have a follow up question, if you don't mind. I am now trying to
> > run slurm, and it crashes:
> >
> >
> > [root at roos21 ~]# systemctl status slurm.service
> >
> > ● slurm.service - LSB: slurm daemon management
> >
> > Loaded: loaded (/etc/rc.d/init.d/slurm; bad; vendor preset: disabled)
> >
> > Active: failed (Result: protocol) since Tue 2020-06-02 11:45:33 CEST;
> > 3min 33s ago
> >
> > Docs: man:systemd-sysv-generator(8)
> >
> >
> > Jun 02 11:45:33 roos21.organ.su.se systemd[1]: Starting LSB: slurm
> > daemon management...
> >
> > Jun 02 11:45:33 roos21.organ.su.se slurm[18223]: starting slurmd: [OK]
> >
> > Jun 02 11:45:33 roos21.organ.su.se systemd[1]: Can't open PID file
> > /var/run/slurmctld.pid (yet?) after start: No such file or directory
> >
> > Jun 02 11:45:33 roos21.organ.su.se systemd[1]: Failed to start LSB:
> > slurm daemon management.
> >
> > Jun 02 11:45:33 roos21.organ.su.se systemd[1]: Unit slurm.service
> > entered failed state.
> >
> > Jun 02 11:45:33 roos21.organ.su.se systemd[1]: slurm.service failed.
> >
> >
> >
> > The thing is that this is a compute node, not the master node, so
> > slurmctld is not installed. Why do I get this error?
> >
> >
> > Many thanks, and my apologies for these rather simple questions. I am a
> > newbie at this.
> >
> >
> > Best,
> >
> > Ferran
> >
> > --------------------------------------------------------------------------
> > *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
> > Renata Maria Dart <renata at slac.stanford.edu>
> > *Sent:* Friday, May 29, 2020 6:33:58 PM
> > *To:* Ole.H.Nielsen at fysik.dtu.dk; Slurm User Community List
> > *Subject:* Re: [slurm-users] Problem with permissions. CentOS 7.8
> > Hi, I don't know if this might be your problem, but I ran into an issue
> > on CentOS 7.8 where /var/run/munge was not being created at boot time
> > because I didn't have the munge user in the local password file. I
> > have the munge user in AD, and once the system is up I can start munge
> > successfully, but AD wasn't available early enough during boot for the
> > munge startup to see it. I added these lines to the munge systemd unit
> > file:
> >
> > PermissionsStartOnly=true
> > ExecStartPre=-/usr/bin/mkdir -m 0755 -p /var/run/munge
> > ExecStartPre=-/usr/bin/chown -R munge:munge /var/run/munge
> >
> > and my system now starts munge up fine during a reboot.
> >
> > Renata
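Rather than editing the packaged munge unit file in place (which an RPM update would overwrite), the same lines can live in a systemd drop-in. A sketch: for demonstration it writes to a scratch directory, while on a real node you would set DROPIN_ROOT=/etc/systemd/system, run as root, and finish with `systemctl daemon-reload`:

```shell
# Create a drop-in that pre-creates /var/run/munge before munged starts.
# DROPIN_ROOT defaults to a temporary directory so the sketch is safe to run.
DROPIN_ROOT="${DROPIN_ROOT:-$(mktemp -d)}"
mkdir -p "$DROPIN_ROOT/munge.service.d"
cat > "$DROPIN_ROOT/munge.service.d/rundir.conf" <<'EOF'
[Service]
PermissionsStartOnly=true
ExecStartPre=-/usr/bin/mkdir -m 0755 -p /var/run/munge
ExecStartPre=-/usr/bin/chown -R munge:munge /var/run/munge
EOF
echo "wrote $DROPIN_ROOT/munge.service.d/rundir.conf"
```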
> >
> > On Fri, 29 May 2020, Ole Holm Nielsen wrote:
> >
> >> Hi Ferran,
> >>
> >> When you have a CentOS 7 system with the EPEL repo enabled, and you have
> >> installed the munge RPM from EPEL, then things should be working correctly.
> >>
> >> Since systemctl tells you that Munge service didn't start correctly, then it
> >> seems to me that you have a problem in the general configuration of your CentOS
> >> 7 system. You should check /var/log/messages and "journalctl -xe" for munge
> >> errors. It is really hard for other people to guess what may be wrong in your
> >> system.
> >>
> >> My 2 cents worth: Maybe you could make a fresh CentOS 7.8 installation on a
> >> test system and install the Munge service (and nothing else) according to
> >> instructions in https://wiki.fysik.dtu.dk/niflheim/Slurm_installation. This
> >> *really* has got to work!
> >>
> >> /Ole
> >>
> >>
> >> On 29-05-2020 10:23, Ferran Planas Padros wrote:
> >>> Hello everyone,
> >>>
> >>>
> >>> Here it comes everything I've done.
> >>>
> >>>
> >>> - About Ole's answer:
> >>>
> >>> Yes, we have slurm as the user to control munge. Following your comment, I
> >>> have changed the ownership of the munge files and tried to start munge as
> >>> munge user. However, it also failed.
> >>>
> >>> Also, I first installed munge from a repository. I've seen your suggestion
> >>> of installing from EPEL, so I uninstalled and installed again. Same result.
> >>>
> >>> - About SELinux: It is disabled.
> >>>
> >>> - The output of ps -ef | grep munge is:
> >>>
> >>>
> >>> root      5340  5153  0 10:18 pts/0    00:00:00 grep --color=auto munge
> >>>
> >>>
> >>> - The output of munge -n is:
> >>>
> >>>
> >>> Failed to access "/var/run/munge/munge.socket.2": No such file or directory
> >>>
> >>>
> >>> - Same for unmunge
> >>>
> >>>
> >>> - Output for sudo systemctl status --full munge
> >>>
> >>>
> >>> ● munge.service - MUNGE authentication service
> >>>
> >>> Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor
> >>> preset: disabled)
> >>>
> >>> Active: failed (Result: exit-code) since Fri 2020-05-29 10:15:52 CEST;
> >>> 4min 18s ago
> >>>
> >>> Docs: man:munged(8)
> >>>
> >>> Process: 5333 ExecStart=/usr/sbin/munged (code=exited, status=1/FAILURE)
> >>>
> >>>
> >>> May 29 10:15:52 roos21.organ.su.se systemd[1]: Starting MUNGE
> >>> authentication service...
> >>>
> >>> May 29 10:15:52 roos21.organ.su.se systemd[1]: munge.service: control
> >>> process exited, code=exited status=1
> >>>
> >>> May 29 10:15:52 roos21.organ.su.se systemd[1]: Failed to start MUNGE
> >>> authentication service.
> >>>
> >>> May 29 10:15:52 roos21.organ.su.se systemd[1]: Unit munge.service
> >>> entered failed state.
> >>>
> >>> May 29 10:15:52 roos21.organ.su.se systemd[1]: munge.service failed.
> >>>
> >>>
> >>> - Regarding NTP, I get this message:
> >>>
> >>>
> >>> Unable to talk to NTP daemon. Is it running?
> >>>
> >>>
> >>> It is the same message I get on the nodes that DO work. All nodes are
> >>> synced in time and date with the central node.
> >>>
> >>>
> >>> ------------------------------------------------------------------------
> >>> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf
> >>> of Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk>
> >>> *Sent:* Friday, May 29, 2020 9:56:10 AM
> >>> *To:* slurm-users at lists.schedmd.com
> >>> *Subject:* Re: [slurm-users] Problem with permissions. CentOS 7.8
> >>> On 29-05-2020 08:46, Sudeep Narayan Banerjee wrote:
> >>>> also check:
> >>>> a) whether NTP has been setup and communicating with master node
> >>>> b) iptables may be flushed (iptables -L)
> >>>> c) SeLinux to disabled, to check :
> >>>> getenforce
> >>>> vim /etc/sysconfig/selinux
> >>>> (change SELINUX=enforcing to SELINUX=disabled and save the file and reboot)
> >>>
> >>> There is no reason to disable SELinux for running the Munge service.
> >>> It's a pretty bad idea to lower the security just for the sake of
> >>> convenience!
> >>>
> >>> /Ole
> >>>
> >>>
> >>>> On Fri, May 29, 2020 at 12:08 PM Sudeep Narayan Banerjee
> >>>> <snbanerjee at iitgn.ac.in> wrote:
> >>>>
> >>>> I have not checked on CentOS 7.8.
> >>>> a) if the /var/run/munge folder does not exist, please double-check
> >>>> whether munge has been installed or not
> >>>> b) as root or a sudo user, run:
> >>>> ps -ef | grep munge
> >>>> kill -9 <PID> // where PID is the process ID for munge (if the
> >>>> process is running at all); else:
> >>>>
> >>>> which munged
> >>>> /etc/init.d/munge start
> >>>>
> >>>> please let me know the output of:
> >>>>
> >>>> $ munge -n
> >>>>
> >>>> $ munge -n | unmunge
> >>>>
> >>>> $ sudo systemctl status --full munge
> >>>>
> >>>> Thanks & Regards,
> >>>> Sudeep Narayan Banerjee
> >>>> System Analyst | Scientist B
> >>>> Indian Institute of Technology Gandhinagar
> >>>> Gujarat, INDIA
> >>>>
> >>>>
> >>>> On Fri, May 29, 2020 at 11:55 AM Bjørn-Helge Mevik
> >>>> <b.h.mevik at usit.uio.no> wrote:
> >>>>
> >>>> Ferran Planas Padros <ferran.padros at su.se> writes:
> >>>>
> >>>> > I run the command as slurm user, and the /var/log/munge
> >>>> folder does belong to slurm.
> >>>>
> >>>> For security reasons, I strongly advise that you run munged as a
> >>>> separate user, which is unprivileged and not used for anything else.
> >>>>
> >>>> --
> >>>> Regards,
> >>>> Bjørn-Helge Mevik, dr. scient,
> >>>> Department for Research Computing, University of Oslo
>