[slurm-users] Problem with permisions. CentOS 7.8

Jim Prewett download at carc.unm.edu
Tue Jun 2 17:32:38 UTC 2020


Hi Ferran,

You're right that editing the files under /run/systemd will not persist 
after rebooting.  I'm pretty sure the files that you're looking for are in 
/usr/lib/systemd/system

This page has a nice writeup on the locations of the systemd-related 
files:
https://www.digitalocean.com/community/tutorials/understanding-systemd-units-and-unit-files

It suggests that you put any modified files into /etc/systemd/system/.
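
A rough sketch of what that could look like, for what it's worth. The function name persist_unit is made up for illustration, and the PID file path is a placeholder: use whatever SlurmdPidFile is set to in your slurm.conf. The sketch copies the generated unit into a persistent directory and rewrites only its PIDFile= line:

```shell
#!/bin/sh
# Sketch only: copy a generated unit file into a persistent directory and
# point its PIDFile= line at the PID file the daemon really writes.
# persist_unit is a hypothetical helper, not part of systemd or Slurm.

persist_unit() {
    src="$1"        # generated unit, e.g. /run/systemd/generator.late/slurm.service
    dest_dir="$2"   # persistent unit directory, e.g. /etc/systemd/system
    pidfile="$3"    # PID file path (assumption: the value of SlurmdPidFile)

    mkdir -p "$dest_dir" || return 1
    # Rewrite only the PIDFile= line; all other lines are copied unchanged.
    sed "s|^PIDFile=.*|PIDFile=$pidfile|" "$src" > "$dest_dir/$(basename "$src")"
}

# Possible use as root, followed by a reload so systemd rereads its units:
#   persist_unit /run/systemd/generator.late/slurm.service \
#       /etc/systemd/system "$SLURMD_PIDFILE"
#   systemctl daemon-reload
```

Units in /etc/systemd/system take precedence over the generated ones under /run/systemd/generator.late, so the copy should win after a reload and should survive a reboot.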

HTH,
Jim

On Tue, 2 Jun 2020, Ferran Planas Padros wrote:

> Hi,
>
>
> Thanks for your answer,
>
>
> However, I am setting up a compute node, not the master node, and thus I have not installed slurmctld on it.
>
>
> After some digging, I have found that all these files:
>
> /run/systemd/generator.late/slurm.service
> /run/systemd/generator.late/runlevel5.target.wants/slurm.service
> /run/systemd/generator.late/runlevel4.target.wants/slurm.service
> /run/systemd/generator.late/runlevel3.target.wants/slurm.service
> /run/systemd/generator.late/runlevel2.target.wants/slurm.service
>
> which are copies of each other and are generated by systemd-sysv-generator, point to slurmctld.pid, not to slurm.pid:
>
>
> [Unit]
> Documentation=man:systemd-sysv-generator(8)
> SourcePath=/etc/rc.d/init.d/slurm
> Description=LSB: slurm daemon management
> Before=runlevel2.target
> Before=runlevel3.target
> Before=runlevel4.target
> Before=runlevel5.target
> Before=shutdown.target
> After=remote-fs.target
> After=network-online.target
> After=munge.service
> After=nss-lookup.target
> After=network-online.target
> Wants=network-online.target
> Conflicts=shutdown.target
>
> [Service]
> Type=forking
> Restart=no
> TimeoutSec=5min
> IgnoreSIGPIPE=no
> KillMode=process
> GuessMainPID=no
> RemainAfterExit=no
> PIDFile=/var/run/slurmctld.pid
> ExecStart=/etc/rc.d/init.d/slurm start
> ExecStop=/etc/rc.d/init.d/slurm stop
>
>
>
> How can I avoid this, other than by editing the files manually? Those edits revert to the original after a reboot.
>
>
> Thanks,
>
> Ferran
>
>
> ________________________________
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Rodrigo Santibáñez <rsantibanez.uchile at gmail.com>
> Sent: Tuesday, June 2, 2020 6:40:48 PM
> To: Slurm User Community List
> Subject: Re: [slurm-users] Problem with permisions. CentOS 7.8
>
> Yes, you have both daemons, installed with the slurm rpm. The slurmd daemon (all nodes) communicates with slurmctld (which runs on the main master node and, optionally, on a backup node).
>
> You do not need to run slurmd as the slurm user. Use `systemctl enable slurmctld` (and slurmd) followed by `systemctl start slurmctld`. If you change the configuration, use restart instead of start only if `sudo scontrol reconfigure` asks for it.
>
> If you run `slurmctld -Dvvvv` and `slurmd -Dvvvv` as root, the debug output will show any further configuration problems. The slurmd daemon needs slurmctld running, or it will output "error: Unable to register: Unable to contact slurm controller (connect failure)".
>
> You should find the services here:
> -rw-r--r-- 1 root root 339 may 30 20:18 /usr/lib/systemd/system/slurmctld.service
> -rw-r--r-- 1 root root 342 may 30 20:18 /usr/lib/systemd/system/slurmdbd.service
> -rw-r--r-- 1 root root 398 may 30 20:18 /usr/lib/systemd/system/slurmd.service
>
> Feel free to ask for more information,
> Best regards
>
> On Tue, 2 Jun 2020 at 11:12, Ferran Planas Padros (<ferran.padros at su.se>) wrote:
>
>
> Hi Ole,
>
>
> Thanks for your answer and your time. I'd appreciate it if you, or someone else, could take a final look at my case.
>
> After your suggestions and comments, I have redone the whole installation for Munge and Slurm. I uninstalled and removed all previous rpms and restarted from scratch. Munge works with no problem; however, the same is not true for Slurm (for which I have used the instructions given in the link you attached).
>
>
> - If I run /usr/bin/slurmd -D vvvvv as root user, I get verbose output up to the line 'slurmd: debug2: No acct_gather.conf file (/etc/slurm/acct_gather.conf)', where the output stops. After I press Ctrl+C, I get:
>
>
> slurmd: all threads complete
>
> slurmd: Consumable Resources (CR) Node Selection plugin shutting down ...
>
> slurmd: Munge cryptographic signature plugin unloaded
>
> slurmd: Slurmd shutdown completing
>
>
> - After that, if I run 'systemctl start slurmd' and 'systemctl status slurmd', also as root user, I get:
>
> ● slurmd.service - Slurm node daemon
>   Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: disabled)
>   Active: active (running) since Tue 2020-06-02 16:53:51 CEST; 33s ago
>  Process: 2750 ExecStart=/usr/sbin/slurmd -d /usr/sbin/slurmstepd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
> Main PID: 2752 (slurmd)
>   CGroup: /system.slice/slurmd.service
>           └─2752 /usr/sbin/slurmd -d /usr/sbin/slurmstepd
>
> Jun 02 16:53:51 roos21.organ.su.se systemd[1]: Starting Slurm node daemon...
> Jun 02 16:53:51 roos21.organ.su.se systemd[1]: Can't open PID file /var/run/slurm/slurmd.pid (yet?) after start: No such file or directory
> Jun 02 16:53:51 roos21.organ.su.se systemd[1]: Started Slurm node daemon.
>
>
> - Next, I kill the slurmd process and, as the slurm user, run 'systemctl start slurm', which does not work and produces the following in journalctl -xe:
>
>
> Jun 02 16:56:01 roos21.organ.su.se systemd[1]: Starting LSB: slurm daemon management...
> -- Subject: Unit slurm.service has begun start-up
> -- Defined-By: systemd
> -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
> --
> -- Unit slurm.service has begun starting up.
> Jun 02 16:56:01 roos21.organ.su.se slurm[2805]: starting slurmd: [  OK  ]
> Jun 02 16:56:01 roos21.organ.su.se systemd[1]: Can't open PID file /var/run/slurmctld.pid (yet?) after start: No such file or directory
> Jun 02 16:56:37 roos21.organ.su.se polkitd[1316]: Unregistered Authentication Agent for unix-process:2792:334647 (system bus name :1.46, object path /org/freedesktop
> Jun 02 16:56:38 roos21.organ.su.se sudo[2790]: pam_unix(sudo:session): session closed for user slurm
>
>
> This is something I don't really understand, because I have not installed slurmctld. The slurmctld.service file does not even exist.
>
>
> Any idea?
>
>
> Many thanks,
>
> Ferran
>
>
>
> ________________________________
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk>
> Sent: Tuesday, June 2, 2020 12:03:27 PM
> To: Slurm User Community List
> Subject: Re: [slurm-users] Problem with permisions. CentOS 7.8
>
> Hi Ferran,
>
> Please install Slurm software in the standard way, see
> https://wiki.fysik.dtu.dk/niflheim/Slurm_installation
>
> It seems that you have some unusual way to manage your Linux systems.  In
> Stockholm and Sweden there are many Slurm experts at the HPC centers who
> might be able to help you more directly.
>
> Best regards,
> Ole
>
> On 6/2/20 11:58 AM, Ferran Planas Padros wrote:
>> I did a fresh installation with the EPEL repo, and installing munge from
>> it and it worked. To have the slurm user for munge was definitely a
>> problem, but that is the set up we have on the CentOS 6. Now I've learnt
>> my lesson for future installations, thanks to everyone!
>>
>>
>> Now, I have a follow up question, if you don't mind. I am now trying to
>> run slurm, and it crashes:
>>
>>
>> [root@roos21 ~]# systemctl status slurm.service
>> ● slurm.service - LSB: slurm daemon management
>> Loaded: loaded (/etc/rc.d/init.d/slurm; bad; vendor preset: disabled)
>> Active: failed (Result: protocol) since Tue 2020-06-02 11:45:33 CEST; 3min 33s ago
>> Docs: man:systemd-sysv-generator(8)
>>
>> Jun 02 11:45:33 roos21.organ.su.se systemd[1]: Starting LSB: slurm daemon management...
>> Jun 02 11:45:33 roos21.organ.su.se slurm[18223]: starting slurmd: [OK]
>> Jun 02 11:45:33 roos21.organ.su.se systemd[1]: Can't open PID file /var/run/slurmctld.pid (yet?) after start: No such file or directory
>> Jun 02 11:45:33 roos21.organ.su.se systemd[1]: Failed to start LSB: slurm daemon management.
>> Jun 02 11:45:33 roos21.organ.su.se systemd[1]: Unit slurm.service entered failed state.
>> Jun 02 11:45:33 roos21.organ.su.se systemd[1]: slurm.service failed.
>>
>>
>>
>> The thing is that this is a computing node, not the master node, so
>> slurmctld is not installed. Why do I get this error?
>>
>>
>> Many thanks, and my apologies for these rather simple questions. I am a
>> newbie at this.
>>
>>
>> Best,
>>
>> Ferran
>>
>> --------------------------------------------------------------------------
>> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
>> Renata Maria Dart <renata at slac.stanford.edu>
>> *Sent:* Friday, May 29, 2020 6:33:58 PM
>> *To:* Ole.H.Nielsen at fysik.dtu.dk; Slurm User Community List
>> *Subject:* Re: [slurm-users] Problem with permisions. CentOS 7.8
>> Hi, don't know if this might be your problem but I ran into an issue
>> on centos 7.8 where /var/run/munge was not being created at boottime
>> because I didn't have the munge user in the local password file.  I
>> have the munge user in AD and once the system is up I can start munge
>> successfully, but AD wasn't available early enough during boot for the
>> munge startup to see it.  I added these lines to the munge systemd unit
>> file:
>>
>> PermissionsStartOnly=true
>> ExecStartPre=-/usr/bin/mkdir -m 0755 -p /var/run/munge
>> ExecStartPre=-/usr/bin/chown -R munge:munge /var/run/munge
>>
>> and my system now starts munge up fine during a reboot.
>>
>> Renata
>>
>> On Fri, 29 May 2020, Ole Holm Nielsen wrote:
>>
>>> Hi Ferran,
>>>
>>> When you have a CentOS 7 system with the EPEL repo enabled, and you have
>>> installed the munge RPM from EPEL, then things should be working correctly.
>>>
>>> Since systemctl tells you that Munge service didn't start correctly, then it
>>> seems to me that you have a problem in the general configuration of your CentOS
>>> 7 system.  You should check /var/log/messages and "journalctl -xe" for munge
>>> errors.  It is really hard for other people to guess what may be wrong in your
>>> system.
>>>
>>> My 2 cents worth: Maybe you could make a fresh CentOS 7.8 installation on a
>>> test system and install the Munge service (and nothing else) according to
>>> instructions in https://wiki.fysik.dtu.dk/niflheim/Slurm_installation.  This
>>> *really* has got to work!
>>>
>>> /Ole
>>>
>>>
>>> On 29-05-2020 10:23, Ferran Planas Padros wrote:
>>>> Hello everyone,
>>>>
>>>>
>>>> Here is everything I've done.
>>>>
>>>>
>>>> - About Ole's answer:
>>>>
>>>> Yes, we have slurm as the user to control munge. Following your comment, I
>>>> have changed the ownership of the munge files and tried to start munge as
>>>> munge user. However, it also failed.
>>>>
>>>> Also, I first installed munge from a repository. I've seen your suggestion of
>>>> installing from EPEL, so I uninstalled it and installed it again, with the same result.
>>>>
>>>> - About SELinux: It is disabled.
>>>>
>>>> - The output of ps -ef | grep munge is:
>>>>
>>>>
>>>> root 5340 5153 0 10:18 pts/0 00:00:00 grep --color=auto munge
>>>>
>>>>
>>>> - The output of munge -n is:
>>>>
>>>>
>>>> Failed to access "/var/run/munge/munge.socket.2": No such file or directory
>>>>
>>>>
>>>> - Same for unmunge
>>>>
>>>>
>>>> - Output for sudo systemctl status --full munge
>>>>
>>>>
>>>> ● munge.service - MUNGE authentication service
>>>> Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
>>>> Active: failed (Result: exit-code) since Fri 2020-05-29 10:15:52 CEST; 4min 18s ago
>>>> Docs: man:munged(8)
>>>> Process: 5333 ExecStart=/usr/sbin/munged (code=exited, status=1/FAILURE)
>>>>
>>>> May 29 10:15:52 roos21.organ.su.se systemd[1]: Starting MUNGE authentication service...
>>>> May 29 10:15:52 roos21.organ.su.se systemd[1]: munge.service: control process exited, code=exited status=1
>>>> May 29 10:15:52 roos21.organ.su.se systemd[1]: Failed to start MUNGE authentication service.
>>>> May 29 10:15:52 roos21.organ.su.se systemd[1]: Unit munge.service entered failed state.
>>>> May 29 10:15:52 roos21.organ.su.se systemd[1]: munge.service failed.
>>>>
>>>>
>>>> - Regarding NTP, I get this message:
>>>>
>>>>
>>>> Unable to talk to NTP daemon. Is it running?
>>>>
>>>>
>>>> It is the same message I get on the nodes that DO work. All nodes are synchronized
>>>> in time and date with the central node.
>>>>
>>>>
>>>> ------------------------------------------------------------------------
>>>> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Ole
>>>> Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk>
>>>> *Sent:* Friday, May 29, 2020 9:56:10 AM
>>>> *To:* slurm-users at lists.schedmd.com
>>>> *Subject:* Re: [slurm-users] Problem with permisions. CentOS 7.8
>>>> On 29-05-2020 08:46, Sudeep Narayan Banerjee wrote:
>>>>> also check:
>>>>> a) whether NTP has been setup and communicating with master node
>>>>> b) iptables may be flushed (iptables -L)
>>>>> c) SELinux set to disabled; to check:
>>>>> getenforce
>>>>> vim /etc/sysconfig/selinux
>>>>> (change SELINUX=enforcing to SELINUX=disabled and save the file and reboot)
>>>>
>>>> There is no reason to disable SELinux for running the Munge service.
>>>> It's a pretty bad idea to lower the security just for the sake of
>>>> convenience!
>>>>
>>>> /Ole
>>>>
>>>>
>>>>> On Fri, May 29, 2020 at 12:08 PM Sudeep Narayan Banerjee
>>>>> <snbanerjee at iitgn.ac.in> wrote:
>>>>>
>>>>>      I have not checked on the CentOS7.8
>>>>>      a) if /var/run/munge folder does not exist then please double check
>>>>>      whether munge has been installed or not
>>>>>      b) user root or sudo user to do
>>>>>      ps -ef | grep munge
>>>>>      kill -9 <PID> //where PID is the Process ID for munge (if the
>>>>>      process is running at all); else
>>>>>
>>>>>      which munged
>>>>>      /etc/init.d/munge start
>>>>>
>>>>>      please let me know the output of:
>>>>>
>>>>>      $ munge -n
>>>>>
>>>>>      $ munge -n | unmunge
>>>>>
>>>>>      $ sudo systemctl status --full munge
>>>>>
>>>>>
>>>>>      Thanks & Regards,
>>>>>      Sudeep Narayan Banerjee
>>>>>      System Analyst | Scientist B
>>>>>      Indian Institute of Technology Gandhinagar
>>>>>      Gujarat, INDIA
>>>>>
>>>>>
>>>>>      On Fri, May 29, 2020 at 11:55 AM Bjørn-Helge Mevik
>>>>>      <b.h.mevik at usit.uio.no> wrote:
>>>>>
>>>>>          Ferran Planas Padros <ferran.padros at su.se> writes:
>>>>>
>>>>>          > I run the command as slurm user, and the /var/log/munge
>>>>>          folder does belong to slurm.
>>>>>
>>>>>          For security reasons, I strongly advise that you run munged as a
>>>>>          separate user, which is unprivileged and not used for anything else.
>>>>>
>>>>>          --          Regards,
>>>>>          Bjørn-Helge Mevik, dr. scient,
>>>>>          Department for Research Computing, University of Oslo
>
>

James E. Prewett                    Jim at Prewett.org download at hpc.unm.edu
Systems Team Leader           LoGS: http://www.hpc.unm.edu/~download/LoGS/
Designated Security Officer         OpenPGP key: pub 1024D/31816D93
HPC Systems Engineer III   UNM HPC  505.277.8210


