[slurm-users] [External] Slurm Configuration assistance: Unable to use srun after installation (slurm on fedora 33)

Johnsy K. John johnsyjohnk at gmail.com
Mon Apr 19 12:17:55 UTC 2021


Hi Florian,
Thanks for the valuable reply and help.

My answers are inline below.

- Do you have an active support contract with SchedMD? AFAIK they only
offer paid support.

I don't have an active support contract. I just started learning Slurm by
installing it on my Fedora machine. This is the first time I am installing
and experimenting with software of this kind.


- The error message is pretty straightforward: slurmctld is not running.
Did you start it (systemctl start slurmctld)?

I did: systemctl start slurmctld
and got this message:
Failed to start slurmctld.service: Unit slurmctld.service not found.

- slurmd needs to run on the node(s) you want to run jobs on as well. I'm
guessing you are using localhost for the controller and want to run jobs on
localhost, so both slurmctld and slurmd need to be running on localhost.

systemctl start slurmd

Failed to start slurmd.service: Unit slurmd.service not found.

Same as with slurmctld.
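
My guess is that "make install" from the tarball does not install the systemd
unit files, and that ./configure generates slurmctld.service and slurmd.service
under etc/ in the build tree. Is copying them into systemd's search path,
roughly like this, the right approach (the paths assume my build directory)?

# Copy the unit files generated by ./configure (paths assume my build tree).
cp /root/installations/slurm-20.11.5/etc/slurmctld.service /etc/systemd/system/
cp /root/installations/slurm-20.11.5/etc/slurmd.service /etc/systemd/system/

# Make systemd re-read its unit files, then enable and start both daemons.
systemctl daemon-reload
systemctl enable --now slurmctld
systemctl enable --now slurmd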

- Is munge running?

Yes. Here is the status:

[johnsy at homepc ~]$ systemctl status munge
munge.service - MUNGE authentication service
     Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
     Active: active (running) since Mon 2021-04-19 07:49:13 EDT; 13min ago
       Docs: man:munged(8)
    Process: 1070 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
   Main PID: 1072 (munged)
      Tasks: 4 (limit: 76969)
     Memory: 1.4M
        CPU: 8ms
     CGroup: /system.slice/munge.service
             └─1072 /usr/sbin/munged

(It is always enabled after a restart; this output was captured just after a restart.)
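
Would a credential round-trip like the one below be a useful extra check that
munge itself works? (This is just my understanding of how to test it.)

# Encode a credential and decode it on the same host;
# a healthy setup should report STATUS: Success.
munge -n | unmunge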


- May I ask why you're chown-ing pid and logfiles? The slurm user
(typically "slurm") needs to have access to those files. Munge for instance
checks for ownership and complains if something is not correct.

I tried to follow the instructions in:
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#copy-slurm-conf-to-all-nodes
I thought that, since I am installing Slurm as root, the user "johnsy" had to
be given ownership of those files.
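
If a dedicated account is the recommended setup, I assume the ownership should
instead match whatever SlurmUser is set to in slurm.conf, along these lines
(only a sketch; the "slurm" account name is my assumption):

# Create a dedicated slurm account and give it the state and log directories.
useradd --system --shell /sbin/nologin slurm
mkdir -p /var/spool/slurmctld /var/log/slurm
chown slurm: /var/spool/slurmctld /var/log/slurm
chmod 755 /var/spool/slurmctld /var/log/slurm
# SlurmUser=slurm in slurm.conf would then match the directory owner.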

- "srun /proc/cpuinfo" will fail, even if slurmctld and slurmd are running,
because /proc/cpuinfo is not an executable file. You may want to insert
"cat" after srun. Another simple test would be "srun hostname"

I tried: srun hostname
and got the following error message:
srun: error: Unable to allocate resources: Unable to contact slurm controller (connect failure)


I also tried:
systemctl status slurmctld
Unit slurmctld.service could not be found.
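
Once slurmctld and slurmd are actually running, I assume checks along these
lines would tell me whether the controller is reachable (just my guess at a
sensible sequence):

# Ask the controller directly whether it responds.
scontrol ping

# Show the partitions and nodes as the controller sees them.
sinfo

# Confirm something is listening on SlurmctldPort from slurm.conf (6817).
ss -tlnp | grep 6817

# If not, the controller's own log should say why it failed to start.
journalctl -u slurmctld -n 50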


I also tried installing the packaged version
(https://src.fedoraproject.org/rpms/slurm) using dnf, and the same problem
exists.
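
For the packaged route, my understanding (the exact subpackage names are my
assumption from the Fedora packaging) is that the daemons live in separate
subpackages and read their configuration from /etc/slurm, roughly:

# Install the client tools plus the controller and node daemons.
dnf install slurm slurm-slurmctld slurm-slurmd

# The packaged daemons look for the configuration under /etc/slurm.
cp slurm.conf /etc/slurm/slurm.conf

# Enable and start munge and the Slurm daemons.
systemctl enable --now munge slurmctld slurmd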

Any help in this regard will be appreciated.

Thanks a lot.
Johnsy


On Mon, Apr 19, 2021 at 5:04 AM Florian Zillner <fzillner at lenovo.com> wrote:

> Hi Johnsy,
>
>    1. Do you have an active support contract with SchedMD? AFAIK they
>    only offer paid support.
>    2. The error message is pretty straightforward: slurmctld is not
>    running. Did you start it (systemctl start slurmctld)?
>    3. slurmd needs to run on the node(s) you want to run jobs on as well.
>    I'm guessing you are using localhost for the controller and want to run
>    jobs on localhost, so both slurmctld and slurmd need to be running on localhost.
>    4. Is munge running?
>    5. May I ask why you're chown-ing pid and logfiles? The slurm user
>    (typically "slurm") needs to have access to those files. Munge for instance
>    checks for ownership and complains if something is not correct.
>    6. "srun /proc/cpuinfo" will fail, even if slurmctld and slurmd are
>    running, because /proc/cpuinfo is not an executable file. You may want to
>    insert "cat" after srun. Another simple test would be "srun hostname"
>
> And, just my personal opinion, if this is your first experiment with
> Slurm, I wouldn't change too much right from the beginning but instead get
> it working first and then change things to your needs. Slurm is also
> available in the EPEL repos, so you could install it using dnf and
> experiment with the packaged version.
>
> Hope this helps,
> Florian
>
>
> ------------------------------
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
> Johnsy K. John <johnsyjohnk at gmail.com>
> *Sent:* Monday, 19 April 2021 01:43
> *To:* sales at schedmd.com <sales at schedmd.com>; johnsy john <
> johnsyjohnk at gmail.com>; slurm-users at schedmd.com <slurm-users at schedmd.com>
> *Subject:* [External] [slurm-users] Slurm Configuration assistance:
> Unable to use srun after installation (slurm on fedora 33)
>
> Hello SchedMD team,
>
> I would like to use your Slurm workload manager for learning purposes.
> I tried installing the software (downloaded from:
> https://www.schedmd.com/downloads.php ) and followed the steps
> mentioned in:
>
> https://slurm.schedmd.com/download.html
> https://slurm.schedmd.com/quickstart_admin.html
>
>
> My Linux OS is Fedora 33, and I tried installing it while logged in as root.
> After installation and configuration as described at:
> https://slurm.schedmd.com/quickstart_admin.html
> I got errors when I tried to run srun.
> Details about the installation and usage are as follows:
>
> Using root permissions, I copied the tarball to /root/installations/, then:
>
> cd /root/installations/
>
> tar --bzip -x -f slurm-20.11.5.tar.bz2
>
> cd slurm-20.11.5/
>
> ./configure --enable-debug --prefix=/usr/local --sysconfdir=/usr/local/etc
>
> make
> make install
>
>
> Following steps are based on
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration
>
> mkdir /var/spool/slurmctld /var/log/slurm
> chown johnsy /var/spool/slurmctld
> chown johnsy /var/log/slurm
> chmod 755 /var/spool/slurmctld /var/log/slurm
>
>  cp /var/run/slurmctld.pid /var/run/slurmd.pid
>
> touch /var/log/slurm/slurmctld.log
> chown johnsy /var/log/slurm/slurmctld.log
>
> touch /var/log/slurm/slurm_jobacct.log /var/log/slurm/slurm_jobcomp.log
> chown johnsy /var/log/slurm/slurm_jobacct.log
> /var/log/slurm/slurm_jobcomp.log
>
> ldconfig -n /usr/lib64
>
>
> Then, when I tried an example command:
>
> srun /proc/cpuinfo
>
>
> I get the following error:
>
> srun: error: Unable to allocate resources: Unable to contact slurm
> controller (connect failure)
>
>
>
> The slurm.conf configuration file that I created is:
>
>
> ######################################################################################################
>
>
> ######################################################################################################
>
> # slurm.conf file generated by configurator.html.
>
> # Put this file on all nodes of your cluster.
>
> # See the slurm.conf man page for more information.
>
> #
>
> SlurmctldHost=homepc
>
> #SlurmctldHost=
>
> #
>
> #DisableRootJobs=NO
>
> #EnforcePartLimits=NO
>
> #Epilog=
>
> #EpilogSlurmctld=
>
> #FirstJobId=1
>
> #MaxJobId=999999
>
> #GresTypes=
>
> #GroupUpdateForce=0
>
> #GroupUpdateTime=600
>
> #JobFileAppend=0
>
> #JobRequeue=1
>
> #JobSubmitPlugins=1
>
> #KillOnBadExit=0
>
> #LaunchType=launch/slurm
>
> #Licenses=foo*4,bar
>
> #MailProg=/bin/mail
>
> #MaxJobCount=5000
>
> #MaxStepCount=40000
>
> #MaxTasksPerNode=128
>
> MpiDefault=none
>
> #MpiParams=ports=#-#
>
> #PluginDir=
>
> #PlugStackConfig=
>
> #PrivateData=jobs
>
> ProctrackType=proctrack/cgroup
>
> #Prolog=
>
> #PrologFlags=
>
> #PrologSlurmctld=
>
> #PropagatePrioProcess=0
>
> #PropagateResourceLimits=
>
> #PropagateResourceLimitsExcept=
>
> #RebootProgram=
>
> ReturnToService=1
>
> SlurmctldPidFile=/var/run/slurmctld.pid
>
> SlurmctldPort=6817
>
> SlurmdPidFile=/var/run/slurmd.pid
>
> SlurmdPort=6818
>
> SlurmdSpoolDir=/var/spool/slurmd
>
> SlurmUser=johnsy
>
> #SlurmdUser=root
>
> #SrunEpilog=
>
> #SrunProlog=
>
> StateSaveLocation=/var/spool
>
> SwitchType=switch/none
>
> #TaskEpilog=
>
> TaskPlugin=task/affinity
>
> #TaskProlog=
>
> #TopologyPlugin=topology/tree
>
> #TmpFS=/tmp
>
> #TrackWCKey=no
>
> #TreeWidth=
>
> #UnkillableStepProgram=
>
> #UsePAM=0
>
> #
>
> #
>
> # TIMERS
>
> #BatchStartTimeout=10
>
> #CompleteWait=0
>
> #EpilogMsgTime=2000
>
> #GetEnvTimeout=2
>
> #HealthCheckInterval=0
>
> #HealthCheckProgram=
>
> InactiveLimit=0
>
> KillWait=30
>
> #MessageTimeout=10
>
> #ResvOverRun=0
>
> MinJobAge=300
>
> #OverTimeLimit=0
>
> SlurmctldTimeout=120
>
> SlurmdTimeout=300
>
> #UnkillableStepTimeout=60
>
> #VSizeFactor=0
>
> Waittime=0
>
> #
>
> #
>
> # SCHEDULING
>
> #DefMemPerCPU=0
>
> #MaxMemPerCPU=0
>
> #SchedulerTimeSlice=30
>
> SchedulerType=sched/backfill
>
> SelectType=select/cons_tres
>
> SelectTypeParameters=CR_Core
>
> #
>
> #
>
> # JOB PRIORITY
>
> #PriorityFlags=
>
> #PriorityType=priority/basic
>
> #PriorityDecayHalfLife=
>
> #PriorityCalcPeriod=
>
> #PriorityFavorSmall=
>
> #PriorityMaxAge=
>
> #PriorityUsageResetPeriod=
>
> #PriorityWeightAge=
>
> #PriorityWeightFairshare=
>
> #PriorityWeightJobSize=
>
> #PriorityWeightPartition=
>
> #PriorityWeightQOS=
>
> #
>
> #
>
> # LOGGING AND ACCOUNTING
>
> #AccountingStorageEnforce=0
>
> #AccountingStorageHost=
>
> #AccountingStoragePass=
>
> #AccountingStoragePort=
>
> AccountingStorageType=accounting_storage/none
>
> #AccountingStorageUser=
>
> AccountingStoreJobComment=YES
>
> ClusterName=cluster
>
> #DebugFlags=
>
> #JobCompHost=
>
> #JobCompLoc=
>
> #JobCompPass=
>
> #JobCompPort=
>
> JobCompType=jobcomp/none
>
> #JobCompUser=
>
> #JobContainerType=job_container/none
>
> JobAcctGatherFrequency=30
>
> JobAcctGatherType=jobacct_gather/none
>
> SlurmctldDebug=info
>
> #SlurmctldLogFile=
>
> SlurmdDebug=info
>
> #SlurmdLogFile=
>
> #SlurmSchedLogFile=
>
> #SlurmSchedLogLevel=
>
> #
>
> #
>
> # POWER SAVE SUPPORT FOR IDLE NODES (optional)
>
> #SuspendProgram=
>
> #ResumeProgram=
>
> #SuspendTimeout=
>
> #ResumeTimeout=
>
> #ResumeRate=
>
> #SuspendExcNodes=
>
> #SuspendExcParts=
>
> #SuspendRate=
>
> #SuspendTime=
>
> #
>
> #
>
> # COMPUTE NODES
>
> NodeName=localhost CPUs=12 Sockets=1 CoresPerSocket=6 ThreadsPerCore=2
> State=UNKNOWN
>
> PartitionName=debug Nodes=localhost Default=YES MaxTime=INFINITE State=UP
>
>
> ######################################################################################################
>
>
> ######################################################################################################
>
>

