Hello Slurm Folks,
I have a weird issue where, on the same server (which acts as both a controller and a node), slurmctld can’t find cred_munge.so:
slurmctld: debug3: Trying to load plugin /app/slurm-24.0.8/lib/slurm/cred_munge.so
slurmctld: debug4: /app/slurm-24.0.8/lib/slurm/cred_munge.so: Does not exist or not a regular file.
slurmctld: error: Couldn't find the specified plugin name for cred/munge looking at all files
slurmctld: error: cannot open plugin directory /app/slurm-24.0.8/lib/slurm
slurmctld: error: cannot find cred plugin for cred/munge
slurmctld: error: cannot create cred context for cred/munge
slurmctld: fatal: failed to initialize cred plugin
But slurmd can:
slurmd: debug3: Trying to load plugin /app/slurm-24.0.8/lib/slurm/cred_munge.so
slurmd: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Munge credential signature plugin type:cred/munge version:0x180800
slurmd: cred/munge: init: Munge credential signature plugin loaded
slurmd: debug3: Success.
This is on Ubuntu 20.04 and happens both with Slurm 20.11.09 and 24.0.8
Thank you,
Jesse
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=prod-cluster
SlurmctldHost=controller
#
#MailProg=/bin/mail
#MpiDefault=
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
#SwitchType=
TaskPlugin=task/affinity,task/cgroup
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageType=
#JobAcctGatherFrequency=30
#JobAcctGatherType=
#SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
#SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
#
#
# COMPUTE NODES
NodeName=controller CPUs=1 State=UNKNOWN
NodeName=node CPUs=1 State=UNKNOWN
PartitionName=prod-part Nodes=ALL Default=YES MaxTime=INFINITE State=UP
slurmctld runs as the user slurm, whereas slurmd runs as root.
Make sure the permissions on /app/slurm-24.0.8/lib/slurm allow the user slurm to read the files.
For example, you could run (as root):
sudo -u slurm ls /app/slurm-24.0.8/lib/slurm
and check whether the slurm user can read the directory (as well as the libraries within it).
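If the ls above fails, the blocking permission may sit on any directory along the path, not only on the plugin directory itself. The following is a minimal sketch of how one might walk the path and test each component as the slurm user. The function names list_ancestors and check_plugin_access are hypothetical helpers made up for this illustration and are not part of Slurm; the sketch assumes sudo is available and that it is run as root. Note that the plugin directory itself also needs read permission so that it can be listed, which this sketch does not check separately.

```shell
#!/bin/sh
# Hypothetical helper: print a path and each of its parent directories,
# one per line, ending at / (or . for a relative path).
list_ancestors() {
    p=$1
    while [ "$p" != "/" ] && [ "$p" != "." ]; do
        printf '%s\n' "$p"
        p=$(dirname "$p")
    done
    printf '%s\n' "$p"
}

# Hypothetical helper: report which components of a path the given user
# cannot access. Directories need search (x) permission to be traversed;
# the final file needs read (r) permission.
check_plugin_access() {
    user=$1
    target=$2
    list_ancestors "$target" | while IFS= read -r p; do
        if [ -d "$p" ]; then
            sudo -u "$user" test -x "$p" || echo "no search permission: $p"
        else
            sudo -u "$user" test -r "$p" || echo "not readable: $p"
        fi
    done
}

# Example (run as root on the controller):
#   check_plugin_access slurm /app/slurm-24.0.8/lib/slurm/cred_munge.so
```

No output from check_plugin_access would suggest that every component is accessible to that user; any line printed points at the component to inspect with ls -ld.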
Sean
Hi Sean,
Thank you! It was a permissions issue and it’s not complaining anymore about cred/munge.
I appreciate your help.
Thanks,
Jesse
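The thread does not say which permissions were changed, so the following is only a sketch of one common way to resolve this kind of problem, assuming the plugin tree was simply not readable by non-root users. The helper name make_world_readable is hypothetical. It uses chmod with the symbolic mode a+rX, which adds read permission for everyone and adds execute/search permission only to directories (and to files that already have an execute bit set). Parent directories such as /app may need search permission as well, which this sketch does not touch.

```shell
#!/bin/sh
# Hypothetical helper: recursively grant read access (and directory
# search access) to all users under the given directory tree.
make_world_readable() {
    chmod -R a+rX "$1"
}

# Example (run as root; the path is the plugin directory from this thread):
#   make_world_readable /app/slurm-24.0.8/lib/slurm
```

Afterwards, repeating the sudo -u slurm ls check from Sean's reply should show whether the slurm user can now list the directory.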
On Jan 23, 2024, at 18:14, Jesse Aiton jesse@clarkeconsulting.com wrote:
This is on Ubuntu 20.04 and happens both with Slurm 20.11.09 and 24.0.8
I’m not sure what version you’re actually running, but I don’t believe there is a 24.0.8. The latest version I’m aware of is 23.11.2.
--
#BlackLivesMatter
Ryan Novosielski - novosirj@rutgers.edu
Sr. Technologist - 973/972.0922 (2x0922) - RBHS Campus
Office of Advanced Research Computing - MSB A555B, Newark
Rutgers, the State University of NJ
Yeah, 24.0.8 is the bleeding edge version. I wanted to try the latest in case it was a bug in 20.x.x. I’m happy to go back to any older Slurm version but I don’t think that will matter much if the issue occurs on both Slurm 20 and Slurm 24.
git clone https://github.com/SchedMD/slurm.git

Thanks,
Jesse
Ah, I see — no, it’s 24.08. That’s why I didn’t find any reference to it.
Carry on! :-D